Re: [Neo4j] shortestPath slower than it could be

John cyuczieekc Sat, 23 Jul 2011 05:29:27 -0700

Hail :)
Beware, lot of noise follows (ie. webnoise kinda noise), while I thank
anyone in advance for reading, I must also apologize for failing to be
concise :-"
On Sat, Jul 23, 2011 at 10:42 AM, Michael Hunger <
michael.hun...@neotechnology.com> wrote:


> John,
>
> no problem :)
>
> You pointed out both problems:
> - cold caches
> - lots of rels on the one side
>
> There are some performance implications of loading millions of rels from
> disk. We'll be working on improving those in 1.5.
>
> The easiest way to solve that is to switch start and end-node which you
> already did. It is much easier for you _with domain knowledge_ about the
> graph than for the algorithm.
>
Yeah I would probably apply that manually, that is: call a particular method
when I know which one of the nodes has least rels. (as a ' `temporary` or
`acceptable so far` workaround ' [u see the graph here?! xD]* I will for now
use findSinglePath as it is*, simply because I don't want the overhead of
using a particular method for some use case scenario and the generic one for
others - besides it's fast enough, for now)
 My problem here is that I might not know (in the future) which node (first
or second) will have the least nodes. So I was hoping to create a generic
method for such find.
 For now, I could say, I can know.
For example:
AllLists->A
AllLists->B
AllLists->C
where A,B,C are considered nodes representing lists, in this case, I could
maybe say that while AllLists can have lots of outgoing rels (aka lots of
lists), list A may not have that many nodes pointing to it (aka list is not
used in as many places as there are lists), so I could consider A as being
node with least nodes.
Though I would want a generic method that would always know which node has
least nodes and parse that one, especially in my 1 hop case.
  I am, as seen here, basically tagging the nodes A,B,C with AllLists, by
having AllLists point to them, and this way, I know that node A is of
AllLists type (so to speak). This was my old way of thinking while using bdb
je, I might need to revise it since then I was completely ignoring
relationships and not considering them as a single and accessible object. (I
am currently trying to port that old way of thinking to neo4j)
  In this old way, I was using the node AllLists pointing/outgoing to A as
being (in neo4j) a relationship of the type AllLists incoming to A.

>
> Especially if you have > 1 hop paths.
>
In my particular use case, I will probably not (yet!) have cases with >1
hop. (I can't really visualize that far in the future xD ) but for sure I
will have lots of ==1 hops cases where I just want to see if thisnode is
tagged with thatnode by checking thatnode-->thisnode

>
> There might be ways in improving the algorithm, e.g. by iterating both
> sides at the same time, which would lead to the end with the fewer
> relationships being exhausted and resolved first.
>
for now, I totally avoided getting into the findSinglePath 's source and
considering doing changes there or even seeing how it works:)
But that algo sounds like something I'd do, in like 2 (reusable) threads
even xD (I remember having considered this in the past, when using bdb je
but I opted to parse the node with least rels instead)

>
> How large is your graph at all? And how is it structured, e.g. how many
> rels do the other 9 nodes have that your second node points to?
>
I was merely trying this as an example(but this should be a practical use
case in my project), the thing is, the other 9 nodes can have any number or
rels, in fact *any node could have any number of rels incoming and outgoing
from it*,... (but in this case, the 9 nodes each had a new unused node
point(outgoing) to them; I'd post the source but i'm already using wrappers
and the whole example would be more than 1 java file, but I can link to it:
https://github.com/13th-floor/neo4john/blob/52e2d75a39a28200b34032aaf45a5a09c1e1b22c/src/org/benchtests/neo4j/TestLinkage.java
that's the main file to run, tho you don't have to and you prolly don't have
the time; just putting it here for consistency/historical-reasons I guess xD
)
  In my project, I am trying something generic... something like building
sets and lists of nodes based on graphs. Imagine having no metadata
key-value properties, and instead *trying to use only nodes to represent *those
properties (well actually I do have one key aka "name" so i could reference
the same node id across application restarts).


>
> What are both times in a second or third run (cached) ?
>
Path from one to two: (one)-->(two) timedelta=2,151,018,862 ns
Path from one to two: (one)-->(two) timedelta=131,458,292 ns
Path from one to two: (one)-->(two) timedelta=119,015,019 ns
another run:
Path from one to two: (one)-->(two) timedelta=1,824,099,688 ns
Path from one to two: (one)-->(two) timedelta=275,772,087 ns
Path from one to two: (one)-->(two) timedelta=352,562,121 ns
---- and when swapped:
Path from one to two: (two)<--(one) timedelta=20,601,181 ns
Path from one to two: (two)<--(one) timedelta=284,212 ns
Path from one to two: (two)<--(one) timedelta=255,431 ns
and another run:
Path from one to two: (two)<--(one) timedelta=20,513,338 ns
Path from one to two: (two)<--(one) timedelta=280,314 ns
Path from one to two: (two)<--(one) timedelta=255,731 ns
(I almost got scared there on the second and third xD)
and here's if I count them first (twice):
total relations count=100,011 timedelta=3,077,046,964 ns
total relations count=100,011 timedelta=151,727,293 ns
Path from one to two: (one)-->(two) timedelta=173,697,069 ns
Path from one to two: (one)-->(two) timedelta=70,160,217 ns
Path from one to two: (one)-->(two) timedelta=91,760,936 ns
---- swapped:
total relations count=100,011 timedelta=3,219,749,623 ns
total relations count=100,011 timedelta=151,675,727 ns
Path from one to two: (two)<--(one) timedelta=10,717,914 ns
Path from one to two: (two)<--(one) timedelta=204,465 ns
Path from one to two: (two)<--(one) timedelta=143,005 ns

>
> Relationships are stored as a chained/linked list in the disk format so it
> is not possible to know how many there are per node.
>
Will this(storage way) possibly change in 1.5 or in the future? I don't
really know how ie. berkeleydb je stores key-value pairs (BTree?) but it
might potential help get a perspective on that(actually I believe y'all know
already);
  In bdb I am able to count the number of values for the same key (using a
cursor via *cursor.count()* )
 By the way, I was storing relationships in bdb je, like this:
 key->value
where each *key *would be the *start node*, and each *value *would be *end
node*, thus a key->value would form a relationship, though I would never
consider a relationship as a single object(I would always need both start
and end nodes to identify/access a relationship). And a key could have more
than one value associated with it. (one to many)
ie.
sameKey->value1
sameKey->value2
sameKey->value3
and using a cursor in bdb, I would be able to parse all values (and count()
them tho not sure how it would do that internally, hopefully not by parsing
them all).
Of course i would have a second database storing them backwards(endnode as
key, startnode as value) like:
value1->sameKey
value2->sameKey
value3->sameKey
this way I would be able to search by endnode too, since I could only find
stuff (BTree/hash fast) only by looking up by key.
(each key/value would be a nodeID btw; I would also have another database
for nodeID->String and String->nodeID, so I can search for both, where the
equivalent of that in neo4j would be the node having a "name" property with
that String value - this so I could access the same nodeID access java
application restarts, that is by using that fixed String)

My problem with bdb je, was that I couldn't find a way (for my use case) to
use transactions without eventually deadlocking...(but if I was as good as
you guys I probably would've found it)... so last time I totally disabled
the use of transactions so that it would seem simpler while I would focus on
some other project aspect.
(ie. parsing from value1 to value 3 in one thread, and from value3 to value1
in another; but not just this case; not to mention *it doesn't support
nested transactions for java edition*)

>
> Cheers
>
Cheerios =))

>
> Michael
>
John


PS: btw, this is a lifelong project(so to speak), which I failed to complete
or even begin (in 6 years so far) due to postponing it each time I ran into
problems (I know this isn't the way to do it like that, instead keep at it
and maybe allow compromises - so fail on my part). And finding your neo4j
database is giving me a new opportunity to redo it using neo4j instead of
bdbje :) I know(!) graph databases(and stuff based on them, that is: THAT
kind of connectivity/accessibility) are the future, and I aim for immediate
accessibility and customizability ;)

>
> Am 23.07.2011 um 04:42 schrieb John cyuczieekc:
>
> > Hey guys, me bugging you again :)
> >
> > (This whole thing is kind of based on the lack of being able to get the
> > number of relationships a node has)
> >
> > If I have two nodes, and the first one has 1 million outgoing
> relationships
> > of the type X to 1 million unique/different nodes,
> > and the second node has 10 incoming relationships of type X (same type)
> of
> > which one is from the first node,
> > then using GraphAlgoFactory.shortestPath  (or suggest a better way?)
> > How can I tell neo4j to iterate the search on the second node's incoming
> > rels simply because it has 10 relationships instead of 1 million, in
> order
> > to check if each relationship is in the form of firstNode-->secondNode ?
> >
> > For the case when first node has 100,000 relationships and second node
> has
> > 10,
> > it takes *1.7 seconds* for shortestPath to find the only one link between
> > them using:
> >
> > final PathFinder<Path> finder = GraphAlgoFactory.shortestPath(
> > Traversal.expanderForTypes( rel, Direction.OUTGOING  ), 1 );
> > final Path foundPath = finder.findSinglePath( *one, two* );
> >
> > I can put Direction.*BOTH *and get the same amount of time
> > *Path from one to two: (one)-->(two) timedelta=1,862,726,634 ns*
> >
> > *BUT*, get this: if I swap the nodes:
> > finder.findSinglePath(* two, one*);
> > and i use either Direction.INCOMING or Direction.*BOTH  *(which makes
> sense
> > for the second node ,right) then I get *20ms* the time until it
> finishes...
> > *Path from one to two: (two)<--(one) timedelta=20,830,111 ns*
> >
> > (both cases are without data being priorly cached)
> >
> > I was expecting it to act like this: (but only when using Direction.BOTH)
> >  see which node has the least number of relationships and iterate on
> those,
> > but this would work if findSinglePath would be made for depth 1 (aka
> > particular case), but as I read "Tries to find a single path between
> startand
> > end nodes." then it makes sense to me why it works like it does... that
> is,
> > iterate on relationships from start node, rather than from end node...
> but
> > I'm not sure if it would *not *make sense to iterate on the end node
> instead
> > of start node, when knowing that end node has less relationships, for
> make
> > the search faster (well at least if depth is one) - I didn't look into
> how
> > neo4j actually does stuff yet :D
> >
> > anyway, it's fairly clear to me that I could make a simple wrapper method
> to
> > make this kind of search faster, *IF* I had the ability to know how many
> > relationships each node has, so I can call findSinglePath  with the first
> > param being the node with the least relationship count :) But as I
> > understood it, it's not possible to find how many rels a node has...
> gimme
> > feat! :)) [by not possible I mean, without having to iterate thru all and
> > count them, which would make the use case here obsolete]
> >
> > PS: clearly all the text I wrote here would benefit from being
> represented
> > by a graph, just think about all those grouping with autohiding the ie.
> "[]"
> > and all kinds of stuff... heh
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
>
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] shortestPath slower than it could be

Reply via email to