Re: [Neo4j] shortestPath slower than it could be

Michael Hunger Sat, 23 Jul 2011 06:16:53 -0700

Perhaps you can catch up with Niels Hoogeven who's currently implementing 
Collection data structures, B- and R-Trees to be transparently mapped into 
graph structures at:


https://github.com/peterneubauer/graph-collections

Cheers

Michael

Perhaps you should describe the intent / idea behind your project. Dropping too 
early into implementation details might remove some possibilities that are able 
with graph databases by loosing the big picture.

And I wouldn't care about performance right now (btw. 
System.currentTimeMillis() is mostly good enough for measurement). But rather 
on modeling your domain to the graph and specify which concepts and 
operations/use-cases you'd like to support.

You can optimize afterwards anyway.

Am 23.07.2011 um 14:29 schrieb John cyuczieekc:

> Hail :)
> Beware, lot of noise follows (ie. webnoise kinda noise), while I thank
> anyone in advance for reading, I must also apologize for failing to be
> concise :-"
> On Sat, Jul 23, 2011 at 10:42 AM, Michael Hunger <
> michael.hun...@neotechnology.com> wrote:
> 
>> John,
>> 
>> no problem :)
>> 
>> You pointed out both problems:
>> - cold caches
>> - lots of rels on the one side
>> 
>> There are some performance implications of loading millions of rels from
>> disk. We'll be working on improving those in 1.5.
>> 
>> The easiest way to solve that is to switch start and end-node which you
>> already did. It is much easier for you _with domain knowledge_ about the
>> graph than for the algorithm.
>> 
> Yeah I would probably apply that manually, that is: call a particular method
> when I know which one of the nodes has least rels. (as a ' `temporary` or
> `acceptable so far` workaround ' [u see the graph here?! xD]* I will for now
> use findSinglePath as it is*, simply because I don't want the overhead of
> using a particular method for some use case scenario and the generic one for
> others - besides it's fast enough, for now)
> My problem here is that I might not know (in the future) which node (first
> or second) will have the least nodes. So I was hoping to create a generic
> method for such find.
> For now, I could say, I can know.
> For example:
> AllLists->A
> AllLists->B
> AllLists->C
> where A,B,C are considered nodes representing lists, in this case, I could
> maybe say that while AllLists can have lots of outgoing rels (aka lots of
> lists), list A may not have that many nodes pointing to it (aka list is not
> used in as many places as there are lists), so I could consider A as being
> node with least nodes.
> Though I would want a generic method that would always know which node has
> least nodes and parse that one, especially in my 1 hop case.
>  I am, as seen here, basically tagging the nodes A,B,C with AllLists, by
> having AllLists point to them, and this way, I know that node A is of
> AllLists type (so to speak). This was my old way of thinking while using bdb
> je, I might need to revise it since then I was completely ignoring
> relationships and not considering them as a single and accessible object. (I
> am currently trying to port that old way of thinking to neo4j)
>  In this old way, I was using the node AllLists pointing/outgoing to A as
> being (in neo4j) a relationship of the type AllLists incoming to A.
> 
>> 
>> Especially if you have > 1 hop paths.
>> 
> In my particular use case, I will probably not (yet!) have cases with >1
> hop. (I can't really visualize that far in the future xD ) but for sure I
> will have lots of ==1 hops cases where I just want to see if thisnode is
> tagged with thatnode by checking thatnode-->thisnode
> 
>> 
>> There might be ways in improving the algorithm, e.g. by iterating both
>> sides at the same time, which would lead to the end with the fewer
>> relationships being exhausted and resolved first.
>> 
> for now, I totally avoided getting into the findSinglePath 's source and
> considering doing changes there or even seeing how it works:)
> But that algo sounds like something I'd do, in like 2 (reusable) threads
> even xD (I remember having considered this in the past, when using bdb je
> but I opted to parse the node with least rels instead)
> 
>> 
>> How large is your graph at all? And how is it structured, e.g. how many
>> rels do the other 9 nodes have that your second node points to?
>> 
> I was merely trying this as an example(but this should be a practical use
> case in my project), the thing is, the other 9 nodes can have any number or
> rels, in fact *any node could have any number of rels incoming and outgoing
> from it*,... (but in this case, the 9 nodes each had a new unused node
> point(outgoing) to them; I'd post the source but i'm already using wrappers
> and the whole example would be more than 1 java file, but I can link to it:
> https://github.com/13th-floor/neo4john/blob/52e2d75a39a28200b34032aaf45a5a09c1e1b22c/src/org/benchtests/neo4j/TestLinkage.java
> that's the main file to run, tho you don't have to and you prolly don't have
> the time; just putting it here for consistency/historical-reasons I guess xD
> )
>  In my project, I am trying something generic... something like building
> sets and lists of nodes based on graphs. Imagine having no metadata
> key-value properties, and instead *trying to use only nodes to represent 
> *those
> properties (well actually I do have one key aka "name" so i could reference
> the same node id across application restarts).
> 
> 
>> 
>> What are both times in a second or third run (cached) ?
>> 
> Path from one to two: (one)-->(two) timedelta=2,151,018,862 ns
> Path from one to two: (one)-->(two) timedelta=131,458,292 ns
> Path from one to two: (one)-->(two) timedelta=119,015,019 ns
> another run:
> Path from one to two: (one)-->(two) timedelta=1,824,099,688 ns
> Path from one to two: (one)-->(two) timedelta=275,772,087 ns
> Path from one to two: (one)-->(two) timedelta=352,562,121 ns
> ---- and when swapped:
> Path from one to two: (two)<--(one) timedelta=20,601,181 ns
> Path from one to two: (two)<--(one) timedelta=284,212 ns
> Path from one to two: (two)<--(one) timedelta=255,431 ns
> and another run:
> Path from one to two: (two)<--(one) timedelta=20,513,338 ns
> Path from one to two: (two)<--(one) timedelta=280,314 ns
> Path from one to two: (two)<--(one) timedelta=255,731 ns
> (I almost got scared there on the second and third xD)
> and here's if I count them first (twice):
> total relations count=100,011 timedelta=3,077,046,964 ns
> total relations count=100,011 timedelta=151,727,293 ns
> Path from one to two: (one)-->(two) timedelta=173,697,069 ns
> Path from one to two: (one)-->(two) timedelta=70,160,217 ns
> Path from one to two: (one)-->(two) timedelta=91,760,936 ns
> ---- swapped:
> total relations count=100,011 timedelta=3,219,749,623 ns
> total relations count=100,011 timedelta=151,675,727 ns
> Path from one to two: (two)<--(one) timedelta=10,717,914 ns
> Path from one to two: (two)<--(one) timedelta=204,465 ns
> Path from one to two: (two)<--(one) timedelta=143,005 ns
> 
>> 
>> Relationships are stored as a chained/linked list in the disk format so it
>> is not possible to know how many there are per node.
>> 
> Will this(storage way) possibly change in 1.5 or in the future? I don't
> really know how ie. berkeleydb je stores key-value pairs (BTree?) but it
> might potential help get a perspective on that(actually I believe y'all know
> already);
>  In bdb I am able to count the number of values for the same key (using a
> cursor via *cursor.count()* )
> By the way, I was storing relationships in bdb je, like this:
> key->value
> where each *key *would be the *start node*, and each *value *would be *end
> node*, thus a key->value would form a relationship, though I would never
> consider a relationship as a single object(I would always need both start
> and end nodes to identify/access a relationship). And a key could have more
> than one value associated with it. (one to many)
> ie.
> sameKey->value1
> sameKey->value2
> sameKey->value3
> and using a cursor in bdb, I would be able to parse all values (and count()
> them tho not sure how it would do that internally, hopefully not by parsing
> them all).
> Of course i would have a second database storing them backwards(endnode as
> key, startnode as value) like:
> value1->sameKey
> value2->sameKey
> value3->sameKey
> this way I would be able to search by endnode too, since I could only find
> stuff (BTree/hash fast) only by looking up by key.
> (each key/value would be a nodeID btw; I would also have another database
> for nodeID->String and String->nodeID, so I can search for both, where the
> equivalent of that in neo4j would be the node having a "name" property with
> that String value - this so I could access the same nodeID access java
> application restarts, that is by using that fixed String)
> 
> My problem with bdb je, was that I couldn't find a way (for my use case) to
> use transactions without eventually deadlocking...(but if I was as good as
> you guys I probably would've found it)... so last time I totally disabled
> the use of transactions so that it would seem simpler while I would focus on
> some other project aspect.
> (ie. parsing from value1 to value 3 in one thread, and from value3 to value1
> in another; but not just this case; not to mention *it doesn't support
> nested transactions for java edition*)
> 
>> 
>> Cheers
>> 
> Cheerios =))
> 
>> 
>> Michael
>> 
> John
> 
> 
> PS: btw, this is a lifelong project(so to speak), which I failed to complete
> or even begin (in 6 years so far) due to postponing it each time I ran into
> problems (I know this isn't the way to do it like that, instead keep at it
> and maybe allow compromises - so fail on my part). And finding your neo4j
> database is giving me a new opportunity to redo it using neo4j instead of
> bdbje :) I know(!) graph databases(and stuff based on them, that is: THAT
> kind of connectivity/accessibility) are the future, and I aim for immediate
> accessibility and customizability ;)
> 
>> 
>> Am 23.07.2011 um 04:42 schrieb John cyuczieekc:
>> 
>>> Hey guys, me bugging you again :)
>>> 
>>> (This whole thing is kind of based on the lack of being able to get the
>>> number of relationships a node has)
>>> 
>>> If I have two nodes, and the first one has 1 million outgoing
>> relationships
>>> of the type X to 1 million unique/different nodes,
>>> and the second node has 10 incoming relationships of type X (same type)
>> of
>>> which one is from the first node,
>>> then using GraphAlgoFactory.shortestPath  (or suggest a better way?)
>>> How can I tell neo4j to iterate the search on the second node's incoming
>>> rels simply because it has 10 relationships instead of 1 million, in
>> order
>>> to check if each relationship is in the form of firstNode-->secondNode ?
>>> 
>>> For the case when first node has 100,000 relationships and second node
>> has
>>> 10,
>>> it takes *1.7 seconds* for shortestPath to find the only one link between
>>> them using:
>>> 
>>> final PathFinder<Path> finder = GraphAlgoFactory.shortestPath(
>>> Traversal.expanderForTypes( rel, Direction.OUTGOING  ), 1 );
>>> final Path foundPath = finder.findSinglePath( *one, two* );
>>> 
>>> I can put Direction.*BOTH *and get the same amount of time
>>> *Path from one to two: (one)-->(two) timedelta=1,862,726,634 ns*
>>> 
>>> *BUT*, get this: if I swap the nodes:
>>> finder.findSinglePath(* two, one*);
>>> and i use either Direction.INCOMING or Direction.*BOTH  *(which makes
>> sense
>>> for the second node ,right) then I get *20ms* the time until it
>> finishes...
>>> *Path from one to two: (two)<--(one) timedelta=20,830,111 ns*
>>> 
>>> (both cases are without data being priorly cached)
>>> 
>>> I was expecting it to act like this: (but only when using Direction.BOTH)
>>> see which node has the least number of relationships and iterate on
>> those,
>>> but this would work if findSinglePath would be made for depth 1 (aka
>>> particular case), but as I read "Tries to find a single path between
>> startand
>>> end nodes." then it makes sense to me why it works like it does... that
>> is,
>>> iterate on relationships from start node, rather than from end node...
>> but
>>> I'm not sure if it would *not *make sense to iterate on the end node
>> instead
>>> of start node, when knowing that end node has less relationships, for
>> make
>>> the search faster (well at least if depth is one) - I didn't look into
>> how
>>> neo4j actually does stuff yet :D
>>> 
>>> anyway, it's fairly clear to me that I could make a simple wrapper method
>> to
>>> make this kind of search faster, *IF* I had the ability to know how many
>>> relationships each node has, so I can call findSinglePath  with the first
>>> param being the node with the least relationship count :) But as I
>>> understood it, it's not possible to find how many rels a node has...
>> gimme
>>> feat! :)) [by not possible I mean, without having to iterate thru all and
>>> count them, which would make the use case here obsolete]
>>> 
>>> PS: clearly all the text I wrote here would benefit from being
>> represented
>>> by a graph, just think about all those grouping with autohiding the ie.
>> "[]"
>>> and all kinds of stuff... heh
>>> _______________________________________________
>>> Neo4j mailing list
>>> User@lists.neo4j.org
>>> https://lists.neo4j.org/mailman/listinfo/user
>> 
>> _______________________________________________
>> Neo4j mailing list
>> User@lists.neo4j.org
>> https://lists.neo4j.org/mailman/listinfo/user
>> 
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] shortestPath slower than it could be

Reply via email to