Re: [Neo4j] shortestPath slower than it could be

John cyuczieekc Sat, 23 Jul 2011 07:15:50 -0700

I would gladly try to catch up with Niels but from what I've read (from his
prev. emails) I hardly understand what he was talking about (implying I am
no way near as smart to understand that; which might mean I am simply
missing the background/unstated-assumptions or focusing fail :)) [also I
don't know if we're doing similar things? probably because I don't yet
understand what he's doing, or I don't fully understand which causes me to
drop even what I did understand, flagging it as `didn't understand anything`
xD]


I would describe the idea behind my project, but I'm not sure that I can
envision that far (due to failure to keep my focus perhaps), hence I was
hoping I could work my way bottom-up implementing (though this would never
really work). And you're right about performance, I shouldn't care, but I
lose the sight sometimes and get stuck into small details like that,
effectively preventing me to continue. Anyway, enough about what is, let's
change that heh. I really appreciate what you said btw and I find it
helpful.

Ok so idea would be something like, building systems on top of other systems
with the main goal being accessibility. It's kind of generic really. I don't
want something very specific. I just want to be able to access information
and have the computer also access this information in an AI-friendly kinda
way. That is, if I don't know anything, but I know how to navigate links
(supposedly most basic task), then I(as an AI or intelligence) would be able
to parse and understand how this (part of the) system works, and even change
it (evolve it, ie. multiversioning? / git-like).
 Specific examples, let's see, when done, I should be able to do programming
or storing random thoughts/phrases have them linked to the timestamp of
their creation even, create a small random monitor (think 2D screen in 3D)
and have it show something and starting from any pixel on it, I would be
able to track things down and see/understand how that pixel ended up on that
particular spot. So basically a "right" way of organizing and accessing
anything, be it offline data or live programs, such that an intelligence
(human or otherwise AI/computer) would have no trouble
understanding/learning how it works). I think the key to building such a
thing is simply links/relationships. I just know this can be done (on top of
this lowest level:) by simply creating Nodes (which are basically random
unique IDs) and having nothing but those nodes and the relationships between
them (no real metadata would really be needed), just the ability to know
that this node X is different than this node Y (which is done via their IDs
really; but if the java level requires accessing the same node ID then it
would require a layer for itself only to have a name associated with that
ID, so it can access it by name, across app restarts), and knowing if X
points to Y or Y is pointed-to by X.
(btw, I don't really know or use BTrees but I was hoping them being use at
the lower than Node/Relationship level, would benefit fast searches ie. is
X->Y or find x such that {A->x->B; F->C->x->K; P->x}; but at the high level
I am referring with Nodes/Relationships I don't see the need for a BTree
altho it could be implemented on top of whatever exists; don't know about
Rtree though, I am wikipedia-ing it right now, looks yummy xD)

Anyway, the point is, like in a graph I guess, start from any point and be
able to parse/understand the entire system (except if some grouped nodes are
simply disconnected from the grouped ones I started in, you'd say; but
really there would/should probably be no such nodes, at least the reference
node would be pointing to them, but most likely they would be grouped
together by a parent node pointing to each of them)

So basically, I lost it :)
There will be issues that need to be tackled, like not using locks, using
multiversioning instead (or something)... I don't even know anymore - I know
sometimes it seemed impossible even theoretically but sometimes it seemed
those impossibilities vanished.

What more should I say? (really, 'cause I don't know)
I just want a system you know, where stuff can be accessed to the very core,
to all level depths, such that nothing would really be inaccessible, so that
I wouldn't need to go search for how something is made by looking on a 3rd
party site or doc file just because there's no tool/way to allow me the kind
of accessibility I know I want, sort of like going around it and hoping the
doc/src is in sync with the binary I'm just running (just an example), but
rather bin+src+doc would be one, and even allow me to change them on the fly
with changes propagating to all dependencies... erm, yeah feel free to stop
me from blabbing xD

Peace off :)

On Sat, Jul 23, 2011 at 3:16 PM, Michael Hunger <
michael.hun...@neotechnology.com> wrote:

> Perhaps you can catch up with Niels Hoogeven who's currently implementing
> Collection data structures, B- and R-Trees to be transparently mapped into
> graph structures at:
>
> it might be that B&R Trees would be only a subset of what I'm trying to do,
and so I wouldn't use them to implement what I want on top of them, rather
side by side to them, but then again, I fail to properly focus so I might be
misunderstanding the whole thing.

> https://github.com/peterneubauer/graph-collections
>
> Cheers
>
> Michael
>
> Perhaps you should describe the intent / idea behind your project. Dropping
> too early into implementation details might remove some possibilities that
> are able with graph databases by loosing the big picture.
>
I definitely lost myself into details and implementation limits since long
ago... such that whatever the idea was is merely now based-on/adapted-to the
implementation limits.

>
> And I wouldn't care about performance right now (btw.
> System.currentTimeMillis() is mostly good enough for measurement). But
> rather on modeling your domain to the graph and specify which concepts and
> operations/use-cases you'd like to support.
>
> You can optimize afterwards anyway.
>
Sometime i would find that I had to redo/redesign the entire project in
order to be able to gain the required performance or in order to be able to
have a certain feature. Examples(which may not pertain to what I just said):
find a centralized way of reporting all exceptions thrown, using aspectj to
hook on each call as to be able to catch each `throw`; making sure throws in
finally don't overwrite throws in try; transactional memory attempt... -
this kind of silly non-sensical details

>
> Am 23.07.2011 um 14:29 schrieb John cyuczieekc:
>
> > Hail :)
> > Beware, lot of noise follows (ie. webnoise kinda noise), while I thank
> > anyone in advance for reading, I must also apologize for failing to be
> > concise :-"
> > On Sat, Jul 23, 2011 at 10:42 AM, Michael Hunger <
> > michael.hun...@neotechnology.com> wrote:
> >
> >> John,
> >>
> >> no problem :)
> >>
> >> You pointed out both problems:
> >> - cold caches
> >> - lots of rels on the one side
> >>
> >> There are some performance implications of loading millions of rels from
> >> disk. We'll be working on improving those in 1.5.
> >>
> >> The easiest way to solve that is to switch start and end-node which you
> >> already did. It is much easier for you _with domain knowledge_ about the
> >> graph than for the algorithm.
> >>
> > Yeah I would probably apply that manually, that is: call a particular
> method
> > when I know which one of the nodes has least rels. (as a ' `temporary` or
> > `acceptable so far` workaround ' [u see the graph here?! xD]* I will for
> now
> > use findSinglePath as it is*, simply because I don't want the overhead of
> > using a particular method for some use case scenario and the generic one
> for
> > others - besides it's fast enough, for now)
> > My problem here is that I might not know (in the future) which node
> (first
> > or second) will have the least nodes. So I was hoping to create a generic
> > method for such find.
> > For now, I could say, I can know.
> > For example:
> > AllLists->A
> > AllLists->B
> > AllLists->C
> > where A,B,C are considered nodes representing lists, in this case, I
> could
> > maybe say that while AllLists can have lots of outgoing rels (aka lots of
> > lists), list A may not have that many nodes pointing to it (aka list is
> not
> > used in as many places as there are lists), so I could consider A as
> being
> > node with least nodes.
> > Though I would want a generic method that would always know which node
> has
> > least nodes and parse that one, especially in my 1 hop case.
> >  I am, as seen here, basically tagging the nodes A,B,C with AllLists, by
> > having AllLists point to them, and this way, I know that node A is of
> > AllLists type (so to speak). This was my old way of thinking while using
> bdb
> > je, I might need to revise it since then I was completely ignoring
> > relationships and not considering them as a single and accessible object.
> (I
> > am currently trying to port that old way of thinking to neo4j)
> >  In this old way, I was using the node AllLists pointing/outgoing to A as
> > being (in neo4j) a relationship of the type AllLists incoming to A.
> >
> >>
> >> Especially if you have > 1 hop paths.
> >>
> > In my particular use case, I will probably not (yet!) have cases with >1
> > hop. (I can't really visualize that far in the future xD ) but for sure I
> > will have lots of ==1 hops cases where I just want to see if thisnode is
> > tagged with thatnode by checking thatnode-->thisnode
> >
> >>
> >> There might be ways in improving the algorithm, e.g. by iterating both
> >> sides at the same time, which would lead to the end with the fewer
> >> relationships being exhausted and resolved first.
> >>
> > for now, I totally avoided getting into the findSinglePath 's source and
> > considering doing changes there or even seeing how it works:)
> > But that algo sounds like something I'd do, in like 2 (reusable) threads
> > even xD (I remember having considered this in the past, when using bdb je
> > but I opted to parse the node with least rels instead)
> >
> >>
> >> How large is your graph at all? And how is it structured, e.g. how many
> >> rels do the other 9 nodes have that your second node points to?
> >>
> > I was merely trying this as an example(but this should be a practical use
> > case in my project), the thing is, the other 9 nodes can have any number
> or
> > rels, in fact *any node could have any number of rels incoming and
> outgoing
> > from it*,... (but in this case, the 9 nodes each had a new unused node
> > point(outgoing) to them; I'd post the source but i'm already using
> wrappers
> > and the whole example would be more than 1 java file, but I can link to
> it:
> >
> https://github.com/13th-floor/neo4john/blob/52e2d75a39a28200b34032aaf45a5a09c1e1b22c/src/org/benchtests/neo4j/TestLinkage.java
> > that's the main file to run, tho you don't have to and you prolly don't
> have
> > the time; just putting it here for consistency/historical-reasons I guess
> xD
> > )
> >  In my project, I am trying something generic... something like building
> > sets and lists of nodes based on graphs. Imagine having no metadata
> > key-value properties, and instead *trying to use only nodes to represent
> *those
> > properties (well actually I do have one key aka "name" so i could
> reference
> > the same node id across application restarts).
> >
> >
> >>
> >> What are both times in a second or third run (cached) ?
> >>
> > Path from one to two: (one)-->(two) timedelta=2,151,018,862 ns
> > Path from one to two: (one)-->(two) timedelta=131,458,292 ns
> > Path from one to two: (one)-->(two) timedelta=119,015,019 ns
> > another run:
> > Path from one to two: (one)-->(two) timedelta=1,824,099,688 ns
> > Path from one to two: (one)-->(two) timedelta=275,772,087 ns
> > Path from one to two: (one)-->(two) timedelta=352,562,121 ns
> > ---- and when swapped:
> > Path from one to two: (two)<--(one) timedelta=20,601,181 ns
> > Path from one to two: (two)<--(one) timedelta=284,212 ns
> > Path from one to two: (two)<--(one) timedelta=255,431 ns
> > and another run:
> > Path from one to two: (two)<--(one) timedelta=20,513,338 ns
> > Path from one to two: (two)<--(one) timedelta=280,314 ns
> > Path from one to two: (two)<--(one) timedelta=255,731 ns
> > (I almost got scared there on the second and third xD)
> > and here's if I count them first (twice):
> > total relations count=100,011 timedelta=3,077,046,964 ns
> > total relations count=100,011 timedelta=151,727,293 ns
> > Path from one to two: (one)-->(two) timedelta=173,697,069 ns
> > Path from one to two: (one)-->(two) timedelta=70,160,217 ns
> > Path from one to two: (one)-->(two) timedelta=91,760,936 ns
> > ---- swapped:
> > total relations count=100,011 timedelta=3,219,749,623 ns
> > total relations count=100,011 timedelta=151,675,727 ns
> > Path from one to two: (two)<--(one) timedelta=10,717,914 ns
> > Path from one to two: (two)<--(one) timedelta=204,465 ns
> > Path from one to two: (two)<--(one) timedelta=143,005 ns
> >
> >>
> >> Relationships are stored as a chained/linked list in the disk format so
> it
> >> is not possible to know how many there are per node.
> >>
> > Will this(storage way) possibly change in 1.5 or in the future? I don't
> > really know how ie. berkeleydb je stores key-value pairs (BTree?) but it
> > might potential help get a perspective on that(actually I believe y'all
> know
> > already);
> >  In bdb I am able to count the number of values for the same key (using a
> > cursor via *cursor.count()* )
> > By the way, I was storing relationships in bdb je, like this:
> > key->value
> > where each *key *would be the *start node*, and each *value *would be
> *end
> > node*, thus a key->value would form a relationship, though I would never
> > consider a relationship as a single object(I would always need both start
> > and end nodes to identify/access a relationship). And a key could have
> more
> > than one value associated with it. (one to many)
> > ie.
> > sameKey->value1
> > sameKey->value2
> > sameKey->value3
> > and using a cursor in bdb, I would be able to parse all values (and
> count()
> > them tho not sure how it would do that internally, hopefully not by
> parsing
> > them all).
> > Of course i would have a second database storing them backwards(endnode
> as
> > key, startnode as value) like:
> > value1->sameKey
> > value2->sameKey
> > value3->sameKey
> > this way I would be able to search by endnode too, since I could only
> find
> > stuff (BTree/hash fast) only by looking up by key.
> > (each key/value would be a nodeID btw; I would also have another database
> > for nodeID->String and String->nodeID, so I can search for both, where
> the
> > equivalent of that in neo4j would be the node having a "name" property
> with
> > that String value - this so I could access the same nodeID access java
> > application restarts, that is by using that fixed String)
> >
> > My problem with bdb je, was that I couldn't find a way (for my use case)
> to
> > use transactions without eventually deadlocking...(but if I was as good
> as
> > you guys I probably would've found it)... so last time I totally disabled
> > the use of transactions so that it would seem simpler while I would focus
> on
> > some other project aspect.
> > (ie. parsing from value1 to value 3 in one thread, and from value3 to
> value1
> > in another; but not just this case; not to mention *it doesn't support
> > nested transactions for java edition*)
> >
> >>
> >> Cheers
> >>
> > Cheerios =))
> >
> >>
> >> Michael
> >>
> > John
> >
> >
> > PS: btw, this is a lifelong project(so to speak), which I failed to
> complete
> > or even begin (in 6 years so far) due to postponing it each time I ran
> into
> > problems (I know this isn't the way to do it like that, instead keep at
> it
> > and maybe allow compromises - so fail on my part). And finding your neo4j
> > database is giving me a new opportunity to redo it using neo4j instead of
> > bdbje :) I know(!) graph databases(and stuff based on them, that is: THAT
> > kind of connectivity/accessibility) are the future, and I aim for
> immediate
> > accessibility and customizability ;)
> >
> >>
> >> Am 23.07.2011 um 04:42 schrieb John cyuczieekc:
> >>
> >>> Hey guys, me bugging you again :)
> >>>
> >>> (This whole thing is kind of based on the lack of being able to get the
> >>> number of relationships a node has)
> >>>
> >>> If I have two nodes, and the first one has 1 million outgoing
> >> relationships
> >>> of the type X to 1 million unique/different nodes,
> >>> and the second node has 10 incoming relationships of type X (same type)
> >> of
> >>> which one is from the first node,
> >>> then using GraphAlgoFactory.shortestPath  (or suggest a better way?)
> >>> How can I tell neo4j to iterate the search on the second node's
> incoming
> >>> rels simply because it has 10 relationships instead of 1 million, in
> >> order
> >>> to check if each relationship is in the form of firstNode-->secondNode
> ?
> >>>
> >>> For the case when first node has 100,000 relationships and second node
> >> has
> >>> 10,
> >>> it takes *1.7 seconds* for shortestPath to find the only one link
> between
> >>> them using:
> >>>
> >>> final PathFinder<Path> finder = GraphAlgoFactory.shortestPath(
> >>> Traversal.expanderForTypes( rel, Direction.OUTGOING  ), 1 );
> >>> final Path foundPath = finder.findSinglePath( *one, two* );
> >>>
> >>> I can put Direction.*BOTH *and get the same amount of time
> >>> *Path from one to two: (one)-->(two) timedelta=1,862,726,634 ns*
> >>>
> >>> *BUT*, get this: if I swap the nodes:
> >>> finder.findSinglePath(* two, one*);
> >>> and i use either Direction.INCOMING or Direction.*BOTH  *(which makes
> >> sense
> >>> for the second node ,right) then I get *20ms* the time until it
> >> finishes...
> >>> *Path from one to two: (two)<--(one) timedelta=20,830,111 ns*
> >>>
> >>> (both cases are without data being priorly cached)
> >>>
> >>> I was expecting it to act like this: (but only when using
> Direction.BOTH)
> >>> see which node has the least number of relationships and iterate on
> >> those,
> >>> but this would work if findSinglePath would be made for depth 1 (aka
> >>> particular case), but as I read "Tries to find a single path between
> >> startand
> >>> end nodes." then it makes sense to me why it works like it does... that
> >> is,
> >>> iterate on relationships from start node, rather than from end node...
> >> but
> >>> I'm not sure if it would *not *make sense to iterate on the end node
> >> instead
> >>> of start node, when knowing that end node has less relationships, for
> >> make
> >>> the search faster (well at least if depth is one) - I didn't look into
> >> how
> >>> neo4j actually does stuff yet :D
> >>>
> >>> anyway, it's fairly clear to me that I could make a simple wrapper
> method
> >> to
> >>> make this kind of search faster, *IF* I had the ability to know how
> many
> >>> relationships each node has, so I can call findSinglePath  with the
> first
> >>> param being the node with the least relationship count :) But as I
> >>> understood it, it's not possible to find how many rels a node has...
> >> gimme
> >>> feat! :)) [by not possible I mean, without having to iterate thru all
> and
> >>> count them, which would make the use case here obsolete]
> >>>
> >>> PS: clearly all the text I wrote here would benefit from being
> >> represented
> >>> by a graph, just think about all those grouping with autohiding the ie.
> >> "[]"
> >>> and all kinds of stuff... heh
> >>> _______________________________________________
> >>> Neo4j mailing list
> >>> User@lists.neo4j.org
> >>> https://lists.neo4j.org/mailman/listinfo/user
> >>
> >> _______________________________________________
> >> Neo4j mailing list
> >> User@lists.neo4j.org
> >> https://lists.neo4j.org/mailman/listinfo/user
> >>
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
>
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] shortestPath slower than it could be

Reply via email to