Re: [Neo4j] traversing densely populated nodes

Craig Taverner Thu, 30 Jun 2011 14:50:37 -0700

In the amanzi-index I link all indexed nodes into the index tree, so
traversals are straight up the tree. Of course this also means that there
are at least as many relationships as indexed nodes.


I was reviewing Michaels code for the relationship expander, and think that
is a great idea, tranparently using an index instead of the normal
relationships API, and can imagine using the relationship expander to
instead traverse the BTree to the final relationship to the leaf nodes.

So if we imagine a BTree with perhaps 10 or 20 hops from the root to the
leaf node, the relationship expander Michael described would complete all
hops and return only the last relationship, giving the illusion of direct
connections from root to leaf. This would certainly perform well, especially
for cases where there are factors limiting the number of relationships we
want returned. I think the request for type and direction is the first
obvious case, but we could be even more explicit than that, if we pass
constraints based on the BTree's consistent hash.

On Thu, Jun 30, 2011 at 11:36 PM, Niels Hoogeveen <pd_aficion...@hotmail.com
> wrote:

>
> In theory the approach I described earlier could work, though there are
> some pitfalls to the current implementation that need ironing out before
> this can become a recommended approach.
> The choice of Timeline instead of Btree may actually be the wrong choice
> after all. I chose Timeline because of my familiarity with this particular
> class, but its implementation may actually not be all that suitable for this
> particular use case. This has to do with the fact that Timeline is not just
>  a tree, but a list where entries with an interval of max. 1000 are stored
> in a Btree index. This works reasonably well for a Timeline, but makes the
> approach less ideal for storing dense relationships.
> The problem with the Timeline implementation is the ability to lookup the
> tree root from a particular leave. In an ordinary Btree is would simply be a
> traversal from the leave through the layers of block nodes to the tree root.
> In Timeline the traversal will be different. It first has to move through
> the Timeline list until it finds an entry that is stored in the Btree (which
> worst case takes 1000 hops), and then it has to traverse the Btree up to the
> tree root. To avoid this complicated traversal I ended up doing a lookup
> through Lucene of the timeline URI (which is stored in all timeline list
> entries). In fact I might as well have added the URI of the dense node as a
> property and do the lookup through Lucene without the Timeline, it just
> happens that I like the sort order of Timeline, making it a useful approach
> anyway.
> I will experiment using Btree directly (without Timeline) and see if that
> leads to a simpler and faster traversal from leave to root node.
> There is one more issue before this can become production ready. Btree as
> it is implemented now is not thread safe (per the implementations Javadocs),
> so it need some love and attention to make it work properly.
> Niels
>
> > Date: Thu, 30 Jun 2011 13:57:20 +0200
> > From: cr...@amanzi.com
> > To: user@lists.neo4j.org
> > Subject: Re: [Neo4j] traversing densely populated nodes
> >
> > This topics has come up before, and the domain level solutions are
> usually
> > very similar, like Norbert's category/proxy nodes (to group by
> > type/direction) and Niels' TimeLineIndex (BTree). I wonder whether we can
> > build a generic user-level solution that can also be wrapped to appear as
> an
> > internal database solution?
> >
> > For example, consider Niels's solution of the TimeLine index. In this
> case
> > we group all the nodes based on a consistent hash. Usually the timeline
> > would use a timestamp, but really any reasonably variable property can
> do,
> > even the node-id itself. Then we have a BTree between the dense nodes and
> > the root node (node with too many relationships). How about this crazy
> idea,
> > create an API that mimics the normal node.getRelationship*() API, but
> > internally traverses the entire tree? And also for creating the
> > relationships? So for most cod we just do the usual
> > node.createRelationshipTo(node,type,direction) and node.traverse(...),
> but
> > internally we actually traverse the b-tree.
> >
> > This would solve the performance bottleneck being observed while keeping
> the
> > 'illusion' of directly connected relationships. The solution would be
> > implemented mostly in the application space, so will not need any changes
> to
> > the core database. I see this as being of the same kind of solution as
> the
> > auto-indexing. We setup some initial configuration that results in
> certain
> > structures being created on demand. With auto-indexing we are talking
> about
> > mostly automatically adding lucene indexes. With this idea we are talking
> > about automatically replacing direct relationships with b-trees to
> resolve a
> > specific performance issue.
> >
> > And when the relationship density is very low, if the b-tree is
> > auto-balancing, it could just be a direct relationship anyway.
> >
> > On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas
> > <agelos.pikou...@gmail.com>wrote:
> >
> > > My problem pattern is exactly the same as Niels's :
> > >
> > > A dense-node has millions of relations of a certain direction & type,
> > > and only a few (sparse) relations of a different direction and type.
> > > The traversing is usually following only those sparse relationships on
> > > those
> > > dense-nodes.
> > >
> > > Now, even when traversing on these sparse relations, neo4j becomes
> > > extremely
> > > slow
> > > on a certainly non linear Order (the big cs O).
> > >
> > > Some tests I run (email me if u want the code) reveal that even the
> number
> > > of those dense-nodes in the database greatly influences the results.
> > >
> > > I just reported to Michael the runs with the latest M05 snapshot, which
> are
> > > not very positive...
> > > I have suggested an (auto) indexing of relationship types / direction
> that
> > > is used by traversing frameworks,
> > > but I ain't no graphdb-engine expert :-(
> > >
> > > A'
> > >
> > >
> > > Message: 5
> > > > Date: Wed, 29 Jun 2011 18:19:10 +0200
> > > > From: Niels Hoogeveen <pd_aficion...@hotmail.com>
> > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > To: <user@lists.neo4j.org>
> > > > Message-ID: <col110-w326b152552b8f7fbe1312d8b...@phx.gbl>
> > > > Content-Type: text/plain; charset="iso-8859-1"
> > > >
> > > >
> > > > Michael,
> > > >
> > > >
> > > >
> > > > The issue I am refering to does not pertain to traversing many
> relations
> > > at
> > > > once
> > > >
> > > > but the impact many relationship of one type have on relationships
> > > >
> > > > of another type on the same node.
> > > >
> > > >
> > > >
> > > > Example:
> > > >
> > > >
> > > >
> > > > A topic class has 2 million outgoing relationships of type
> "HAS_INSTANCE"
> > > > and
> > > >
> > > > has 3 outgoing relationships of type "SUB_CLASS_OF".
> > > >
> > > >
> > > >
> > > > Fetching the 3 relations of type "SUB_CLASS_OF" takes very long,
> > > >
> > > > I presume due to the presence of the 2 million other relationships.
> > > >
> > > >
> > > >
> > > > I have no need to ever fetch the "HAS_INSTANCE" relationships from
> > > >
> > > > the topic node. That relation is always traversed from the other
> > > direction.
> > > >
> > > >
> > > >
> > > > I do want to know the class of a topic instance, leading to he topic
> > > class,
> > > >
> > > > but have no real interest ever to traverse all topic instance from
>  the
> > > > topic
> > > >
> > > > class (at least not directly.. i do want to know the most recent
> > > addition,
> > > >
> > > > and that's what I use the timeline index for).
> > > >
> > > >
> > > >
> > > > Niels
> > > >
> > > >
> > > > > From: michael.hun...@neotechnology.com
> > > > > Date: Wed, 29 Jun 2011 17:50:08 +0200
> > > > > To: user@lists.neo4j.org
> > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > >
> > > > > I think this is the same problem that Angelos is facing, we are
> > > currently
> > > > evaluating options to improve the performance on those highly
> connected
> > > > supernodes.
> > > > >
> > > > > A traditional option is really to split them into group or even
> kind of
> > > > shard their relationships to a second layer.
> > > > >
> > > > > We're looking into storage improvement options as well as
> modifications
> > > > to retrieval of that many relationships at once.
> > > > >
> > > > > Cheers
> > > > >
> > > > > Michael
> > > >
> > > _______________________________________________
> > > Neo4j mailing list
> > > User@lists.neo4j.org
> > > https://lists.neo4j.org/mailman/listinfo/user
> > >
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
>
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] traversing densely populated nodes

Reply via email to