Re: [Neo4j] User Digest, Vol 51, Issue 164

Agelos Pikoulas Thu, 30 Jun 2011 13:16:57 -0700

Dear Graphistas,

Im glad this topic is getting traction, cause I think this issue is *MAJOR*.



I firmly believe that if neo4j wants to become the Web 2/3 ready and Social
1/2 ready graphdb, it has to be able to accommodate fast Traversals among
hundreds of thousands or millions of "artifact" nodes (eg Books, Movies,
songs) that have perhaps thousands or millions of relationships of "what"
(likes, purchases, categories etc) from  perhaps some among hundred millions
of "person" nodes (If your graph app makes it, that is the point, right ?).
There's no way avoiding having no super/dense nodes.

To the topic,

Aliabbas, I ve sent you the code and some info.

>
Craig Taverne, I think your idea sounds wonderful. Perhaps it could be even
simpler than this : each node should be able to access *only* its outgoing
or incoming relations, as well as each relation-types using a separate
inherent "collection" for each. Perhaps this could be a simple
Hashmap<RelationType,
RelationList> or something. But maybe I am being too naive here, if it were
so trivial the wonderful people @ neotechnology would have done it, right?

Best Regards,

Agelos Pikoulas
------------------------------

>
> Message: 8
> Date: Thu, 30 Jun 2011 13:57:20 +0200
> From: Craig Taverner <cr...@amanzi.com>
> Subject: Re: [Neo4j] traversing densely populated nodes
> To: Neo4j user discussions <user@lists.neo4j.org>
> Message-ID: <BANLkTikz-u8Ok-Oobqvz=PMFDLBwi=_...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> This topics has come up before, and the domain level solutions are usually
> very similar, like Norbert's category/proxy nodes (to group by
> type/direction) and Niels' TimeLineIndex (BTree). I wonder whether we can
> build a generic user-level solution that can also be wrapped to appear as
> an
> internal database solution?
>
> For example, consider Niels's solution of the TimeLine index. In this case
> we group all the nodes based on a consistent hash. Usually the timeline
> would use a timestamp, but really any reasonably variable property can do,
> even the node-id itself. Then we have a BTree between the dense nodes and
> the root node (node with too many relationships). How about this crazy
> idea,
> create an API that mimics the normal node.getRelationship*() API, but
> internally traverses the entire tree? And also for creating the
> relationships? So for most cod we just do the usual
> node.createRelationshipTo(node,type,direction) and node.traverse(...), but
> internally we actually traverse the b-tree.
>
> This would solve the performance bottleneck being observed while keeping
> the
> 'illusion' of directly connected relationships. The solution would be
> implemented mostly in the application space, so will not need any changes
> to
> the core database. I see this as being of the same kind of solution as the
> auto-indexing. We setup some initial configuration that results in certain
> structures being created on demand. With auto-indexing we are talking about
> mostly automatically adding lucene indexes. With this idea we are talking
> about automatically replacing direct relationships with b-trees to resolve
> a
> specific performance issue.
>
> And when the relationship density is very low, if the b-tree is
> auto-balancing, it could just be a direct relationship anyway.
>
> On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas
> <agelos.pikou...@gmail.com>wrote:
>
> > My problem pattern is exactly the same as Niels's :
> >
> > A dense-node has millions of relations of a certain direction & type,
> > and only a few (sparse) relations of a different direction and type.
> > The traversing is usually following only those sparse relationships on
> > those
> > dense-nodes.
> >
> > Now, even when traversing on these sparse relations, neo4j becomes
> > extremely
> > slow
> > on a certainly non linear Order (the big cs O).
> >
> > Some tests I run (email me if u want the code) reveal that even the
> number
> > of those dense-nodes in the database greatly influences the results.
> >
> > I just reported to Michael the runs with the latest M05 snapshot, which
> are
> > not very positive...
> > I have suggested an (auto) indexing of relationship types / direction
> that
> > is used by traversing frameworks,
> > but I ain't no graphdb-engine expert :-(
> >
> > A'
> >
> >
> > Message: 5
> > > Date: Wed, 29 Jun 2011 18:19:10 +0200
> > > From: Niels Hoogeveen <pd_aficion...@hotmail.com>
> > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > To: <user@lists.neo4j.org>
> > > Message-ID: <col110-w326b152552b8f7fbe1312d8b...@phx.gbl>
> > > Content-Type: text/plain; charset="iso-8859-1"
> > >
> > >
> > > Michael,
> > >
> > >
> > >
> > > The issue I am refering to does not pertain to traversing many
> relations
> > at
> > > once
> > >
> > > but the impact many relationship of one type have on relationships
> > >
> > > of another type on the same node.
> > >
> > >
> > >
> > > Example:
> > >
> > >
> > >
> > > A topic class has 2 million outgoing relationships of type
> "HAS_INSTANCE"
> > > and
> > >
> > > has 3 outgoing relationships of type "SUB_CLASS_OF".
> > >
> > >
> > >
> > > Fetching the 3 relations of type "SUB_CLASS_OF" takes very long,
> > >
> > > I presume due to the presence of the 2 million other relationships.
> > >
> > >
> > >
> > > I have no need to ever fetch the "HAS_INSTANCE" relationships from
> > >
> > > the topic node. That relation is always traversed from the other
> > direction.
> > >
> > >
> > >
> > > I do want to know the class of a topic instance, leading to he topic
> > class,
> > >
> > > but have no real interest ever to traverse all topic instance from  the
> > > topic
> > >
> > > class (at least not directly.. i do want to know the most recent
> > addition,
> > >
> > > and that's what I use the timeline index for).
> > >
> > >
> > >
> > > Niels
> > >
> > >
> > > > From: michael.hun...@neotechnology.com
> > > > Date: Wed, 29 Jun 2011 17:50:08 +0200
> > > > To: user@lists.neo4j.org
> > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > >
> > > > I think this is the same problem that Angelos is facing, we are
> > currently
> > > evaluating options to improve the performance on those highly connected
> > > supernodes.
> > > >
> > > > A traditional option is really to split them into group or even kind
> of
> > > shard their relationships to a second layer.
> > > >
> > > > We're looking into storage improvement options as well as
> modifications
> > > to retrieval of that many relationships at once.
> > > >
> > > > Cheers
> > > >
> > > > Michael
> > >
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> >
>
>
> ------------------------------
>
> _______________________________________________
> User mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
>
> End of User Digest, Vol 51, Issue 164
> *************************************
>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] User Digest, Vol 51, Issue 164

Reply via email to