The Expander solution seems to be working well (thanks Michael) & the I am
sure proposed B/RTree solutions could be even better.

But the big question is to have this supernode/relationship indexing
functionality integrated into neo4j in a coherent (perhaps plugable) way,
such that can be utilized correctly by users & especially API makers (eg I
really need to hint TinkerPop Pipes/Gremlin to use this relationship
indexing when I do a v(1).outE('HAS_PART') ).

Agelos

Date: Thu, 30 Jun 2011 22:56:22 +0200
From: Michael Hunger <michael.hun...@neotechnology.com>
Subject: Re: [Neo4j] traversing densely populated nodes
To: Neo4j user discussions <user@lists.neo4j.org>
Message-ID: <977ac945-acf0-4a8e-992b-7f6e3a5d0...@neotechnology.com
>
Content-Type: text/plain; charset="us-ascii"

If you want to go with the "indexed-relationship" solution for the supernode
with massive relationships and just a _few_ relationships that I want to
traverse, I tried the following solution that performs better for the
cold-cache but a bit worse for the warm-cache (2nd run) test.

Please make sure that you only use that approach for supernodes and for cold
caches otherwise it will impact the store-performance (as lucene is
definitively slower than graph traversal on cached nodes). And please make
also sure that you only index the "few" important relationships otherwise
the index lookup will be slower.

Cheers

Michael

   private static class IndexedRelationshipExpander implements
RelationshipExpander {
       private final Direction direction;
       private RelType relType;
       private static RelationshipIndex index =
gr.index().forRelationships("myindex");

       public IndexedRelationshipExpander(Direction direction, RelType
relType) {
           this.direction = direction;
           this.relType = relType;
       }

       @Override
       public Iterable<Relationship> expand(Node node) {
           if (direction==Direction.OUTGOING)
               return index.get("type", relType.name(), node, null);
           else
               return index.get("type", relType.name(), null, node);
       }

       public static Relationship index(Relationship relationship) {
           index.add(relationship,"type",relationship.getType().name());
           return relationship;
       }

       @Override
       public RelationshipExpander reversed() {
          return new IndexedRelationshipExpander(direction.reverse(),
relType);
       }
   }


   static void traverseAndPrintIndexedTraversal(Node mainPartNode) {
       TraversalDescription td = Traversal.description().expand(new
IndexedRelationshipExpander(Direction.OUTGOING,
RelType.HAS_PART)).depthFirst();
       long startMillis = System.currentTimeMillis();
       int totalCount = 0;
       for (Node part : td.traverse(mainPartNode).nodes()) {
           totalCount++;
           System.out.println(part.getProperty("name"));
       }
       long delta = (System.currentTimeMillis() - startMillis);
       System.out.printf("traversing %d relationships took %d ms. %.2f
rels/ms%n", totalCount, delta, (float)(delta>0 ? totalCount / delta :
totalCount));
   }

   private static Node createPart(Node part, int nr) {
       Node newPart = gr.createNode();
       final Relationship rel = part.createRelationshipTo(newPart,
RelType.HAS_PART);
       newPart.setProperty("name", "Part" + nr);
       IndexedRelationshipExpander.index(rel);
       return newPart;
   }


   static void createNodeData() {
       Transaction tx = gr.beginTx();
       try {
           mainPart = gr.createNode();
           mainPart.setProperty("name", "MainPart");
           gr.getReferenceNode().createRelationshipTo(mainPart,
RelType.HAS_PART);
           List<Node> nodes = new ArrayList<Node>(10);
           nodes.add(createPart(mainPart, 1));
           Node part2 = createPart(mainPart,2);
           nodes.add(part2);

 
nodes.addAll(asList(createPart(part2,3),createPart(part2,4),createPart(part2,5),createPart(part2,6)));

           int count = 0;
           for (int i = 1; i <= containerCount; i++) {
               for (Node part : nodes) {
               Node container = gr.createNode();
               container.setProperty("name", "Container-" + i);
               relateToContainer(container, part);

               // inform us every 1000 nodes
               count++;
               if (count >= 10000) {
                   tx.success();
                   tx.finish();
                   tx = gr.beginTx();
                   System.out.println("Written " + i + " nodes...");
                   count = 0;
               }
           }}

           tx.success();
           System.out.println("Nodes created successfully!");
       } finally {
           tx.finish();
       }
   }


-------------- next part --------------


Am 30.06.2011 um 14:47 schrieb Niels Hoogeveen:

>
> Michael,
> Is it possible/feasible to change the store-format so relationships are
stored partitioned per relationshiptype per direction? What would be the
impact of such change?
> Niels
>
>> From: michael.hun...@neotechnology.com
>> Date: Thu, 30 Jun 2011 14:10:38 +0200
>> To: user@lists.neo4j.org
>> Subject: Re: [Neo4j] traversing densely populated nodes
>>
>> The issue being that relationships on disk are not ordered, so that, even
when just accessing the few relationships of the one
>> type you still have to scan all rels.
>>
>> For supporting different approaches you either have to change the
store-format to handle the storage and loading of relationships of
supernodes differently.
>>
>> Or:
>> - if you'd rather want to avoid supernodes you can have an Expander in
your traversers that delays the traversals of supernodes (perhaps
>> the other nodes give you already enough results for the domain to
consume). (You could also limit/timebox that).
>> - You can also have Expanders that use an B-Tree or Relationship index
internally (for the "few"-rels). With autoindexing you could actually put a
property on those few "rels" that makes them automatically indexed. And then
in the Expander use
>> relIndex.get(startNode,null,null) to get all of them for expansion.
>>
>> It can be useful to get something like this implemented once and then
offered to all users either as part of the product or perhaps in a component
like neo4j-collections ? (these are just ideas)
>>
>> Cheers
>>
>> Michael
>>
>> Am 30.06.2011 um 13:57 schrieb Craig Taverner:
>>
>>> This topics has come up before, and the domain level solutions are
usually
>>> very similar, like Norbert's category/proxy nodes (to group by
>>> type/direction) and Niels' TimeLineIndex (BTree). I wonder whether we
can
>>> build a generic user-level solution that can also be wrapped to appear
as an
>>> internal database solution?
>>>
>>> For example, consider Niels's solution of the TimeLine index. In this
case
>>> we group all the nodes based on a consistent hash. Usually the timeline
>>> would use a timestamp, but really any reasonably variable property can
do,
>>> even the node-id itself. Then we have a BTree between the dense nodes
and
>>> the root node (node with too many relationships). How about this crazy
idea,
>>> create an API that mimics the normal node.getRelationship*() API, but
>>> internally traverses the entire tree? And also for creating the
>>> relationships? So for most cod we just do the usual
>>> node.createRelationshipTo(node,type,direction) and node.traverse(...),
but
>>> internally we actually traverse the b-tree.
>>>
>>> This would solve the performance bottleneck being observed while keeping
the
>>> 'illusion' of directly connected relationships. The solution would be
>>> implemented mostly in the application space, so will not need any
changes to
>>> the core database. I see this as being of the same kind of solution as
the
>>> auto-indexing. We setup some initial configuration that results in
certain
>>> structures being created on demand. With auto-indexing we are talking
about
>>> mostly automatically adding lucene indexes. With this idea we are
talking
>>> about automatically replacing direct relationships with b-trees to
resolve a
>>> specific performance issue.
>>>
>>> And when the relationship density is very low, if the b-tree is
>>> auto-balancing, it could just be a direct relationship anyway.
>>>
>>> On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas
>>> <agelos.pikou...@gmail.com>wrote:
>>>
>>>> My problem pattern is exactly the same as Niels's :
>>>>
>>>> A dense-node has millions of relations of a certain direction & type,
>>>> and only a few (sparse) relations of a different direction and type.
>>>> The traversing is usually following only those sparse relationships on
>>>> those
>>>> dense-nodes.
>>>>
>>>> Now, even when traversing on these sparse relations, neo4j becomes
>>>> extremely
>>>> slow
>>>> on a certainly non linear Order (the big cs O).
>>>>
>>>> Some tests I run (email me if u want the code) reveal that even the
number
>>>> of those dense-nodes in the database greatly influences the results.
>>>>
>>>> I just reported to Michael the runs with the latest M05 snapshot, which
are
>>>> not very positive...
>>>> I have suggested an (auto) indexing of relationship types / direction
that
>>>> is used by traversing frameworks,
>>>> but I ain't no graphdb-engine expert :-(
>>>>
>>>> A'
>>>>
>>>>
>>>> Message: 5
>>>>> Date: Wed, 29 Jun 2011 18:19:10 +0200
>>>>> From: Niels Hoogeveen <pd_aficion...@hotmail.com>
>>>>> Subject: Re: [Neo4j] traversing densely populated nodes
>>>>> To: <user@lists.neo4j.org>
>>>>> Message-ID: <col110-w326b152552b8f7fbe1312d8b...@phx.gbl>
>>>>> Content-Type: text/plain; charset="iso-8859-1"
>>>>>
>>>>>
>>>>> Michael,
>>>>>
>>>>>
>>>>>
>>>>> The issue I am refering to does not pertain to traversing many
relations
>>>> at
>>>>> once
>>>>>
>>>>> but the impact many relationship of one type have on relationships
>>>>>
>>>>> of another type on the same node.
>>>>>
>>>>>
>>>>>
>>>>> Example:
>>>>>
>>>>>
>>>>>
>>>>> A topic class has 2 million outgoing relationships of type
"HAS_INSTANCE"
>>>>> and
>>>>>
>>>>> has 3 outgoing relationships of type "SUB_CLASS_OF".
>>>>>
>>>>>
>>>>>
>>>>> Fetching the 3 relations of type "SUB_CLASS_OF" takes very long,
>>>>>
>>>>> I presume due to the presence of the 2 million other relationships.
>>>>>
>>>>>
>>>>>
>>>>> I have no need to ever fetch the "HAS_INSTANCE" relationships from
>>>>>
>>>>> the topic node. That relation is always traversed from the other
>>>> direction.
>>>>>
>>>>>
>>>>>
>>>>> I do want to know the class of a topic instance, leading to he topic
>>>> class,
>>>>>
>>>>> but have no real interest ever to traverse all topic instance from
 the
>>>>> topic
>>>>>
>>>>> class (at least not directly.. i do want to know the most recent
>>>> addition,
>>>>>
>>>>> and that's what I use the timeline index for).
>>>>>
>>>>>
>>>>>
>>>>> Niels
>>>>>
>>>>>
>>>>>> From: michael.hun...@neotechnology.com
>>>>>> Date: Wed, 29 Jun 2011 17:50:08 +0200
>>>>>> To: user@lists.neo4j.org
>>>>>> Subject: Re: [Neo4j] traversing densely populated nodes
>>>>>>
>>>>>> I think this is the same problem that Angelos is facing, we are
>>>> currently
>>>>> evaluating options to improve the performance on those highly
connected
>>>>> supernodes.
>>>>>>
>>>>>> A traditional option is really to split them into group or even kind
of
>>>>> shard their relationships to a second layer.
>>>>>>
>>>>>> We're looking into storage improvement options as well as
modifications
>>>>> to retrieval of that many relationships at once.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Michael
>>>>>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Reply via email to