Re: [Neo4j] Performance issue on nodes with lots of relationships

Niels Hoogeveen Thu, 07 Jul 2011 16:27:59 -0700

I did a write-up on indexed relationships in the Git repo:
https://github.com/peterneubauer/graph-collections/wiki/Indexed-relationships

A performance comparison would indeed be great. Anecdotally, I have seen the
difference when trying to load all entries of DBpedia. With 2.5 GB of heap
space, loading becomes problematic after some 70,000 relationships have been
added to the supernode. With indexed relationships no such problems arise,
and 1.6 million relationships are created easily without performance
degradation.

Having real performance figures would be nice, though.
Niels
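
To illustrate the idea described above, here is a minimal in-memory sketch in Python. It is not the graph-collections implementation; it only assumes that the supernode points at a small number of sorted buckets rather than holding every relationship directly. The class name `BucketedRelationships`, the `bucket_size` parameter, and all method names are hypothetical.

```python
import bisect
import random


class BucketedRelationships:
    """Toy model of a supernode that links to a few sorted buckets
    instead of holding every relationship itself (hypothetical names)."""

    def __init__(self, bucket_size=100):
        self.bucket_size = bucket_size  # assumed upper bound per bucket
        self.buckets = [[]]             # ordered list of sorted key lists

    def add(self, key):
        # Find the bucket whose key range covers `key` by comparing
        # against the first key of each bucket (linear scan, fine for a toy).
        firsts = [b[0] for b in self.buckets if b]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        bucket = self.buckets[i]
        bisect.insort(bucket, key)
        # Split an over-full bucket so that no bucket grows without bound.
        if len(bucket) > self.bucket_size:
            mid = len(bucket) // 2
            self.buckets[i:i + 1] = [bucket[:mid], bucket[mid:]]

    def __iter__(self):
        # Ascending traversal across all buckets.
        for bucket in self.buckets:
            yield from bucket

    def newest(self, n):
        # Return the n largest keys, largest first, touching only the
        # buckets at the end of the ordering.
        out = []
        if n <= 0:
            return out
        for bucket in reversed(self.buckets):
            for key in reversed(bucket):
                out.append(key)
                if len(out) == n:
                    return out
        return out

    def direct_degree(self):
        # Number of links the supernode itself holds in this model.
        return len(self.buckets)


# Insert 10,000 keys (think: timestamps) in random order.
keys = list(range(10_000))
random.Random(0).shuffle(keys)
index = BucketedRelationships(bucket_size=100)
for k in keys:
    index.add(k)

print(index.direct_degree())  # far fewer direct links than 10,000
print(index.newest(3))        # [9999, 9998, 9997]
```

In this model the supernode's own degree grows with the number of buckets rather than the number of entries, and the most recent additions can be read without walking everything else.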


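Pending real benchmark figures, the Cypher timings quoted further down in this thread can at least be turned into throughput numbers. The following Python sketch only redoes that arithmetic on the counts and millisecond values reported there; the variable and function names are made up for illustration.

```python
# Counts and timings copied from the neo4j-sh output quoted in this thread.
runs = {
    "node 1": {"relationships": 1_468_486, "ms": [25_724, 19_763]},
    "node 2": {"relationships": 3_472_174, "ms": [565_448, 337_975]},
}


def per_second(count, ms):
    """Relationships counted per second for one query run."""
    return count / (ms / 1000.0)


for name, run in runs.items():
    for label, ms in zip(("first run", "second run"), run["ms"]):
        rate = per_second(run["relationships"], ms)
        print(f"{name}, {label}: {rate:,.0f} relationships/s")

# Compare the two nodes using the second (warmer) run of each.
size_ratio = runs["node 2"]["relationships"] / runs["node 1"]["relationships"]
time_ratio = runs["node 2"]["ms"][1] / runs["node 1"]["ms"][1]
print(f"size ratio: {size_ratio:.1f}x, time ratio: {time_ratio:.1f}x")
```

The ratios come out at roughly 2.4x the data for roughly 17x the time, which matches the figures Andrew mentions.
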
> From: michael.hun...@neotechnology.com
> Date: Thu, 7 Jul 2011 22:56:17 +0200
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Performance issue on nodes with lots of relationships
> 
> Niels, could you perhaps write up a blog post detailing the usage (also for
> your own scenario) and how that solution compares to naive supernodes
> with millions of relationships?
> 
> Also, I'd like to see a performance comparison of both approaches.
> 
> Thanks so much for your work
> 
> Michael
> 
> On 07.07.2011 at 22:24, Niels Hoogeveen wrote:
> 
> > 
> > I am glad to see a solution will be provided at the core level. 
> > Today, I pushed IndexedRelationships and IndexedRelationshipExpander to 
> > Git, see: 
> > https://github.com/peterneubauer/graph-collections/tree/master/src/main/java/org/neo4j/collections/indexedrelationship
> > This provides a solution to the issue, but is certainly not as fast as a 
> > solution in core would be. 
> > However, it does solve my issues and, as a bonus, indexed relationships can
> > be traversed in sorted order. This is especially pleasant, since I usually
> > want to know only the recent additions of dense relationships.
> > Niels
> > 
> > 
> >> Date: Thu, 7 Jul 2011 21:37:26 +0200
> >> From: matt...@neotechnology.com
> >> To: user@lists.neo4j.org
> >> Subject: Re: [Neo4j] Performance issue on nodes with lots of relationships
> >> 
> >> 2011/7/7 Agelos Pikoulas <agelos.pikou...@gmail.com>
> >> 
> >>> I think it's the same problem pattern that has been in discussion lately with
> >>> dense nodes or supernodes (check
> >>> http://lists.neo4j.org/pipermail/user/2011-July/009832.html).
> >>> 
> >>> Michael Hunger has provided a quick solution to visiting the *few*
> >>> RelationshipTypes on a node that has *millions* of others, utilizing a
> >>> RelationshipExpander with an Index (check
> >>> http://paste.pocoo.org/show/traM5oY1ng7dRQAaf1oV/)
> >>> 
> >>> Ideally this would be abstracted & implemented in the core distribution so
> >>> that all APIs (including Cypher & TinkerPop Pipes/Gremlin) can use it
> >>> efficiently...
> >>> 
> >> 
> >> Yes, I'm positive that something will be done on a core level to make
> >> getting relationships of a specific type fast, regardless of the total
> >> number of relationships. Hopefully in the foreseeable future.
> >> 
> >>> 
> >>> Agelos
> >>> 
> >>> On Thu, Jul 7, 2011 at 3:16 PM, Andrew White <li...@andrewewhite.net>
> >>> wrote:
> >>> 
> >>>> I use the shell as-is, but the messages.log is reporting...
> >>>> 
> >>>>    Physical mem: 3962MB, Heap size: 881MB
> >>>> 
> >>>> My point is that if you ignore caching altogether, why did one run take
> >>>> 17x longer with only 2.4x more data? Considering this is a rather
> >>>> iterative algorithm, I don't see why you would even read a node or
> >>>> relationship more than once and thus a cache shouldn't matter at all.
> >>>> 
> >>>> In this particular case, I can't imagine taking 9+ minutes to read a
> >>>> mere 3.4M nodes (that's only 6k nodes per sec). Perhaps this is just an
> >>>> artifact of Cypher, in which it builds a set of Rs before applying
> >>>> `count` rather than making count accept an iterable stream.
> >>>> 
> >>>> Andrew
> >>>> 
> >>>> On 07/06/2011 11:33 PM, David Montag wrote:
> >>>>> Hi Andrew,
> >>>>> 
> >>>>> How big is your configured Java heap? It could be that all the nodes
> >>>>> and relationships don't fit into the cache.
> >>>>> 
> >>>>> David
> >>>>> 
> >>>>> On Wed, Jul 6, 2011 at 8:03 PM, Andrew White <li...@andrewewhite.net> wrote:
> >>>>> 
> >>>>>> Here are some interesting stats to consider. First, I split my nodes
> >>>>>> into two groups, one node with 1.4M children and the other with 3.4M
> >>>>>> children. While I do see some cache warm-up improvements, the
> >>>>>> traversal doesn't seem to scale linearly; i.e., the larger super-node
> >>>>>> has 2.4x more children but takes 17x longer to traverse.
> >>>>>> 
> >>>>>> neo4j-sh (0)$ start n=(1) match (n)-[r]-(x) return count(r)
> >>>>>> +----------+
> >>>>>> | count(r) |
> >>>>>> +----------+
> >>>>>> | 1468486  |
> >>>>>> +----------+
> >>>>>> 1 rows, 25724 ms
> >>>>>> neo4j-sh (0)$ start n=(1) match (n)-[r]-(x) return count(r)
> >>>>>> +----------+
> >>>>>> | count(r) |
> >>>>>> +----------+
> >>>>>> | 1468486  |
> >>>>>> +----------+
> >>>>>> 1 rows, 19763 ms
> >>>>>> 
> >>>>>> neo4j-sh (0)$ start n=(2) match (n)-[r]-(x) return count(r)
> >>>>>> +----------+
> >>>>>> | count(r) |
> >>>>>> +----------+
> >>>>>> | 3472174  |
> >>>>>> +----------+
> >>>>>> 1 rows, 565448 ms
> >>>>>> neo4j-sh (0)$ start n=(2) match (n)-[r]-(x) return count(r)
> >>>>>> +----------+
> >>>>>> | count(r) |
> >>>>>> +----------+
> >>>>>> | 3472174  |
> >>>>>> +----------+
> >>>>>> 1 rows, 337975 ms
> >>>>>> 
> >>>>>> Any ideas on this?
> >>>>>> Andrew
> >>>>>> 
> >>>>>> On 07/06/2011 09:55 AM, Peter Neubauer wrote:
> >>>>>>> Andrew,
> >>>>>>> if you upgrade to 1.4.M06, your shell should be able to do Cypher in
> >>>>>>> order to count the relationships of a node, not returning them:
> >>>>>>> 
> >>>>>>> start n=(1) match (n)-[r]-(x) return count(r)
> >>>>>>> 
> >>>>>>> and try that several times to see if cold caches are initially
> >>>>>>> slowing things down.
> >>>>>>> 
> >>>>>>> or something along these lines. In the LS and Neoclipse the output
> >>>>>>> and visualization will be slow for that amount of data.
> >>>>>>> 
> >>>>>>> Cheers,
> >>>>>>> 
> >>>>>>> /peter neubauer
> >>>>>>> 
> >>>>>>> 
> >>>>>>> On Wed, Jul 6, 2011 at 4:15 PM, Andrew White <li...@andrewewhite.net> wrote:
> >>>>>>>> I have a graph with roughly 10M nodes. Some of these nodes are highly
> >>>>>>>> connected to other nodes. For example, I may have a single node with
> >>>>>>>> 1M+ relationships. A good analogy is a population that has a "lives-in"
> >>>>>>>> relationship to a state. Now the problem...
> >>>>>>>> 
> >>>>>>>> Both Neoclipse and neo4j-shell are terribly slow when working with
> >>>>>>>> these nodes. In the shell I would expect a `cd <node-id>` to be very
> >>>>>>>> fast, much like selecting via a rowid in a standard DB. Instead, I
> >>>>>>>> usually see a delay of several seconds. Doing an `ls` takes so long
> >>>>>>>> that I usually have to just kill the process. In fact, `ls` never
> >>>>>>>> outputs anything, which is odd since I would expect it to "stream" the
> >>>>>>>> output as it found it. I have very similar performance issues with
> >>>>>>>> Neoclipse.
> >>>>>>>> 
> >>>>>>>> I am using Neo4j 1.3 embedded on Ubuntu 10.04 with 4GB of RAM.
> >>>>>>>> Disclaimer: I am new to Neo4j.
> >>>>>>>> 
> >>>>>>>> Thanks,
> >>>>>>>> Andrew
> >>>>>>>> _______________________________________________
> >>>>>>>> Neo4j mailing list
> >>>>>>>> User@lists.neo4j.org
> >>>>>>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>>>>> 
> >> 
> >> 
> >> 
> >> -- 
> >> Mattias Persson, [matt...@neotechnology.com]
> >> Hacker, Neo Technology
> >> www.neotechnology.com
                                          
