Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Bill Janssen
Note that a couple of memory issues are fixed in Lucene 2.9.3: a leak
when indexing big docs, and sluggish reclamation of space from the
FieldCache.

Bill

Arijit Mukherjee  wrote:

> I've a similar problem. Although I'm not going out of memory yet, I can see
> the heap constantly growing, and JProfiler says most of it is due to the
> Lucene indexing. And even if I do the commit after every X transactions,
> once the population is finished, the final commit is done, and the graph db
> closed - the heap stays like that - almost full. An explicit gc will clean
> up some part, but not fully.
> 
> Arijit
> 
> On 9 July 2010 17:00, Mattias Persson  wrote:
> 
> > 2010/7/9 Marko Rodriguez 
> >
> > > Hi,
> > >
> > > > Would it actually be worth something to be able to begin a transaction
> > > which
> > > > auto-commits stuff every X write operation, like a batch inserter mode
> > > > which can be used in normal EmbeddedGraphDatabase? Kind of like:
> > > >
> > > >graphDb.beginTx( Mode.BATCH_INSERT )
> > > >
> > > > ...so that you can start such a transaction and then just insert data
> > > > without having to care about restarting it now and then?
> > >
> > > That's cool! Does that already exist? In my code (like others on the list
> > it
> > > seems) I have a counter++ that every 20,000 inserts (some made-up number
> > > that is not going to throw an OutOfMemory) commits and then reopens a new
> > > transaction. Sorta sux.
> > >
> >
> > No it doesn't, I just wrote something I thought someone might find useful.
> > A cool thing with just telling it to do a batch insert mode
> > transaction (not the actual commit interval) is that it could look at how
> > much memory it had to play around with and commit whenever it would be the
> > most efficient, even having the ability to change the limit on the fly if
> > the memory suddenly ran out.
> >
> >
> > > Thanks,
> > > Marko.
> > >
> > > ___
> > > Neo4j mailing list
> > > User@lists.neo4j.org
> > > https://lists.neo4j.org/mailman/listinfo/user
> > >
> >
> >
> >
> > --
> > Mattias Persson, [matt...@neotechnology.com]
> > Hacker, Neo Technology
> > www.neotechnology.com
> >
> 
> 
> 
> -- 
> "And when the night is cloudy,
> There is still a light that shines on me,
> Shine on until tomorrow, let it be."


Re: [Neo4j] How to traverse by the number of relationships between nodes?

2010-07-09 Thread Max De Marzi Jr.
Can you expand on this a bit... as to what the graph internals are doing:

Option 1:
You have "colored" relationship types (RED, BLUE, GREEN, etc., up to 10k colors).
From a random node, you traverse the graph finding all nodes that it is
connected to via the PURPLE or FUCHSIA relationship.

vs

Option 2:
You have a COLOR relationship with a name property that contains the actual
color name.
From a random node, you traverse the graph finding all nodes that it is
connected to via a COLOR relationship whose name property is "PURPLE" or
"FUCHSIA".


For some reason I thought it was more expensive (in terms of traversal time)
to look up a property on a relationship than to simply pass a named
relationship type.


On Fri, Jul 9, 2010 at 8:45 AM, Johan Svensson wrote:

> Hi,
>
> I would not recommend using large numbers of different (dynamically
> created) relationship types. It is better to use well-defined
> relationship types with an additional property on the relationship
> whenever needed. The limit is actually not 64k but 2^31, but having
> large numbers of relationship types like 10k-100k+ will reduce
> performance and consume a lot of memory.
>
> Regards,
> Johan
>
> On Thu, Jul 8, 2010 at 4:13 PM, Max De Marzi Jr. 
> wrote:
> > Can somebody verify the max number of relationship types? If it is 64k,
> is
> > there a way to increase it without significant effort?
> >
> >
> >>  I believe you can have something like 64k
> >> relationship types, so using the relationship type for the route name is
> >> possible.
>


Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Paul A. Jackson
I confess I had not investigated the batch inserter.  From the description it 
fits my requirements exactly.

With respect to auto-commits, it seems there are two use cases.  The first is 
everyday operations that might run out of memory.  In this case it might be 
nice for neo4j to swap out memory to temporary disk as needed.  If this 
performs acceptably, I think that should be default behavior.  The second case 
is the initial population of a graph, where there is no need for roll back and 
so there is no need to commit to a temporary location.  In this case, it seems 
having neo4j decide when to commit would be ideal.

My concern with the first use case is that swapping to temporary storage at 
ideal intervals may be less efficient than having the user commit to permanent 
storage at less-than-ideal intervals.  If that is the case, then the only real 
justification for committing to temporary storage would be if there was a 
requirement to potentially roll back a transaction that was larger than memory 
could support.

-Paul


-Original Message-
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On 
Behalf Of Mattias Persson
Sent: Friday, July 09, 2010 7:30 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] OutOfMemory while populating large graph

2010/7/9 Marko Rodriguez 

> Hi,
>
> > Would it actually be worth something to be able to begin a transaction
> which
> > auto-commits stuff every X write operation, like a batch inserter mode
> > which can be used in normal EmbeddedGraphDatabase? Kind of like:
> >
> >graphDb.beginTx( Mode.BATCH_INSERT )
> >
> > ...so that you can start such a transaction and then just insert data
> > without having to care about restarting it now and then?
>
> That's cool! Does that already exist? In my code (like others on the list it
> seems) I have a counter++ that every 20,000 inserts (some made-up number
> that is not going to throw an OutOfMemory) commits and then reopens a new
> transaction. Sorta sux.
>

No it doesn't, I just wrote something I thought someone might find useful.
A cool thing with just telling it to do a batch insert mode
transaction (not the actual commit interval) is that it could look at how
much memory it had to play around with and commit whenever it would be the
most efficient, even having the ability to change the limit on the fly if
the memory suddenly ran out.


> Thanks,
> Marko.
>
>



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com



Re: [Neo4j] How to traverse by the number of relationships between nodes?

2010-07-09 Thread Johan Svensson
Hi,

I would not recommend using large numbers of different (dynamically
created) relationship types. It is better to use well-defined
relationship types with an additional property on the relationship
whenever needed. The limit is actually not 64k but 2^31, but having
large numbers of relationship types like 10k-100k+ will reduce
performance and consume a lot of memory.

Regards,
Johan

On Thu, Jul 8, 2010 at 4:13 PM, Max De Marzi Jr.  wrote:
> Can somebody verify the max number of relationship types? If it is 64k, is
> there a way to increase it without significant effort?
>
>
>>  I believe you can have something like 64k
>> relationship types, so using the relationship type for the route name is
>> possible.


Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Rick Bullotta
Short answer is "maybe". ;-)

There are some cases where the transaction is an "all or nothing" scenario,
others where incremental commits are OK.  Having the ability to do
incremental autocommits would be useful, however.  In a perfect world, it
could be based on a "bucket" (e.g. every XXX transactions), a time interval
(e.g. every 30 seconds), or a memory-usage rule.

-Original Message-
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Mattias Persson
Sent: Friday, July 09, 2010 7:30 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] OutOfMemory while populating large graph

2010/7/9 Marko Rodriguez 

> Hi,
>
> > Would it actually be worth something to be able to begin a transaction
> which
> > auto-commits stuff every X write operation, like a batch inserter mode
> > which can be used in normal EmbeddedGraphDatabase? Kind of like:
> >
> >graphDb.beginTx( Mode.BATCH_INSERT )
> >
> > ...so that you can start such a transaction and then just insert data
> > without having to care about restarting it now and then?
>
> That's cool! Does that already exist? In my code (like others on the list
it
> seems) I have a counter++ that every 20,000 inserts (some made-up number
> that is not going to throw an OutOfMemory) commits and then reopens a new
> transaction. Sorta sux.
>

No it doesn't, I just wrote something I thought someone might find useful.
A cool thing with just telling it to do a batch insert mode
transaction (not the actual commit interval) is that it could look at how
much memory it had to play around with and commit whenever it would be the
most efficient, even having the ability to change the limit on the fly if
the memory suddenly ran out.


> Thanks,
> Marko.
>
>



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com



Re: [Neo4j] Is it possible to count common nodes when traversing?

2010-07-09 Thread Mattias Persson
Sorry, it should be:

for ( Node currentNode : Traversal.description()
      .breadthFirst().uniqueness(Uniqueness.RELATIONSHIP_GLOBAL)
      .relationships(MyRelationships.SIMILAR)
      .relationships(MyRelationships.CATEGORY)
      .prune(Traversal.pruneAfterDepth(2)).traverse(node) ) {


2010/7/9 Mattias Persson 

> Just to notify you guys on this... as of now (r4717) the
> TraversalFactory class is named Traversal instead, so code would look like:
>
>   for ( Node currentNode : TraversalFactory.description()
>
>   .breadthFirst().uniqueness(Uniqueness.RELATIONSHIP_GLOBAL)
>   .relationships(MyRelationships.SIMILAR)
>   .relationships(MyRelationships.CATEGORY)
>   .prune(TraversalFactory.pruneAfterDepth(2)).traverse(node) ) {
>
>
> 2010/7/8 Mattias Persson 
>
>  Your problem is that a node can't be visited more than once in a
>> traversal, right? Have you looked at the new traversal framework in
>> 1.1-SNAPSHOT? It solves that problem in that you can specify uniqueness for
>> the traverser... you can instead say that each Relationship can't be visited
>> more than once, but Nodes can. Your example:
>>
>>
>>   Map<Node, Integer> result = new HashMap<Node, Integer>();
>>   for ( Node currentNode : TraversalFactory.createTraversalDescription()
>>   .breadthFirst().uniqueness(Uniqueness.RELATIONSHIP_GLOBAL)
>>   .relationships(MyRelationships.SIMILAR)
>>   .relationships(MyRelationships.CATEGORY)
>>   .prune(TraversalFactory.pruneAfterDepth(2)).traverse(node) ) {
>>
>>   if(currentNode.hasProperty("category")) {
>>
>>   if(result.get(currentNode) == null) {
>>   result.put(currentNode, 1);
>>   } else {
>>   result.put(currentNode, result.get(currentNode) + 1);
>>   }
>>   }
>>   }
>>
>> 2010/7/8 Rick Bullotta 
>>
>> A performance improvement might be achieved by minimizing object
>>> creation/hash inserts using a "counter" wrapper.
>>>
>>> - Create a simple class "Counter" with a single public property "count"
>>> of
>>> type int (not Integer) with an initial value of 1
>>>
>>> - Tweak your code to something like:
>>>
>>>    public Map<String, Counter> findCategoriesForWord(String word) {
>>>        final Node node = index.getSingleNode("word", word);
>>>        final Map<String, Counter> result = new HashMap<String, Counter>();
>>>        if(node != null) {
>>>            Traverser traverserWords =
>>>                node.traverse(Traverser.Order.BREADTH_FIRST,
>>>                    StopEvaluator.DEPTH_ONE, new ReturnableEvaluator() {
>>>                @Override
>>>                public boolean isReturnableNode(TraversalPosition
>>>                        traversalPosition) {
>>>                    final Node currentNode =
>>>                        traversalPosition.currentNode();
>>>                    final Iterator<Relationship> relationshipIterator =
>>>                        currentNode.getRelationships(MyRelationships.CATEGORY).iterator();
>>>                    while(relationshipIterator.hasNext()) {
>>>                        final Relationship relationship =
>>>                            relationshipIterator.next();
>>>                        final String categoryName = (String)
>>>                            relationship.getProperty("catId");
>>>                        Counter counter = result.get(categoryName);
>>>                        if(counter == null) {
>>>                            result.put(categoryName, new Counter());
>>>                        } else {
>>>                            ++counter.count;
>>>                        }
>>>                    }
>>>                    return true;
>>>                }
>>>            }, MyRelationships.SIMILAR, Direction.BOTH);
>>>            traverserWords.getAllNodes();
>>>        }
>>>        return result;
>>>    }
>>>
>>>
>>> -Original Message-
>>> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org]
>>> On
>>> Behalf Of Java Programmer
>>> Sent: Thursday, July 08, 2010 8:12 AM
>>> To: Neo4j user discussions
>>> Subject: Re: [Neo4j] Is it possible to count common nodes when
>>> traversing?
>>>
>>> Hi,
>>> Thanks for your answer but it's not exactly what I was on my mind -
>>> word can belong to several categories, and different words can share
>>> same category e.g.:
>>>
>>> "word 1" : "category 1", "category 2", "category 3"
>>> "word 2" : "category 2", "category 3"
>>> "word 3" : "category 3"
>>>
>>> there is relation between "word 1" and "word 2" and between "word 2"
>>> and "word 3" (SIMILAR).
>>> As a result when querying for "word 1" with depth 1, I would like to get:
>>> "category 1" -> 1 (result), "category 2" -> 2, "category 3" -> 2 (not
>>> 3 because it's out of depth)
>>>
>>> So far I have changed the previous method to use the relationship with
>>> a property of categoryId, but I don't know whether there will be
>>> performance issues (I iterate over all relationships of the found node
>>> (every similar) and store the categories in a Map). If you could look
>>> at it and tell me if the way of thinking is good
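The per-path counting this thread is after hinges on relationship-level rather than node-level uniqueness: a category node reachable through two different SIMILAR words should count twice. A plain-Java sketch of that idea follows; the integer-array graph encoding is invented for illustration, and it omits the depth-2 pruning for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the idea behind Uniqueness.RELATIONSHIP_GLOBAL: walk
// breadth-first, never crossing the same edge twice, but allowing a node
// to be reached via several edges so it is counted once per reaching path.
public class EdgeUniqueBfs {
    /** edges: each int[]{a, b} is an undirected edge between node ids a and b. */
    public static Map<Integer, Integer> visitCounts(int[][] edges, int start) {
        Map<Integer, List<int[]>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e);
        }
        Map<Integer, Integer> counts = new HashMap<>();
        Set<int[]> seenEdges = new HashSet<>();   // edge-level uniqueness
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            for (int[] e : adj.getOrDefault(node, Collections.emptyList())) {
                if (!seenEdges.add(e)) continue;  // each edge crossed once
                int other = e[0] == node ? e[1] : e[0];
                counts.merge(other, 1, Integer::sum);  // counted per path
                queue.add(other);
            }
        }
        return counts;
    }
}
```

In a diamond graph 1-2, 1-3, 2-4, 3-4, node 4 is reached along two edge-distinct paths from node 1 and so gets count 2, which is exactly the behavior the "word 1" / "category 3" example needs.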

Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Arijit Mukherjee
I've a similar problem. Although I'm not going out of memory yet, I can see
the heap constantly growing, and JProfiler says most of it is due to the
Lucene indexing. And even if I do the commit after every X transactions,
once the population is finished, the final commit is done, and the graph db
closed - the heap stays like that - almost full. An explicit gc will clean
up some part, but not fully.

Arijit

On 9 July 2010 17:00, Mattias Persson  wrote:

> 2010/7/9 Marko Rodriguez 
>
> > Hi,
> >
> > > Would it actually be worth something to be able to begin a transaction
> > which
> > > auto-commits stuff every X write operation, like a batch inserter mode
> > > which can be used in normal EmbeddedGraphDatabase? Kind of like:
> > >
> > >graphDb.beginTx( Mode.BATCH_INSERT )
> > >
> > > ...so that you can start such a transaction and then just insert data
> > > without having to care about restarting it now and then?
> >
> > That's cool! Does that already exist? In my code (like others on the list
> it
> > seems) I have a counter++ that every 20,000 inserts (some made-up number
> > that is not going to throw an OutOfMemory) commits and then reopens a new
> > transaction. Sorta sux.
> >
>
> No it doesn't, I just wrote something I thought someone might find useful.
> A cool thing with just telling it to do a batch insert mode
> transaction (not the actual commit interval) is that it could look at how
> much memory it had to play around with and commit whenever it would be the
> most efficient, even having the ability to change the limit on the fly if
> the memory suddenly ran out.
>
>
> > Thanks,
> > Marko.
> >
> >
>
>
>
> --
> Mattias Persson, [matt...@neotechnology.com]
> Hacker, Neo Technology
> www.neotechnology.com
>



-- 
"And when the night is cloudy,
There is still a light that shines on me,
Shine on until tomorrow, let it be."


Re: [Neo4j] Is it possible to count common nodes when traversing?

2010-07-09 Thread Mattias Persson
Just to notify you guys on this... as of now (r4717) the
TraversalFactory class is named Traversal instead, so code would look like:

  for ( Node currentNode : TraversalFactory.description()
  .breadthFirst().uniqueness(Uniqueness.RELATIONSHIP_GLOBAL)
  .relationships(MyRelationships.SIMILAR)
  .relationships(MyRelationships.CATEGORY)
  .prune(TraversalFactory.pruneAfterDepth(2)).traverse(node) ) {


2010/7/8 Mattias Persson 

> Your problem is that a node can't be visited more than once in a traversal,
> right? Have you looked at the new traversal framework in 1.1-SNAPSHOT? It
> solves that problem in that you can specify uniqueness for the traverser...
> you can instead say that each Relationship can't be visited more than once,
> but Nodes can. Your example:
>
>
>   Map<Node, Integer> result = new HashMap<Node, Integer>();
>   for ( Node currentNode : TraversalFactory.createTraversalDescription()
>   .breadthFirst().uniqueness(Uniqueness.RELATIONSHIP_GLOBAL)
>   .relationships(MyRelationships.SIMILAR)
>   .relationships(MyRelationships.CATEGORY)
>   .prune(TraversalFactory.pruneAfterDepth(2)).traverse(node) ) {
>
>   if(currentNode.hasProperty("category")) {
>
>   if(result.get(currentNode) == null) {
>   result.put(currentNode, 1);
>   } else {
>   result.put(currentNode, result.get(currentNode) + 1);
>   }
>   }
>   }
>
> 2010/7/8 Rick Bullotta 
>
> A performance improvement might be achieved by minimizing object
>> creation/hash inserts using a "counter" wrapper.
>>
>> - Create a simple class "Counter" with a single public property "count" of
>> type int (not Integer) with an initial value of 1
>>
>> - Tweak your code to something like:
>>
>>    public Map<String, Counter> findCategoriesForWord(String word) {
>>        final Node node = index.getSingleNode("word", word);
>>        final Map<String, Counter> result = new HashMap<String, Counter>();
>>        if(node != null) {
>>            Traverser traverserWords =
>>                node.traverse(Traverser.Order.BREADTH_FIRST,
>>                    StopEvaluator.DEPTH_ONE, new ReturnableEvaluator() {
>>                @Override
>>                public boolean isReturnableNode(TraversalPosition
>>                        traversalPosition) {
>>                    final Node currentNode =
>>                        traversalPosition.currentNode();
>>                    final Iterator<Relationship> relationshipIterator =
>>                        currentNode.getRelationships(MyRelationships.CATEGORY).iterator();
>>                    while(relationshipIterator.hasNext()) {
>>                        final Relationship relationship =
>>                            relationshipIterator.next();
>>                        final String categoryName = (String)
>>                            relationship.getProperty("catId");
>>                        Counter counter = result.get(categoryName);
>>                        if(counter == null) {
>>                            result.put(categoryName, new Counter());
>>                        } else {
>>                            ++counter.count;
>>                        }
>>                    }
>>                    return true;
>>                }
>>            }, MyRelationships.SIMILAR, Direction.BOTH);
>>            traverserWords.getAllNodes();
>>        }
>>        return result;
>>    }
>>
>>
>> -Original Message-
>> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org]
>> On
>> Behalf Of Java Programmer
>> Sent: Thursday, July 08, 2010 8:12 AM
>> To: Neo4j user discussions
>> Subject: Re: [Neo4j] Is it possible to count common nodes when traversing?
>>
>> Hi,
>> Thanks for your answer but it's not exactly what I was on my mind -
>> word can belong to several categories, and different words can share
>> same category e.g.:
>>
>> "word 1" : "category 1", "category 2", "category 3"
>> "word 2" : "category 2", "category 3"
>> "word 3" : "category 3"
>>
>> there is relation between "word 1" and "word 2" and between "word 2"
>> and "word 3" (SIMILAR).
>> As a result when querying for "word 1" with depth 1, I would like to get:
>> "category 1" -> 1 (result), "category 2" -> 2, "category 3" -> 2 (not
>> 3 because it's out of depth)
>>
>> So far I have changed the previous method to use the relationship with
>> a property of categoryId, but I don't know whether there will be
>> performance issues (I iterate over all relationships of the found node
>> (every similar) and store the categories in a Map). If you could look
>> at it and tell me if the way of thinking is good, I would very much
>> appreciate it:
>>
>>    public Map<String, Integer> findCategoriesForWord(String word) {
>>        final Node node = index.getSingleNode("word", word);
>>        final Map<String, Integer> result = new HashMap<String, Integer>();
>>        if(node != null) {
>>            Traverser traverserWords =
>>                node.traverse(Traverser.Order.BREADTH_FIRST,
>>                    StopEvaluator.DEPTH_ONE, new ReturnableEvaluator() {
>>                @Override
>>                public boolean isReturnableNode
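Rick's counter-wrapper suggestion from this thread, extracted into a standalone form (plain Java, no Neo4j types): a mutable `int` holder avoids allocating a new `Integer` and re-inserting into the map on every increment.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone version of the "counter wrapper" idea: count occurrences
// with a mutable holder instead of boxed Integers.
public class CategoryCounter {
    public static final class Counter {
        public int count = 1;   // first sighting counts as 1
    }

    public static Map<String, Counter> countAll(Iterable<String> categories) {
        Map<String, Counter> result = new HashMap<String, Counter>();
        for (String category : categories) {
            Counter counter = result.get(category);
            if (counter == null) {
                result.put(category, new Counter());  // starts at 1
            } else {
                ++counter.count;                      // in-place, no boxing
            }
        }
        return result;
    }
}
```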

Re: [Neo4j] New tentative API in trunk: Expander/Expansion

2010-07-09 Thread Tobias Ivarsson
Thanks for all the input guys!

As of revision 4717 these methods no longer exist in trunk. Since this was
just a tentative API.
I will continue experimenting with this API in a branch and it will likely
make it back into the core API in a later release.

Cheers,
Tobias
-- 
Tobias Ivarsson 
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Mattias Persson
2010/7/9 Marko Rodriguez 

> Hi,
>
> > Would it actually be worth something to be able to begin a transaction
> which
> > auto-commits stuff every X write operation, like a batch inserter mode
> > which can be used in normal EmbeddedGraphDatabase? Kind of like:
> >
> >graphDb.beginTx( Mode.BATCH_INSERT )
> >
> > ...so that you can start such a transaction and then just insert data
> > without having to care about restarting it now and then?
>
> That's cool! Does that already exist? In my code (like others on the list it
> seems) I have a counter++ that every 20,000 inserts (some made-up number
> that is not going to throw an OutOfMemory) commits and then reopens a new
> transaction. Sorta sux.
>

No it doesn't, I just wrote something I thought someone might find useful.
A cool thing with just telling it to do a batch insert mode
transaction (not the actual commit interval) is that it could look at how
much memory it had to play around with and commit whenever it would be the
most efficient, even having the ability to change the limit on the fly if
the memory suddenly ran out.


> Thanks,
> Marko.
>
>



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


Re: [Neo4j] Can I use neo4j for this?

2010-07-09 Thread Laurent Laborde
On Fri, Jul 9, 2010 at 12:33 PM,   wrote:
> Dear all,
Dear you,

> I'm completely new to neo4j (and don't even really speak Java),

I'm a sysadmin with poor object-oriented programming skills. It wasn't a
problem to use neo4j, as the API is simple and clear. (No "enterprisy"
factoryFactory.Proxy.Processor(thingy.stuff()); just simple objects.)

> but I have been struggling in vain for quite a while to get sensible 
> performance on my graph-data in MySQL and PostgreSQL. From your webpage and 
> other posts on the lists I got the great feeling that newbies are welcome 
> here,

Friendly Greetings \o/

> so I hope it is all right if I tell you something about my data and what I 
> want to know about it so that maybe someone can tell whether I can actually 
> do this with neo4j.

Yes, you can ! (c)(r)(tm)

> My data is about 250 million separate graphs with a grand total of about 5 
> billion nodes.
> - The graphs are of a tree-like structure (many are actual trees, but not all 
> of them).
> - Every graph has an id.
> - Every node has 4 properties:

It should work :)

> 1. name (some names are very common, many occur only once or twice)
> 2. category1 (there are about 40 different categories on this level)
> 3. name-group (John, Jon, Jonathan form one group, many of the names that 
> occur only once get their own name group)
> 4. category2 (there are about 10 different categories on this level)
> - Every edge has one or two properties
> 1. type (currently about 50 different ones) [obligatory]
> 2. attribute [only there for 3 types; about 10 per cent of all edges]

Please note that you can have many edges per node and many "types" of edge.
Edges can have many properties. So no problems here.

>
[snip]
> If no highlighting is done, we just return the ids.
> If highlighting was done, let's say on n4.name, then I want all names that 
> occur in this position of any graph.

I don't understand the "highlight" thing.

[snip]
> I hope I managed to make myself understood. If not, I am happy to draw some 
> graphs and upload them somewhere.
>
> I know that I will need indices on the name and name-family properties. Not 
> sure how well they would perform on the less selective properties, though.

Lucene (the preferred index engine for neo4j) is very powerful :)

[snip]

> Performance: For many queries of the type outlined above, I have to wait for 
> more than two minutes on my SMALL dataset (6 million graphs, 100 million 
> nodes, 87 million edges) via PostgreSQL. For some it is more like 10 or 20 
> minutes... I would prefer not to have to wait for more than 5 seconds on the 
> small dataset and 20 or 30 on the big dataset.

I can try to help with PostgreSQL (I'm a PostgreSQL DBA) but, imho, you
should try with neo4j :)
(and I'd love to hear your feedback about graphs in pgsql vs graphs in neo4j)

I have a question: you say "250 million separate graphs".
Are the 250 million graphs totally independent? (Okay, 250 million
shards are probably overkill, but... well... just wondering :) )

*hugs*

-- 
Laurent "ker2x" Laborde
Sysadmin & DBA at http://www.over-blog.com/


Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Marko Rodriguez
Hi,

> Would it actually be worth something to be able to begin a transaction which
> auto-commits stuff every X write operation, like a batch inserter mode
> which can be used in normal EmbeddedGraphDatabase? Kind of like:
> 
>graphDb.beginTx( Mode.BATCH_INSERT )
> 
> ...so that you can start such a transaction and then just insert data
> without having to care about restarting it now and then?

That's cool! Does that already exist? In my code (like others on the list it
seems) I have a counter++ that every 20,000 inserts (some made-up number that
is not going to throw an OutOfMemory) commits and then reopens a new
transaction. Sorta sux.

Thanks,
Marko.

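The counter++ idiom Marko describes can be factored into a reusable helper so the commit/reopen bookkeeping lives in one place. `Tx` and `TxFactory` below are minimal invented stand-ins, not Neo4j's API; with a real `EmbeddedGraphDatabase`, the begin/commit spots would be `graphDb.beginTx()`, `tx.success()`, and `tx.finish()`.

```java
// Sketch of the commit-every-N-writes pattern from the thread, with the
// transaction abstracted behind a tiny stand-in interface.
public class BatchingWriter {
    public interface Tx { void commit(); }
    public interface TxFactory { Tx begin(); }

    private final TxFactory factory;
    private final int batchSize;
    private Tx current;
    private int writes;
    private int commits;

    public BatchingWriter(TxFactory factory, int batchSize) {
        this.factory = factory;
        this.batchSize = batchSize;
        this.current = factory.begin();
    }

    /** Call once per insert; commits and reopens every batchSize writes. */
    public void wrote() {
        if (++writes % batchSize == 0) {
            current.commit();
            commits++;
            current = factory.begin();
        }
    }

    /** Commit whatever is left at the end of the load. */
    public void close() {
        current.commit();
        commits++;
    }

    public int commitCount() { return commits; }
}
```

The 20,000 figure stays a guess either way; as Mattias notes, a smarter implementation would watch free memory instead of a fixed count.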


Re: [Neo4j] OutOfMemory while populating large graph

2010-07-09 Thread Mattias Persson
Modifications in a transaction are kept in memory so that the transaction
can be rolled back completely if something goes wrong. There could of course
be a solution (I'm just speculating here) where a tx that gets big enough is
converted into its own graph database or some other on-disk data structure,
which would then be merged into the main database on commit.

Would it actually be worth something to be able to begin a transaction which
auto-commits stuff every X write operation, like a batch inserter mode
which can be used in normal EmbeddedGraphDatabase? Kind of like:

graphDb.beginTx( Mode.BATCH_INSERT )

...so that you can start such a transaction and then just insert data
without having to care about restarting it now and then?

Another view of this is that such big transactions (I'm assuming here) are
only really used for a first-time insertion of a big data set, where the
BatchInserter can be used and does exactly that... it flushes to disk
whenever it feels like and you can just go on feeding it more and more data.

2010/7/8 Rick Bullotta 

> Paul, I also would like to see automatic swapping/paging to disk as part of
> Neo4J, minimally when in "bulk insert" mode...and ideally in every usage
> scenario.  I don't fully understand why the in-memory logs get so large
> and/or aren't backed by the on-disk log, or if they are, why they need to
> be
> kept in memory as well.  Perhaps it isn't the transaction "stuff" that is
> taking up memory, but the graph itself?
>
> Can any of the Neo team help provide some insight?
>
> Thanks!
>
>
> -Original Message-
> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org]
> On
> Behalf Of Paul A. Jackson
> Sent: Thursday, July 08, 2010 1:35 PM
> To: (User@lists.neo4j.org)
> Subject: [Neo4j] OutOfMemory while populating large graph
>
> I have seen people discuss committing transactions after some microbatch of
> a few hundred records, but I thought this was optional.  I thought Neo4J
> would automatically write out to disk as memory became full.  Well, I
> encountered an OOM and want to make sure that I understand the reason.  Was
> my understanding incorrect, or is there a parameter that I need to set to
> some limit, or is the problem them I am indexing as I go.  The stack trace,
> FWIW, is:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>at java.util.HashMap.<init>(HashMap.java:209)
>at java.util.HashSet.<init>(HashSet.java:86)
>at
>
> org.neo4j.index.lucene.LuceneTransaction$TxCache.add(LuceneTransaction.java:
> 334)
>at
> org.neo4j.index.lucene.LuceneTransaction.insert(LuceneTransaction.java:93)
>at
> org.neo4j.index.lucene.LuceneTransaction.index(LuceneTransaction.java:59)
>at
> org.neo4j.index.lucene.LuceneXaConnection.index(LuceneXaConnection.java:94)
>at
>
> org.neo4j.index.lucene.LuceneIndexService.indexThisTx(LuceneIndexService.jav
> a:220)
>at
> org.neo4j.index.impl.GenericIndexService.index(GenericIndexService.java:54)
>at
>
> org.neo4j.index.lucene.LuceneIndexService.index(LuceneIndexService.java:209)
>at
> JiraLoader$JiraExtractor$Item.setNodeProperty(JiraLoader.java:321)
>at
> JiraLoader$JiraExtractor$Item.updateGraph(JiraLoader.java:240)
>
> Thanks,
> Paul Jackson
>



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


[Neo4j] Can I use neo4j for this?

2010-07-09 Thread Gurkensalat
Dear all,

I'm completely new to neo4j (and don't even really speak Java), but I have been 
struggling in vain for quite a while to get sensible performance on my 
graph-data in MySQL and PostgreSQL. From your webpage and other posts on the 
lists I got the great feeling that newbies are welcome here, so I hope it is 
all right if I tell you something about my data and what I want to know about 
it so that maybe someone can tell whether I can actually do this with neo4j.

My data is about 250 million separate graphs with a grand total of about 5 
billion nodes. 
- The graphs are of a tree-like structure (many are actual trees, but not all 
of them).
- Every graph has an id.
- Every node has 4 properties:
1. name (some names are very common, many occur only once or twice)
2. category1 (there are about 40 different categories on this level)
3. name-group (John, Jon, Jonathan form one group, many of the names that occur 
only once get their own name group)
4. category2 (there are about 10 different categories on this level)
- Every edge has one or two properties:
1. type (currently about 50 different ones) [obligatory]
2. attribute [only there for 3 types; about 10 per cent of all edges]

There are two sorts of questions I want to be able to answer:
1. The user specifies a subgraph (currently even a subtree, but not sure 
whether it will remain that way) and wants the ids of all matching graphs.
2. The user specifies a subgraph and highlights one position he didn't fill in. 
As a result, he wants a list of all items that occur in this position ordered 
by their frequency in this position.

Examples of queries: (sorry for the weird format, but I have no idea how to 
represent a tree in text)

EXAMPLE 1:
relations:
n1 > n2  (relation type: t1)
n1 > n3  (relation type: t2)
n3 > n4  (relation type: t3)
n3 > n5  (relation type: t3)
n3 > n6  (relation type not specified, just has to exist)

node properties:
n1: name-group: John-like; category2: c2-13
n2: [no properties specified, just has to exist]
n3: category1: c1-15
n4: [no properties specified, just has to exist]
n5: [no properties specified, just has to exist]
n6: name: Ben; category2: c2-13

If no highlighting is done, we just return the ids.
If highlighting was done, let's say on n4.name, then I want all names that 
occur in this position of any graph.

EXAMPLE 2:
relations:
n1 > n2  (relation type: t1)
n1 > n3  (relation type: t2)
n3 > n4  (relation type: t3)

no node properties specified.

I hope I managed to make myself understood. If not, I am happy to draw some 
graphs and upload them somewhere.

I know that I will need indices on the name and name-group properties. Not 
sure how well they would perform on the less selective properties, though.

Basically, my problem is similar to the one found here:
http://lists.neo4j.org/pipermail/user/2009-June/001331.html
But what makes me worry is a quote from here: 
http://components.neo4j.org/neo4j-graph-matching/
"The pattern matching is done by first defining a graph pattern and then 
searching for matching occurrences of that pattern in the graph around a given 
anchor node."
I do not necessarily have an anchor node. And I have lots of graphs...
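To give a sense of the work a matcher has to do per candidate anchor (which is why the anchor node matters for performance), here is a minimal, self-contained sketch in plain Java of the recursive subtree match that EXAMPLE 1 describes. This is not the neo4j-graph-matching API; the node/edge shapes are simplified assumptions:

```java
import java.util.*;

class PatternMatch {
    /** A data node: properties plus typed child edges (tree-shaped, as in the post). */
    record Node(Map<String, String> props, List<Edge> edges) {}
    record Edge(String type, Node child) {}

    /** A pattern node: required properties; children with an optional required edge type. */
    record Pat(Map<String, String> props, List<PEdge> edges) {}
    record PEdge(String type, Pat child) {}  // type == null -> "just has to exist"

    /** Does the data tree rooted at n satisfy the pattern rooted at p? */
    static boolean matches(Node n, Pat p) {
        if (!n.props().entrySet().containsAll(p.props().entrySet())) return false;
        return assign(n.edges(), p.edges(), 0, new boolean[n.edges().size()]);
    }

    // Backtracking: each pattern edge must claim a distinct matching data edge.
    static boolean assign(List<Edge> es, List<PEdge> ps, int i, boolean[] used) {
        if (i == ps.size()) return true;
        PEdge pe = ps.get(i);
        for (int j = 0; j < es.size(); j++) {
            if (used[j]) continue;
            Edge e = es.get(j);
            if (pe.type() != null && !pe.type().equals(e.type())) continue;
            if (!matches(e.child(), pe.child())) continue;
            used[j] = true;
            if (assign(es, ps, i + 1, used)) return true;
            used[j] = false;
        }
        return false;
    }
}
```

Without an anchor, this check has to be attempted at candidate roots across all 250 million graphs, which is where indices on selective properties (name, name-group) come in: they shrink the candidate set before any structural matching starts.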

Performance: For many queries of the type outlined above, I have to wait for 
more than two minutes on my SMALL dataset (6 million graphs, 100 million nodes, 
87 million edges) via PostgreSQL. For some it is more like 10 or 20 minutes... 
I would prefer not to have to wait for more than 5 seconds on the small dataset 
and 20 or 30 on the big dataset.

Sorry for the lengthy email and I'm looking forward to your replies!

Best regards,
Jonathan

-- 


Re: [Neo4j] How to traverse by the number of relationships between nodes?

2010-07-09 Thread Tim Jones
Hi Craig,

That's great, thanks a lot. I'll give it a go.

Cheers, 
Tim



- Original Message 
From: Craig Taverner 
To: Neo4j user discussions 
Sent: Thu, July 8, 2010 8:49:38 PM
Subject: Re: [Neo4j] How to traverse by the number of relationships between 
nodes?

Hi Tim,

It is exactly the same approach, but instead of building the route cache
while loading the graph, just do it on a second pass which traverses the
graph. If the graph is structured like you describe, then write a traverser
that visits each Visit node once, and for each Visit node iterate over the
Page relationships, creating an Array of relationships. Sort the array by
your visit order property and you have the route cache. Step backwards
through the route creating the route relationships as described before.
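The steps above can be modeled compactly, assuming each Visit node's Page relationships carry a visit-order property. A plain-Java sketch (names are illustrative, not the Neo4j API):

```java
import java.util.*;

class RouteCache {
    /** One "Page" relationship hanging off a Visit node. */
    record PageRel(int visitOrder, long pageNodeId) {}

    /** Craig's second pass: gather a Visit node's Page relationships into an
     *  array, sort by the visit-order property, and the route falls out. */
    static List<Long> route(List<PageRel> pageRels) {
        return pageRels.stream()
                .sorted(Comparator.comparingInt(PageRel::visitOrder))
                .map(PageRel::pageNodeId)
                .toList();
    }
}
```

The resulting ordered list of page node ids is exactly the in-memory route cache from the 'during load' approach, so the backwards pass that creates the route relationships is identical from here on.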

Cheers, Craig

On Thu, Jul 8, 2010 at 5:33 PM, Tim Jones  wrote:

> Hi Craig, thanks for your answer.
>
> What's your approach that would allow me to specify the destination node at
> analysis time? I'd like to retain the flexibility to do this too.
>
> Thanks
> Tim
>
>
>
> - Original Message 
> From: Craig Taverner 
> To: Neo4j user discussions 
> Sent: Thu, July 8, 2010 12:09:28 PM
> Subject: Re: [Neo4j] How to traverse by the number of relationships between
> nodes?
>
> Even without the new traversal framework, the returnable evaluator has
> access to the current node being evaluated and can investigate its
> relationships (and even run another traverser). I'm not sure if nested
> traversing is a good idea, but I certainly have used methods like
> getRelationships inside an evaluator with no problems.
>
> As for the main goal, I think there are many ways to skin a cat. For
> performance reasons I would always look for the way that embeds the final
> result in the graph structure itself, so you don't need complex traversals
> to get your answer. So in your case you want the 10 most popular routes, I
> guess what you are looking for are relationships between pages that define
> a
> route and a popularity score. So the final answer would be found by simply
> sorting these relationships to the destination page by popularity. No
> traversal required :-)
>
> Your current structure is a good match for the incoming data, but requires
> lots of traversing to determine the main answer you are after. So I would
> vote for adding a new structure that includes the answer. I think I have an
> idea that can be done during load if you know in advance the destination
> node you want to analyse, as well as after load (second pass) if you want
> to
> specify the destination node only at analysis time. I'll describe the
> 'during load' approach.
>
> Load the apache log data, optionally building the structure you do now, but
> also identifying all start points and routes to the destination. This can
> be
> achieved by an in memory cache for each user session (visit) of the route
> from the entry point, appended to as each new page is visited (just an
> ArrayList of page Nodes, growing page-by-page), and when the destination
> Page is reached, create a unique identifier for that route (eg. a string of
> all node-ids in the route, or the hashcode of that). Then step back along
> all nodes in the route, adding relations with
> DynamicRelationshipType.withName("ROUTE-"+routeName) and property count=1,
> and if the relationship already exists for that name, increment the count.
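A self-contained sketch of that backwards pass in plain Java, with the per-page ROUTE-* relationships and their count property modeled as map entries (the real thing would create relationships via `DynamicRelationshipType.withName(...)` as described; the key format here is an assumption for illustration):

```java
import java.util.*;

class RouteCounter {
    /** Build the "ROUTE-" identifier from the node ids along the route,
     *  one of the naming options suggested in the post. */
    static String routeName(List<Long> route) {
        StringBuilder sb = new StringBuilder("ROUTE-");
        for (long id : route) sb.append(id).append('-');
        return sb.substring(0, sb.length() - 1);  // drop trailing '-'
    }

    // Models one counter per (page node, route name) pair, i.e. the count
    // property on the corresponding route relationship.
    final Map<String, Integer> counters = new HashMap<>();

    /** Step back along the route, creating or incrementing one counter per page. */
    void recordRoute(List<Long> route) {
        String name = routeName(route);
        for (long pageId : route)
            counters.merge(pageId + "|" + name, 1, Integer::sum);
    }
}
```

Loading the same route twice simply bumps every counter along it by one, which is why later apache logs can be appended without any special handling.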
>
> You can even load later apache logs to this and it will continue to
> increment the route counters nicely. And to reset the counters, just
> delete all those route relationships.
>
> Now the final answer for your query is only to iterate over all incoming
> relationships to the destination page, and if the relationship type name
> starts with 'ROUTE-' add to an ArrayList of relationships, and then sort
> that list by the counter property. This should be almost instantaneous
> result :-)
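That final query can be sketched as follows, with the destination page's incoming relationships modeled as a map from relationship type name to count property (plain Java rather than the Neo4j iteration API):

```java
import java.util.*;

class RouteRanking {
    /** Final query: keep only the ROUTE-* relationships coming into the
     *  destination page and sort them by their count, highest first. */
    static List<String> topRoutes(Map<String, Integer> incoming, int k) {
        return incoming.entrySet().stream()
                .filter(e -> e.getKey().startsWith("ROUTE-"))
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

Since the destination page only has as many ROUTE-* relationships as there are distinct routes, this is a scan and sort over a small collection, not a graph traversal.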
>
> Of course, this algorithm assumes that the total number of possible routes
> is not unreasonably high. I believe you can have something like 64k
> relationship types, so using the relationship type for the route name is
> possible. If you are uncomfortable with that, just use a static type like
> 'ROUTE', and put the relationship name in a relationship property. That
> slightly increases the complexity of the test for the route during creation
> and slightly decreases the complexity of the test for the route during the
> final scoring. For this example, the performance difference is
> insignificant.
>
> Cheers, Craig
>
>
> On Thu, Jul 8, 2010 at 10:57 AM, Anders Nawroth  >wrote:
>
> > Hi Tim!
> >
> > Maybe you can use the new traversal framework, this interface comes to
> > mind:
> >
> > http://components.neo4j.org/neo4j-kernel/apidocs/org/neo4j/graphdb/traversal/SourceSelector.html
> >
> > Regarding the number of relationships, it could be a good idea to store
> > it as a property on the node.
> >
> > /anders
> >
> > > Is there any way I can write a ReturnableEva

Re: [Neo4j] Write Neo4j Books - Packt Publishing

2010-07-09 Thread Laurent Laborde
On Fri, Jul 9, 2010 at 9:47 AM, Kshipra Singh  wrote:
> Hi All,
>
> I represent Packt Publishing, the publishers of computer related books.
>
> We are planning to extend our range of Open Source books based on Java 
> technology and are currently inviting authors interested in writing them. 
> This doesn't require any past writing experience. All that we expect from our 
> authors is a good subject knowledge, a passion to share it with others and an 
> ability to communicate clearly in English.
>
> So, if you love Neo4j and fancy writing a book, here's an opportunity for 
> you! Send us your book ideas at aut...@packtpub.com and our editorial team 
> will be happy to evaluate them. Even if you don't have a book idea and are 
> simply interested in writing a book, we are still keen to hear from you.

I'll happily buy a neo4j book. (I won't write it :D )
(Funnily enough, I had never heard of PacktPub until yesterday, when I
bought a book on Solr.)

-- 
Laurent "ker2x" Laborde
Sysadmin & DBA at http://www.over-blog.com/


[Neo4j] Write Neo4j Books - Packt Publishing

2010-07-09 Thread Kshipra Singh
Hi All, 

I represent Packt Publishing, the publishers of computer related books. 

We are planning to extend our range of Open Source books based on Java 
technology and are currently inviting authors interested in writing them. This 
doesn't require any past writing experience. All that we expect from our 
authors is a good subject knowledge, a passion to share it with others and an 
ability to communicate clearly in English.
 
So, if you love Neo4j and fancy writing a book, here's an opportunity for you! 
Send us your book ideas at aut...@packtpub.com and our editorial team will be 
happy to evaluate them. Even if you don't have a book idea and are simply 
interested in writing a book, we are still keen to hear from you.

Packt runs an Open Source royalty scheme so by writing for Packt you will be 
giving back to the Open Source Community.
 
More details about this opportunity are available at: 
http://authors.packtpub.com/content/calling-open-source-java-based-technology-enthusiasts-write-packt
 
Thanks
Kshipra Singh
Author Relationship Manager
Packt Publishing
www.PacktPub.com
 
Skype: kshiprasingh15
Twitter: http://twitter.com/packtauthors

Interested in becoming an author? Visit http://authors.packtpub.com for all the 
information you need about writing for Packt.