Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-26 Thread Peter Neubauer
Hi Núria,
the current ID scheme uses integers as IDs for nodes, relationships,
and properties, which limits the possible store size to 4 billion
nodes, 4 billion relationships, and 4 billion properties. Of course one
could switch to longs as IDs, but that would increase the number of
reserved bytes and could incur performance penalties. However, this is
the current limit; beyond it you have to start thinking about sharding
along a suitable domain-specific criterion.
What size and domain are you imagining?

However, when dealing with bigger node spaces you will probably want to
increase the RAM of your server machine and consider SSDs in order to
keep the often-used parts of your graph cached and minimize I/O cost.
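The 4-billion ceiling mentioned above is just the size of the 32-bit integer ID space; a quick sanity check of the arithmetic (plain Java, no Neo4j dependency):

```java
public class IdSpace {
    public static void main(String[] args) {
        // A 32-bit ID can take 2^32 distinct values: one per record.
        long intIds = 1L << 32;
        System.out.println(intIds); // 4294967296, i.e. ~4.3 billion

        // Switching to 64-bit long IDs would lift the ceiling enormously,
        // at the cost of extra reserved bytes per record, as noted above.
        System.out.println(Long.MAX_VALUE); // 9223372036854775807
    }
}
```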

HTH

Cheers,

/peter neubauer

COO and Sales, Neo Technology

GTalk:  neubauer.peter
Skype   peter.neubauer
Phone   +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter  http://twitter.com/peterneubauer

http://www.neo4j.org - Relationships count.
http://gremlin.tinkerpop.com - PageRank in 2 lines of code.
http://www.linkedprocess.org - Computing at LinkedData scale.



On Sat, Dec 26, 2009 at 4:10 PM, Núria Trench nuriatre...@gmail.com wrote:
 Hi,

 I have just finished parsing and creating the database with the latest
 index-util-0.9-SNAPSHOT available in your repository. It finished
 successfully, so I must thank you for your interest and useful help.
 And, finally, I have one last question. I have created 180 million edges
 and 20 million nodes. Is it possible to create a larger number of edges
 and nodes with Neo4j? Is there a limit?

 Thank you very much again.

 

Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-21 Thread Núria Trench
Hi again Mattias,

I'm still trying to parse all the data in order to create the graph. I will
report the results as soon as possible.
Thank you very much for your interest.

Núria.

2009/12/21 Mattias Persson matt...@neotechnology.com

 Hi again,

 any luck with this yet?

 2009/12/11 Núria Trench nuriatre...@gmail.com:
  Thank you very much Mattias. I will test it as soon as possible and I'll
  tell you something.
 
  Núria.
 

___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-11 Thread Mattias Persson
I've tried this a couple of times now and first of all I see some
problems in your code:

1) In the method createRelationsTitleImage you have an inverted head
!= -1 check where it should be head == -1

2) You index relationships in createRelationsBetweenTitles method,
this isn't ok since the index can only manage nodes.

And I recently committed a fix which removed the caching layer in
the LuceneIndexBatchInserterImpl (and therefore also
LuceneFulltextIndexBatchInserter). This probably fixes your problems.
I'm also working on a performance fix which makes consecutive getNodes
calls faster.

So I think that with these fixes (1) and (2) and the latest index-util
0.9-SNAPSHOT your sample will run fine. Also you could try without
calling optimize. See more information at
http://wiki.neo4j.org/content/Indexing_with_BatchInserter
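For readers who land on this thread: the workflow described above — index every node during batch insertion, call optimize, then resolve endpoints with getSingleNode while creating relationships — looks roughly like the sketch below. Only LuceneIndexBatchInserterImpl, getSingleNode, getNodes and optimize are named in this thread; the remaining class names, constructors and signatures are assumptions about the 2009-era index-util 0.9 API and may not match your version.

```java
// Sketch only -- verify names against your index-util version.
BatchInserter inserter = new BatchInserterImpl("path/to/graphdb");
LuceneIndexBatchInserter index = new LuceneIndexBatchInserterImpl(inserter);

// Phase 1: create and index all nodes.
Map<String, Object> props = new HashMap<String, Object>();
props.put("title", "SomeTitle");
long nodeId = inserter.createNode(props);
index.index(nodeId, "title", "SomeTitle");

// Make the indexed entries visible before looking anything up.
index.optimize();

// Phase 2: resolve endpoints by indexed property; -1 means "not found".
long head = index.getSingleNode("title", "SomeTitle");
if (head == -1) {
    throw new IllegalStateException("head node not found in index");
}

index.shutdown();
inserter.shutdown();
```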




-- 
Mattias Persson, [matt...@neotechnology.com]
Neo Technology, www.neotechnology.com


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-10 Thread Mattias Persson
To continue this thread in the user list:

Thanks Núria, I've gotten your sample code/files and I'm running it
now to try to reproduce your problem.

2009/12/9 Núria Trench nuriatre...@gmail.com:
 I have finished uploading the 4 csv files. You'll see an e-mail with the
 other 3 csv files packed in a rar file.
 Thanks,

 Núria.

 2009/12/9 Núria Trench nuriatre...@gmail.com

 Yes, you are right. But there is one csv file that is too big to be packed
 with other files and I am reducing it.
 I am sending the other files now.

 2009/12/9 Mattias Persson matt...@neotechnology.com

 By the way, you might consider packing those files (with zip or tar.gz
 or something) cause they will shrink quite well

 2009/12/9 Mattias Persson matt...@neotechnology.com:
  Great, but I only got the images.csv file... I'm starting to test with
  that at least
 
  2009/12/9 Núria Trench nuriatre...@gmail.com:
  Hi again,
 
   The errors show up after parsing 2 csv files to create all the nodes,
   just at the moment of calling the method getSingleNode to look up the
   tail and head node for creating all the edges by reading the other two
   csv files.
 
   I am sending you, via Sprend, the four csv files that will help you
   trigger the index behaviour.
 
  Thank you,
 
  Núria.
 
  2009/12/9 Mattias Persson matt...@neotechnology.com
 
   Hmm, I've no idea... but do the errors show up early in the process,
   or do you have to insert a LOT of data to trigger it? In that case
   you could send me a part of the files... maybe using http://www.sprend.se ,
   WDYT?
 
  2009/12/9 Núria Trench nuriatre...@gmail.com:
   Hi Mattias,
  
    The data isn't confidential but the files are very big (5.5 GB).
    How can I send you this data?
  
   2009/12/9 Mattias Persson matt...@neotechnology.com
  
    Yep, I got the java code, thanks. If the data is confidential or
    sensitive you can just send me the formatting; otherwise consider
    sending the files as well (or a subset if they are big).
  
   2009/12/9 Núria Trench nuriatre...@gmail.com:
   
   
  
  
  






-- 
Mattias Persson, [matt...@neotechnology.com]
Neo Technology, www.neotechnology.com


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-09 Thread Núria Trench
Hi Todd,

The sample code creates nodes and relationships by parsing 4 csv files.
Thank you for trying to trigger this behaviour with this sample.

Núria

2009/12/9 Mattias Persson matt...@neotechnology.com

 Could you provide me with some sample code which can trigger this
 behaviour with the latest index-util-0.9-SNAPSHOT, Núria?

 2009/12/9 Núria Trench nuriatre...@gmail.com:
  Todd,
 
  I don't have the same problem. In my case, after indexing all the
  attributes/properties of each node, the application creates all the edges
  by looking up the tail node and the head node. To do so, it calls
  org.neo4j.util.index.LuceneIndexBatchInserterImpl.getSingleNode, which
  returns -1 (node not found) on many occasions.

  Does anyone have an alternative way to look up a node by its indexed
  attributes/properties?
 
  Thank you,
 
  Núria.
 
 
  2009/12/7 Mattias Persson matt...@neotechnology.com
 
  Todd, are you sure you have the latest index-util 0.9-SNAPSHOT? This
  is a bug that we fixed yesterday... (assuming it's the same bug).
 
  2009/12/7 Todd Stavish toddstav...@gmail.com:
   Hi Mattias, Núria.
  
   I am also running into scalability problems with the Lucene batch
   inserter at much smaller numbers, 30,000 indexed nodes. I tried
   calling optimize more. Increasing ulimit didn't help.
  
    [INFO] Exception in thread main java.lang.RuntimeException:
    java.io.FileNotFoundException:
    /Users/todd/Code/neo4Jprototype/target/classes/data/graph/lucene/name/_0.cfx
    (Too many open files)
    [INFO]  at org.neo4j.util.index.LuceneIndexBatchInserterImpl.getNodes(LuceneIndexBatchInserterImpl.java:186)
    [INFO]  at org.neo4j.util.index.LuceneIndexBatchInserterImpl.getSingleNode(LuceneIndexBatchInserterImpl.java:238)
    [INFO]  at com.collectiveintelligence.QueryNeo.loadDataToGraph(QueryNeo.java:277)
    [INFO]  at com.collectiveintelligence.QueryNeo.main(QueryNeo.java:57)
    [INFO] Caused by: java.io.FileNotFoundException:
    /Users/todd/Code/neo4Jprototype/target/classes/data/graph/lucene/name/_0.cfx
    (Too many open files)
  
    I tried breaking it up into separate BatchInserter instances, and it
    hangs now. Can I create more than one batch inserter per process if they
    run sequentially and single-threaded?
  
   Thanks,
   Todd
  
  
  
  
  
   On Mon, Dec 7, 2009 at 7:28 AM, Núria Trench nuriatre...@gmail.com
  wrote:
    Hi again Mattias,

    I have tried to execute my application with the latest version available
    in the maven repository and I still have the same problem. After creating
    and indexing all the nodes, the application calls the optimize method
    and, then, it creates all the edges by calling the method getNodes in
    order to select the tail and head node of each edge, but it doesn't work
    because many nodes are not found.

    I have tried to create only 30 nodes and 15 edges and it works properly,
    but if I try to create a big graph (180 million edges + 20 million nodes)
    it doesn't.

    I have also tried to call the optimize method every time the application
    has created 1 million nodes, but it doesn't work.

    Have you tried to create as many nodes as I have said with the newer
    index-util version?

    Thank you,

    Núria.
  
   2009/12/4 Núria Trench nuriatre...@gmail.com
  
   Hi Mattias,
  
    Thank you very much for fixing the problem so fast. I will try it as
    soon as the new changes are available in the maven repository.
  
   Núria.
  
  
   2009/12/4 Mattias Persson matt...@neotechnology.com
  
   I fixed the problem and also added a cache per key for faster
   getNodes/getSingleNode lookup during the insert process. However
 the
   cache assumes that there's nothing in the index when the process
   starts (which almost always will be true) to speed things up even
   further.
  
    You can control the cache size, and whether the cache is used at all, by
    overriding the following methods in your LuceneIndexBatchInserterImpl
    instance (they are also documented in the Javadoc):

    boolean useCache()
    int getMaxCacheSizePerKey()

    The new changes should be available in the maven repository within an
    hour.
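Overriding those two methods presumably means subclassing (or anonymously extending) the inserter; a hedged sketch, where only useCache() and getMaxCacheSizePerKey() come from this thread and everything else, including the methods' visibility and the cache-size value, is an assumption:

```java
// Sketch only -- signatures are assumptions beyond the two method names above.
LuceneIndexBatchInserter index = new LuceneIndexBatchInserterImpl(inserter) {
    @Override
    public boolean useCache() {
        // Return false to disable the cache, e.g. if the index
        // already contains data when the process starts.
        return true;
    }

    @Override
    public int getMaxCacheSizePerKey() {
        return 1000000; // assumed tuning value, not from the thread
    }
};
```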
  
   2009/12/4 Mattias Persson matt...@neotechnology.com:
I think I found the problem... it's indexing as it should, but it
isn't reflected in getNodes/getSingleNode properly until you
flush/optimize/shutdown the index. I'll try to fix it today!
   
2009/12/3 Núria Trench nuriatre...@gmail.com:
Thank you very much for your response.
If you need more information, just send an e-mail and I will try to
explain it better.
   
Núria.
   
2009/12/3 Mattias Persson matt...@neotechnology.com
   
This is something I'd like to reproduce, and I'll do some testing on
this tomorrow.
   
2009/12/3 Núria Trench nuriatre...@gmail.com:
 Hello,

 Last week, I decided to download your graph database core in order to
 use it. First, I created a new project to 

Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-09 Thread Mattias Persson
Hi again, Núria (it was I, Mattias, who asked for the sample code).
Well... the fact that you parse 4 csv files doesn't really help me
set up a test for this... I mean, how can I know that my test will be
similar to yours? Would it be ok to attach your code/csv files as
well?

/ Mattias


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-09 Thread Núria Trench
Hi Mattias,

In my last e-mail I have attached the sample code, haven't you received it?
I will try to attach it again.

Núria.


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-09 Thread Mattias Persson
Oh ok, it could be our attachments filter / security or something...
Could you try to mail them to me directly at matt...@neotechnology.com
?

   
Núria.
   
   
2009/12/4 Mattias Persson matt...@neotechnology.com
   
I fixed the problem and also added a cache per key for faster
getNodes/getSingleNode lookup during the insert process. However
  the
cache assumes that there's nothing in the index when the process
starts (which almost always will be true) to speed things up
 even
further.
   
You can control the cache size and if it should be used by
  overriding
the (this is also documented in the Javadoc):
   
boolean useCache()
int getMaxCacheSizePerKey()
   
methods in your LuceneIndexBatchInserterImpl instance. The new
  changes
should be available 

Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-09 Thread Núria Trench
Hi Mattias,

I already did it 10 minutes ago. If you need an example showing the
format of the 4 csv files, I can send it to you.
Thanks again,

Núria.

2009/12/9 Mattias Persson matt...@neotechnology.com

 Oh ok, it could be our attachment filter / security or something...
 could you try to mail them to me directly at matt...@neotechnology.com
 ?


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-07 Thread Núria Trench
Hi again Mattias,

I have tried to execute my application with the latest version available in
the maven repository and I still have the same problem. After creating and
indexing all the nodes, the application calls the optimize method and then
creates all the edges by calling the getNodes method to select the tail and
head node of each edge, but it doesn't work because many nodes are not found.

I have tried to create only 30 nodes and 15 edges and that works properly, but
if I try to create a big graph (180 million edges + 20 million nodes) it
doesn't.

I have also tried to call the optimize method every time the application
has created 1 million nodes, but it doesn't work.

Have you tried to create as many nodes as I have said with the newer
index-util version?

Thank you,

Núria.


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-07 Thread Todd Stavish
Hi Mattias, Núria.

I am also running into scalability problems with the Lucene batch
inserter at much smaller numbers, 30,000 indexed nodes. I tried
calling optimize more. Increasing ulimit didn't help.

[INFO] Exception in thread "main" java.lang.RuntimeException:
java.io.FileNotFoundException:
/Users/todd/Code/neo4Jprototype/target/classes/data/graph/lucene/name/_0.cfx
(Too many open files)
[INFO]  at org.neo4j.util.index.LuceneIndexBatchInserterImpl.getNodes(LuceneIndexBatchInserterImpl.java:186)
[INFO]  at org.neo4j.util.index.LuceneIndexBatchInserterImpl.getSingleNode(LuceneIndexBatchInserterImpl.java:238)
[INFO]  at com.collectiveintelligence.QueryNeo.loadDataToGraph(QueryNeo.java:277)
[INFO]  at com.collectiveintelligence.QueryNeo.main(QueryNeo.java:57)
[INFO] Caused by: java.io.FileNotFoundException:
/Users/todd/Code/neo4Jprototype/target/classes/data/graph/lucene/name/_0.cfx
(Too many open files)

I tried breaking it up into separate BatchInserter instances, and now it
hangs. Can I create more than one batch inserter per process if they run
sequentially and non-threaded?

Thanks,
Todd
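
The usual way to avoid leaking file handles is to close each inserter completely before opening the next one on the same store. A sketch of that sequential pattern, using the class and method names mentioned in this thread (BatchInserterImpl, LuceneIndexBatchInserterImpl); treat it as pseudocode for the 2009-era API, since exact constructors and signatures may differ:

```java
// Sketch only -- names follow this thread's 2009-era API, not a current one.
BatchInserter inserter = new BatchInserterImpl("target/graphdb");
LuceneIndexBatchInserter index = new LuceneIndexBatchInserterImpl(inserter);

// ... create nodes and index them, e.g. index.index(nodeId, "name", value) ...

index.optimize();    // make indexed entries visible to getNodes/getSingleNode
index.shutdown();    // closes Lucene's open segment files
inserter.shutdown(); // flushes and closes the store files

// Only after both shutdowns is it safe to open a second inserter
// on the same store directory within the same process.
```

If file handles are still exhausted after proper shutdowns, raising the per-process limit (ulimit -n) or calling optimize() periodically to merge Lucene segments are the usual mitigations.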






Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-07 Thread Mattias Persson
Todd, are you sure you have the latest index-util 0.9-SNAPSHOT? This
is a bug that we fixed yesterday... (assuming it's the same bug).


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-04 Thread Mattias Persson
I think I found the problem... it's indexing as it should, but it
isn't reflected in getNodes/getSingleNode properly until you
flush/optimize/shutdown the index. I'll try to fix it today!
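
Until that fix landed, the practical workaround was to do all writes first, call optimize() once, and only then perform lookups — rather than optimizing after every insertion. A sketch of that two-phase load, again using this thread's method names as pseudocode for the old API (CsvRow and propertiesOf are hypothetical helpers):

```java
// Phase 1: insert and index all nodes. No lookups happen here.
for (CsvRow row : nodeRows) {
    long id = inserter.createNode(propertiesOf(row)); // propertiesOf: hypothetical
    index.index(id, "name", row.name);
}

// One optimize makes everything indexed so far visible to queries.
index.optimize();

// Phase 2: now getSingleNode can resolve edge endpoints.
for (CsvRow row : edgeRows) {
    long tail = index.getSingleNode("name", row.tailName);
    long head = index.getSingleNode("name", row.headName);
    if (tail != -1 && head != -1) { // -1 means not found, per this thread
        inserter.createRelationship(tail, head, relType, null);
    }
}
```

The point is to amortize the cost of optimize() over the whole node phase instead of paying it per insertion.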


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-04 Thread Núria Trench
Hi Mattias,

Thank you very much for fixing the problem so fast. I will try it as soon as
the new changes are available in the maven repository.

Núria.

2009/12/4 Mattias Persson matt...@neotechnology.com

 I fixed the problem and also added a cache per key for faster
 getNodes/getSingleNode lookup during the insert process. However the
 cache assumes that there's nothing in the index when the process
 starts (which almost always will be true) to speed things up even
 further.

 You can control the cache size, and whether the cache is used at all, by
 overriding the following methods in your LuceneIndexBatchInserterImpl
 instance (this is also documented in the Javadoc):

 boolean useCache()
 int getMaxCacheSizePerKey()

 The new changes should be available in the maven repository within an hour.
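
 Overriding those two hooks might look like the sketch below. This is an
 assumption about the 2009-era API (the method modifiers and the default
 return values are not stated in the thread), so treat it as pseudocode:

 ```java
 // Sketch: tuning the per-key lookup cache via the documented hooks.
 LuceneIndexBatchInserter index = new LuceneIndexBatchInserterImpl(inserter) {
     @Override
     protected boolean useCache() {
         return true; // only valid if the index is empty when the process starts
     }

     @Override
     protected int getMaxCacheSizePerKey() {
         return 1000000; // roughly one entry per node indexed under a key; assumed value
     }
 };
 ```

 Sizing the cache near the number of nodes indexed under a key keeps
 getNodes/getSingleNode lookups in memory during the insert, at the cost of
 heap.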


___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-03 Thread Núria Trench
Thank you very much for your response.
If you need more information, just send an e-mail and I will try
to explain it better.

Núria.

2009/12/3 Mattias Persson matt...@neotechnology.com

 This is something I'd like to reproduce and I'll do some testing on
 this tomorrow

 2009/12/3 Núria Trench nuriatre...@gmail.com:
  Hello,

  Last week, I decided to download your graph database core in order to use
  it. First, I created a new project to parse my CSV files and create a new
  graph database with Neo4j. These CSV files contain 150 million edges and 20
  million nodes.

  When I finished writing the code that creates the graph database, I
  executed it and, after six hours of execution, the program crashed because
  of a Lucene exception. The exception is related to index merging and
  has the following message:
  mergeFields produced an invalid result: docCount is 385282378 but fdx file
  size is 3082259028; now aborting this merge to prevent index corruption

  I have searched on the net and I found that it is a Lucene bug. The
  libraries used for executing my project were:
  neo-1.0-b10
  index-util-0.7
  lucene-core-2.4.0

  So, I decided to use a newer Lucene version. I found that you have a newer
  index-util version, so I updated the libraries:
  neo-1.0-b10
  index-util-0.9
  lucene-core-2.9.1

  When I had updated those libraries, I tried to execute my project again and
  I found that, on many occasions, it was not indexing properly. So, I tried
  to optimize the index every time I indexed something. That made the indexing
  work properly, but the execution time increased a lot.

  I am not using transactions; instead, I am using the Batch Inserter
  with the LuceneIndexBatchInserter.

  So, my question is: what can I do to solve this problem? If I use
  index-util-0.7 I cannot finish creating the graph database, and if I use
  index-util-0.9 I have to optimize the index on every insertion and the
  execution never ends.

  Thank you very much in advance,

  Núria.
 



 --
 Mattias Persson, [matt...@neotechnology.com]
 Neo Technology, www.neotechnology.com

___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user