Re: [Neo] LuceneIndexBatchInserter doubt
Hi again Mattias,

I have tried to execute my application with the latest version available in the Maven repository and I still have the same problem. After creating and indexing all the nodes, the application calls the optimize method and then creates all the edges, calling getNodes to select the tail and head node of each edge, but it doesn't work because many nodes are not found. I have tried creating only 30 nodes and 15 edges, and that works properly, but if I try to create a big graph (180 million edges + 20 million nodes) it doesn't. I have also tried calling the optimize method every time the application has created 1 million nodes, but that doesn't work either. Have you tried to create as many nodes as I have described with the newer index-util version?

Thank you,
Núria.

2009/12/4 Núria Trench nuriatre...@gmail.com:

Hi Mattias,

Thank you very much for fixing the problem so fast. I will try it as soon as the new changes are available in the Maven repository.

Núria.

2009/12/4 Mattias Persson matt...@neotechnology.com:

I fixed the problem and also added a cache per key for faster getNodes/getSingleNode lookups during the insert process. However, the cache assumes that there's nothing in the index when the process starts (which will almost always be true) to speed things up even further. You can control the cache size, and whether the cache should be used at all, by overriding the following methods in your LuceneIndexBatchInserterImpl instance (this is also documented in the Javadoc):

boolean useCache()
int getMaxCacheSizePerKey()

The new changes should be available in the Maven repository within an hour.

2009/12/4 Mattias Persson matt...@neotechnology.com:

I think I found the problem... it's indexing as it should, but it isn't reflected in getNodes/getSingleNode properly until you flush/optimize/shutdown the index. I'll try to fix it today!

2009/12/3 Núria Trench nuriatre...@gmail.com:

Thank you very much for your response. If you need more information, you only have to send an e-mail and I will try to explain it better.

Núria.

2009/12/3 Mattias Persson matt...@neotechnology.com:

This is something I'd like to reproduce, and I'll do some testing on it tomorrow.

2009/12/3 Núria Trench nuriatre...@gmail.com:

Hello,

Last week, I decided to download your graph database core in order to use it. First, I created a new project to parse my CSV files and create a new graph database with Neo4j. These CSV files contain 150 million edges and 20 million nodes. When I finished writing the code that creates the graph database, I executed it and, after six hours of execution, the program crashed because of a Lucene exception. The exception is related to index merging and has the following message:

mergeFields produced an invalid result: docCount is 385282378 but fdx file size is 3082259028; now aborting this merge to prevent index corruption

I searched on the net and found that it is a Lucene bug. The libraries used for executing my project were:

neo-1.0-b10
index-util-0.7
lucene-core-2.4.0

So I decided to use a newer Lucene version. I found that you have a newer index-util version, so I updated the libraries:

neo-1.0-b10
index-util-0.9
lucene-core-2.9.1

When I had updated those libraries, I tried to execute my project again and found that, on many occasions, it was not indexing properly. So I tried to optimize the index after every insertion. That solved the indexing problem, but the execution time increased a lot.
I am not using transactions; instead, I am using the Batch Inserter together with the LuceneIndexBatchInserter. So, my question is: what can I do to solve this problem? With index-util-0.7 I cannot finish creating the graph database, and with index-util-0.9 I have to optimize the index on every insertion and the execution never ends.

Thank you very much in advance,
Núria.
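[Editor's note: for readers following the thread, here is a minimal, hypothetical sketch of the pattern being discussed — a batch inserter plus a LuceneIndexBatchInserterImpl, node creation and indexing, an optimize() call, then index lookups while creating the relationships, including the useCache()/getMaxCacheSizePerKey() overrides Mattias mentions above. The method names getNodes/getSingleNode, optimize, useCache and getMaxCacheSizePerKey come from the messages above; the package names, constructor signatures and return conventions are assumptions for the neo-1.0-b10 / index-util-0.9 era and may differ in your version.]

// Hypothetical sketch only -- package names and exact signatures are
// assumptions for the neo-1.0-b10 / index-util-0.9 libraries and may differ.
import java.util.Collections;

import org.neo4j.api.core.RelationshipType;               // assumed package
import org.neo4j.impl.batchinsert.BatchInserterImpl;       // assumed package
import org.neo4j.util.index.LuceneIndexBatchInserterImpl;  // package taken from the stack trace below

public class CsvImportSketch
{
    // Relationship type for the edges; any enum implementing RelationshipType will do.
    enum EdgeType implements RelationshipType { CONNECTED }

    public static void main( String[] args ) throws Exception
    {
        BatchInserterImpl inserter = new BatchInserterImpl( "path/to/graph-db" );

        // Override the cache hooks Mattias describes; the values here are arbitrary.
        LuceneIndexBatchInserterImpl index = new LuceneIndexBatchInserterImpl( inserter )
        {
            @Override
            public boolean useCache()
            {
                return true;
            }

            @Override
            public int getMaxCacheSizePerKey()
            {
                return 1000000;
            }
        };

        // Phase 1: create and index the nodes (one per CSV row, say).
        for ( long csvId = 0; csvId < 1000; csvId++ )
        {
            long node = inserter.createNode(
                    Collections.<String, Object>singletonMap( "csvId", csvId ) );
            index.index( node, "csvId", csvId );
        }

        // Make the index searchable before the lookups below
        // (the behaviour this thread is about).
        index.optimize();

        // Phase 2: create the edges, looking up tail and head through the index.
        for ( long csvId = 0; csvId < 999; csvId++ )
        {
            long tail = index.getSingleNode( "csvId", csvId );
            long head = index.getSingleNode( "csvId", csvId + 1 );
            if ( tail != -1 && head != -1 )   // -1 assumed to mean "not found"
            {
                inserter.createRelationship( tail, head, EdgeType.CONNECTED, null );
            }
        }

        index.shutdown();
        inserter.shutdown();
    }
}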
Re: [Neo] LuceneIndexBatchInserter doubt
Hi Mattias, Núria.

I am also running into scalability problems with the Lucene batch inserter at much smaller numbers (30,000 indexed nodes). I tried calling optimize more often. Increasing ulimit didn't help.

[INFO] Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: /Users/todd/Code/neo4Jprototype/target/classes/data/graph/lucene/name/_0.cfx (Too many open files)
[INFO]   at org.neo4j.util.index.LuceneIndexBatchInserterImpl.getNodes(LuceneIndexBatchInserterImpl.java:186)
[INFO]   at org.neo4j.util.index.LuceneIndexBatchInserterImpl.getSingleNode(LuceneIndexBatchInserterImpl.java:238)
[INFO]   at com.collectiveintelligence.QueryNeo.loadDataToGraph(QueryNeo.java:277)
[INFO]   at com.collectiveintelligence.QueryNeo.main(QueryNeo.java:57)
[INFO] Caused by: java.io.FileNotFoundException: /Users/todd/Code/neo4Jprototype/target/classes/data/graph/lucene/name/_0.cfx (Too many open files)

I tried breaking it up into separate BatchInserter instances, and now it hangs. Can I create more than one batch inserter per process if they run sequentially and non-threaded?

Thanks,
Todd
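[Editor's note: below is a minimal, hypothetical sketch of the sequential, single-threaded pattern Todd asks about — two batch-inserter phases in one process, each shut down before the next is created. The thread does not establish whether this avoids the hang he reports; class names follow the ones used above and the package locations are again assumptions. Note also that the per-key cache Mattias describes assumes the index is empty when the process starts, which would not hold for the second phase here.]

// Hypothetical sketch of two sequential batch-inserter phases in one process.
// Package names are assumptions for this era of the libraries.
import org.neo4j.impl.batchinsert.BatchInserterImpl;       // assumed package
import org.neo4j.util.index.LuceneIndexBatchInserterImpl;

public class SequentialPhasesSketch
{
    public static void main( String[] args ) throws Exception
    {
        // Phase 1: insert and index one batch of data, then shut everything down.
        BatchInserterImpl inserter = new BatchInserterImpl( "path/to/graph-db" );
        LuceneIndexBatchInserterImpl index = new LuceneIndexBatchInserterImpl( inserter );
        // ... create and index the first chunk of nodes here ...
        index.optimize();
        index.shutdown();
        inserter.shutdown();

        // Phase 2: open a fresh inserter over the same store and continue,
        // still sequential and single-threaded.
        inserter = new BatchInserterImpl( "path/to/graph-db" );
        index = new LuceneIndexBatchInserterImpl( inserter );
        // ... create the relationships, looking nodes up through the index ...
        index.shutdown();
        inserter.shutdown();
    }
}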
Re: [Neo] LuceneIndexBatchInserter doubt
Todd, are you sure you have the latest index-util 0.9-SNAPSHOT? This is a bug that we fixed yesterday... (assuming it's the same bug).