On Sat, Jun 25, 2011 at 2:59 AM, Michael Hunger <michael.hun...@neotechnology.com> wrote:
> Massimo,
>
> when profiling this it quickly becomes apparent that the issue is within the lucene document (org.apache.lucene.document.Document).
>
> It holds an arraylist of all its fields, which accounts for all the memory.
>
> It also contains several methods that walk over that list (filtering it) and/or return copies of it.
>
> Another issue that came up: the addition takes longer and longer (because Lucene does a quick-sort on the fields at each flush()).
>
> So my suggestion would be to shard the indexing over several arguments and hide that behind a domain-level API; each document should have around 50k entries to allow Lucene to handle it gracefully. After you introduce this API you should perhaps consider replacing this large index with a more appropriate key-value store (like redis, jdbm, or a custom impl - depending on your real use-case, which you haven't revealed :) ).
>
> Cheers

My use case is this: I have a big series of log rows which I have to read and process, but I need to be sure each log row is parsed once and only once. So I calculate a SHA1 hash of the log row and put it in the index; if that hash is already in the index, I skip the log row because it has already been processed.

I've made a test with jdbm and it is by far a lot worse than plain Lucene.

BTW, if I do the same test with a plain Lucene implementation it works flawlessly without any pain, so I guess something is going wrong in the way Lucene is being used by neo4j, but I'll try to follow your suggestion.

BTW I've also tested MongoDB, which is slower but seems more stable... but my test isn't finished yet...

Cheers
--
Massimo
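
For illustration only, here is a minimal sketch (not Massimo's actual code) of the SHA1 dedup check described above, written against the Neo4j 1.x embedded index API and combined with one reading of Michael's sharding suggestion: the index is split on the first two hex characters of the hash. The class name, index name, and shard scheme are assumptions made for the example.

import java.security.MessageDigest;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

public class LogRowDeduplicator {

    private final GraphDatabaseService db;

    public LogRowDeduplicator(GraphDatabaseService db) {
        this.db = db;
    }

    // One index per two-hex-char hash prefix (256 shards), so no single
    // index structure has to hold every hash on its own (illustrative scheme).
    private Index<Node> shardFor(String sha1) {
        return db.index().forNodes("processedRows-" + sha1.substring(0, 2));
    }

    // Returns true if the row is new and has now been recorded; false if its
    // hash was already present, i.e. the row was processed before.
    public boolean markIfNew(String logRow) throws Exception {
        String sha1 = sha1Hex(logRow);
        Index<Node> shard = shardFor(sha1);
        if (shard.get("sha1", sha1).getSingle() != null) {
            return false; // already seen, skip this row
        }
        Transaction tx = db.beginTx();
        try {
            Node marker = db.createNode();
            marker.setProperty("sha1", sha1);
            shard.add(marker, "sha1", sha1);
            tx.success();
        } finally {
            tx.finish(); // Neo4j 1.x transaction API
        }
        return true;
    }

    private static String sha1Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(input.getBytes("UTF-8"))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Splitting over 256 shards spreads the hash entries so that no single index grows without bound, which is the spirit of Michael's suggestion; the exact shard count here is arbitrary.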