On Sat, Jun 25, 2011 at 2:59 AM, Michael Hunger
<michael.hun...@neotechnology.com> wrote:

> Massimo,
>
> When profiling this, it quickly becomes apparent that the issue is within the
> Lucene document (org.apache.lucene.document.Document).
>
> It holds an ArrayList of all its fields, which accounts for essentially all of the memory.
>
> It also contains several methods that walk over that list, filtering it and/or
> returning copies of it.
>
> Another issue that came up: additions take longer and longer (because Lucene
> does a quick-sort on the fields at each flush()).
>
> So my suggestion would be to shard the indexing over several arguments and
> hide that behind a domain-level API; each document should have around 50k
> entries to allow Lucene to handle it gracefully. Once you have introduced this
> API, you should perhaps consider replacing this large index with a more
> appropriate key-value store (like Redis, jdbm, or a custom implementation,
> depending on your real use-case, which you haven't revealed :) ).
>
> Cheers

My use case is this: I have a large series of log rows that I have to
read and understand, but I need to be sure each log row is parsed one and
only one time. So I calculate a SHA1 hash of the log row and put it
in the index; if that hash is already in the index I skip the log
row, because it means it has already been processed.
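For reference, the check I do is roughly the following. This is just a
sketch against the Neo4j 1.x node index API, not my exact code; the index
name "logHashes", the key "sha1" and the helper name are only illustrative:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.index.Index;

    class LogDedup {
        // Sketch: skip a log row if its SHA1 is already in the node index.
        static boolean alreadyProcessed(GraphDatabaseService db, String logRow)
                throws Exception {
            String sha1 = new BigInteger(1, MessageDigest.getInstance("SHA-1")
                    .digest(logRow.getBytes("UTF-8"))).toString(16);
            Index<Node> hashes = db.index().forNodes("logHashes");
            if (hashes.get("sha1", sha1).getSingle() != null) {
                return true;                 // hash already indexed, skip the row
            }
            Transaction tx = db.beginTx();
            try {
                Node n = db.createNode();
                n.setProperty("sha1", sha1);
                hashes.add(n, "sha1", sha1); // remember this hash for next time
                tx.success();
            } finally {
                tx.finish();
            }
            return false;
        }
    }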

I've run a test with jdbm and it is by far a lot worse than plain
Lucene. BTW, if I do the same test with a plain Lucene implementation it
works flawlessly, without any pain, so I guess something is going wrong in
the way Lucene is being used by Neo4j, but I'll try to follow your
suggestion.
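If I understand the suggestion correctly, the sharded variant would just
route each hash to one of several smaller indexes behind a small helper,
something like this (the shard count and the "logHashes-" prefix are
placeholders, not anything fixed):

    // Sketch of the sharded variant: N smaller indexes instead of one huge one.
    static final int N_SHARDS = 32;

    static Index<Node> shardFor(GraphDatabaseService db, String sha1) {
        int shard = (sha1.hashCode() & 0x7fffffff) % N_SHARDS;  // stable routing by hash
        return db.index().forNodes("logHashes-" + shard);
    }

The rest of the code would stay the same; it would just ask shardFor(...)
for the index instead of always using the single big one.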
BTW, I've also tested MongoDB, which is slower but seems more stable...
though my test isn't finished yet...

Cheers
-- 
Massimo