inserting millions of entries

Jens Grivolla Thu, 28 Jun 2007 08:32:21 -0700

Hi,

I have a Lucene index with a few million entries, and I will need to
add batches of a few hundred thousand or a few million additional
entries.  Unfortunately, I absolutely need to have all indexed entries
available when inserting a new one, even within one batch, in order to
do some duplicate detection (using Lucene).


I believe this means having to close the writer and reopen the reader
to reflect the changes after each add.  I'm thinking of having the
original index remain static and add the new entries to a separate
index and merge the two later on.  I can even query them separately
during the batch insertion if that gives me better performance.
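
To make the two-index idea concrete, here is a rough sketch of the lookup order I have in mind. It is not Lucene code: the two HashSets merely stand in for the static main index and the small batch index, and the class name TwoTierDedup, the normalize() key and the merge() step are all hypothetical names made up for illustration. The real duplicate check would be a Lucene query rather than an exact match on a normalized string.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Stand-in for the two-index setup (assumption: plain sets instead of Lucene).
// "staticIndex" models the large existing index, untouched during the batch;
// "batchIndex" models the small separate index that receives new entries.
class TwoTierDedup {
    private final Set<String> staticIndex = new HashSet<>();
    private final Set<String> batchIndex = new HashSet<>();

    TwoTierDedup(Collection<String> existingEntries) {
        for (String e : existingEntries) {
            staticIndex.add(normalize(e));
        }
    }

    // Hypothetical duplicate key: trimmed, lower-cased, whitespace collapsed.
    static String normalize(String entry) {
        return entry.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Checks the static index first, then the batch index.
    // Returns true if the entry was new and has been added to the batch index.
    boolean addIfNew(String entry) {
        String key = normalize(entry);
        if (staticIndex.contains(key)) {
            return false;
        }
        return batchIndex.add(key);
    }

    // Models merging the batch index into the main index after the batch.
    void merge() {
        staticIndex.addAll(batchIndex);
        batchIndex.clear();
    }

    int batchSize() { return batchIndex.size(); }

    int staticSize() { return staticIndex.size(); }
}
```

The point of the sketch is only that the big index is read but never written during the batch, so only the small one has to reflect each add immediately.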

The questions:
- How costly is merging two big indexes?
- When / how often do I need to call optimize() on the new index?
- Should I just keep MergeFactor at the default value?
- If I have autoCommit=true, I can keep the writer open, but still need
  to flush() and reopen readers to reflect the changes, right?
- Is it better to have a MultiReader on both indexes or query them
  separately so I don't have to reopen the old one every time?

Additional info:
Documents are very short, just a few words each.  I have 4 gigs of
RAM in the machine, of which I could allocate quite a bit as heap for
the writer if that helps.

Thanks for any hints on how to best go about this,

Jens

P.S.: All my mails to the list get silently dropped when sending through
Gmail, possibly because the sender is not the same as the From: header
(and only the From: actually contains the subscribed address). This is
very annoying and makes it impossible for me to write to the list in
many situations.


