Hi,

I have a Lucene index with a few million entries, and I will need to
add batches of a few hundred thousand or a few million additional
entries.  Unfortunately, I absolutely need to have all indexed entries
available when inserting a new one, even within one batch, in order to
do some duplicate detection (using Lucene).

I believe this means having to close the writer and reopen the reader
to reflect the changes after each add.  I'm thinking of having the
original index remain static and add the new entries to a separate
index and merge the two later on.  I can even query them separately
during the batch insertion if that gives me better performance.
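
A rough sketch of the two-index idea described above, in plain Java with no Lucene calls. The class and method names are made up for illustration, and the "exact match on a normalized key" rule is only an assumption; in the real setup each lookup would be a Lucene query, possibly a fuzzier one. The point it shows is that the large index is only ever read, while the entries added during the batch are tracked separately, so only the small side changes between adds.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in: each lookup would really be a Lucene search; here it
// is reduced to an exact match on a normalized key.
final class TwoTierDuplicateCheck {
    // Lookup against the large, static index; it is never modified here.
    private final Predicate<String> staticIndex;
    // Keys added during the current batch (stands in for the small new index).
    private final Set<String> batchKeys = new HashSet<>();

    TwoTierDuplicateCheck(Predicate<String> staticIndex) {
        this.staticIndex = staticIndex;
    }

    // Assumed normalization: trim, lower-case, collapse runs of whitespace.
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Returns true if the entry was new and has been recorded for this batch.
    boolean addIfNew(String text) {
        String key = normalize(text);
        if (staticIndex.test(key)) {
            return false; // duplicate of an entry in the static index
        }
        return batchKeys.add(key); // false if already seen within this batch
    }

    int batchSize() {
        return batchKeys.size();
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>();
        existing.add("apache lucene");
        TwoTierDuplicateCheck check = new TwoTierDuplicateCheck(existing::contains);
        System.out.println(check.addIfNew("Apache  Lucene")); // false: in static index
        System.out.println(check.addIfNew("index merging"));  // true: new entry
        System.out.println(check.addIfNew("Index Merging "));  // false: seen in this batch
    }
}
```

Whether an in-memory set like this could replace the second index, or would only sit in front of it, depends on how fuzzy the duplicate check has to be.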

The questions:
1. How costly is merging two big indexes?
2. When / how often do I need to call optimize() on the new index?
3. Should I just keep MergeFactor at the default value?
4. If I have autoCommit=true, I can keep the writer open, but still need
   to flush() and reopen readers to reflect the changes, right?
5. Is it better to have a MultiReader on both indexes or query them
   separately so I don't have to reopen the old one every time?

Additional info:
Documents are very short, just a few words each.  I have 4 gigs of
RAM in the machine, of which I could allocate quite a bit as heap for
the writer if that helps.

Thanks for any hints on how to best go about this,

Jens

P.S.: all my mails to the list get silently dropped when sending through GMail, possibly because the sender is not the same as the from: header (and only the from: actually contains the subscribed address). This is very annoying and makes it impossible for me to write to the list in many situations.
