On Wed, Aug 5, 2009 at 3:50 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> On Wed, Aug 5, 2009 at 12:08 PM, Nigel <nigelspl...@gmail.com> wrote:
> > We periodically optimize large indexes (100-200 GB) by calling
> > IndexWriter.optimize().  It takes a heck of a long time, and I'm
> > wondering if a more efficient solution might be the following:
> >
> > - Create a new empty index on a different filesystem
> > - Set a merge policy for the new index so it puts everything into
> >   one giant segment (not sure how to do this off-hand, but I assume
> >   it's possible)
> > - Enumerate all documents in the unoptimized index and add them to
> >   the new index
>
> Actually IndexWriter must periodically flush, which will always
> create new segments, which will then always require merging.  I.e.,
> there's no way to just add everything to only one segment in one
> shot.
>

Hmm, that makes sense now that you mention it.  And if you have to merge in
the end anyway, there's no point to my idea of adding docs to a new index.

But addIndexes(IndexReader[]), as you suggest, would solve that problem.
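
Something like the following is what I'm picturing (a rough sketch against
the 2.4-era API; the paths and analyzer are placeholders, not our real
setup):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OneShotOptimize {
  public static void main(String[] args) throws Exception {
    // Placeholder paths: source index, and a fresh index on another FS.
    Directory src = FSDirectory.getDirectory("/path/to/old-index");
    Directory dst = FSDirectory.getDirectory("/path/to/new-index");

    IndexWriter writer = new IndexWriter(dst, new StandardAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    IndexReader reader = IndexReader.open(src);
    try {
      // addIndexes merges the incoming reader directly into the new
      // index, rather than re-adding documents one at a time (which
      // would flush intermediate segments that then need merging again).
      writer.addIndexes(new IndexReader[] { reader });
    } finally {
      reader.close();
      writer.close();
    }
  }
}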


> Merge performance does seem rather slow... I recently profiled it and
> was surprised to find that the merging of the terms dict & postings was
> CPU-bound, even on a modern CPU (Core i7 920) and with 3 merges running
> concurrently.  I think most of the CPU cost comes from the pqueue
> that's used to do the merge sort, plus read/writeVInt.  When Lucene
> [eventually] switches to PForDelta, that should be more CPU-friendly.


That's interesting.  I recently did one of our big merges on a different
server that has the same disks as the one I was using before, but a faster
processor.  It seemed like the merge was quite a bit faster than usual
(though it's possible I was fooled by other factors).

> Also, it's tons of IO because for each merge it must read every single
> byte and write nearly every single byte, so that's ~2X bytes moved.
> Then, if you have more segments in your index than your mergeFactor,
> multiple such merges are needed and you're looking at, at least, 4X
> your index size in net bytes moved.  If you have CFS enabled, it's 8X
> the index size.
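
To put numbers on that: for one of our 100 GB indexes, a single merge pass
is already ~100 GB read plus ~100 GB written, i.e. ~200 GB moved; a second
pass brings it to ~400 GB, and with CFS enabled something like 800 GB.  No
wonder optimize takes as long as it does.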


Not to get too sidetracked, but this reminds me of another question I meant
to ask.  We use the compound format right now.  Our merge factor is
relatively low, so switching to the non-compound format would certainly be
possible without running into problems with too many open files.  Is there
any significant speed difference between compound and non-compound for
indexing, searching, or merging?  (Searching for us would be the most
important by far.)
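
For what it's worth, the switch itself looks simple enough on our side (a
minimal sketch, assuming the 2.x-era setter on IndexWriter):

// Newly flushed and newly merged segments keep their separate
// per-segment files instead of being repacked into a single .cfs,
// which avoids the extra full rewrite of merged bytes.
writer.setUseCompoundFile(false);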


>   * If possible, make sure you always add the same fields to your
>    docs, in the same order (this results in consistent numbering of
>    field name -> number).  This is very much an unexpected
>    gotcha... the merging of stored fields and term vectors is much,
>    much faster if the field numbers are identical.  LUCENE-1737 is
>    open to fix Lucene so it consistently numbers automatically, but
>    it's somewhat tricky because many places in Lucene assume the
>    field names are densely packed.
>

We generally do this already, but some of our fields are nullable, and so
for some documents the name-to-number mapping will be different.  Is there
any value in adding dummy values like "NULL" in these cases?  That
presumably adds overhead of its own, though.
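
For concreteness, our document building looks roughly like this (a
simplified sketch; the SCHEMA array and the values map are made up):

import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocBuilder {
  // Hypothetical fixed schema order: every document adds its fields in
  // this order, so each segment tends to see field names for the first
  // time in the same order and assign the same name -> number mapping.
  private static final String[] SCHEMA = { "id", "title", "body", "author" };

  Document build(Map<String, String> values) {
    Document doc = new Document();
    for (String name : SCHEMA) {
      String value = values.get(name);
      if (value == null) {
        // Skipping a null means this field may first appear later (or
        // not at all) in a segment, which is what can shift the
        // name -> number assignment; a dummy "NULL" value would keep
        // numbering stable, at some index-size cost.
        continue;
      }
      doc.add(new Field(name, value, Field.Store.YES, Field.Index.ANALYZED));
    }
    return doc;
  }
}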

Thanks,
Chris
