On Wed, Aug 5, 2009 at 3:50 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> On Wed, Aug 5, 2009 at 12:08 PM, Nigel <nigelspl...@gmail.com> wrote:
> > We periodically optimize large indexes (100 - 200gb) by calling
> > IndexWriter.optimize(). It takes a heck of a long time, and I'm
> > wondering if a more efficient solution might be the following:
> >
> > - Create a new empty index on a different filesystem
> > - Set a merge policy for the new index so it puts everything into one
> >   giant segment (not sure how to do this off-hand, but I assume it's
> >   possible)
> > - Enumerate all documents in the unoptimized index and add them to the
> >   new index
>
> Actually IndexWriter must periodically flush, which will always
> create new segments, which will then always require merging. Ie
> there's no way to just add everything to only one segment in one
> shot.

Hmm, that makes sense now that you mention it. And if you have to merge in
the end anyway, there's no point to my idea of adding docs to a new index.
But addIndexes(IndexReader[]) as you suggest would solve that problem. (A
rough sketch of that approach is appended below.)

> Merge performance does seem rather slow... I recently profiled it and
> was surprised to find that the merging of terms dict & postings was CPU
> bound, even on a modern CPU (core i7 920) and with 3 merges running
> concurrently. I think most of the CPU cost comes from the pqueue
> that's used to do the merge sort, plus read/writeVInt. When Lucene
> [eventually] switches to PForDelta, that should be more CPU friendly.

That's interesting. I recently did one of our big merges on a different
server that has the same disks as the one I was using before, but a faster
processor. The merge seemed quite a bit faster than usual (though it's
possible I was fooled by other factors).

> Also, it's tons of IO because for each merge it must read every single
> byte and write nearly every single byte, so that's ~2X bytes moved.
> Then, if you have more segments in your index than your mergeFactor,
> multiple such merges are needed and you're looking at, at least, 4X
> your index size in net bytes moved. If you have CFS enabled, it's 8X
> the index size.

Not to get too sidetracked, but this reminds me of another question I
meant to ask. We use the compound format right now. Our merge factor is
relatively low, so switching to the non-compound format would certainly
be possible without running into problems with too many open files. Is
there any significant speed difference between compound and non-compound
for indexing, searching, or merging? (Searching for us would be the most
important by far.)

> * If possible, make sure you always add the same fields to your
>   docs, in the same order (this results in consistent numbering of
>   field name -> number). This is very much an unexpected
>   gotcha... the merging of stored fields and term vectors is much,
>   much faster if the field numbers are identical. LUCENE-1737 is
>   open to fix Lucene so it consistently numbers automatically, but
>   it's somewhat tricky because many places in Lucene assume the
>   field names are densely packed.

We generally do this already, but some of our fields are nullable, so for
some documents the number-to-name mapping will be different. Is there any
value in adding dummy values like "NULL" in these cases? That presumably
adds overhead of its own, though.

Thanks,
Chris
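
P.S. A few sketches appended for concreteness. First, the
addIndexes(IndexReader[]) approach as I understand it -- a minimal sketch
assuming the Lucene 2.4-era API, with placeholder paths; if I'm reading
the javadocs right, this merges at the index level (no re-tokenizing) and
leaves the target optimized:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeViaAddIndexes {
  public static void main(String[] args) throws Exception {
    // Placeholder paths: the existing (unoptimized) index and the new one.
    Directory oldDir = FSDirectory.getDirectory(new File("/path/to/old-index"));
    Directory newDir = FSDirectory.getDirectory(new File("/path/to/new-index"));

    IndexReader oldReader = IndexReader.open(oldDir);
    // create=true: start the new index empty.
    IndexWriter writer = new IndexWriter(newDir, new StandardAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    // Merge the old index directly into the new one; documents are copied
    // at the segment level rather than re-analyzed and re-added.
    writer.addIndexes(new IndexReader[] { oldReader });

    writer.close();
    oldReader.close();
  }
}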
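
Second, on the readVInt cost Mike mentioned: the decode loop is one
data-dependent branch per byte, for every doc/freq/position delta in the
postings being merged. Roughly like this (paraphrased from memory of
IndexInput, so treat it as an illustration rather than the exact source):

import java.io.IOException;

import org.apache.lucene.store.IndexInput;

public class VIntSketch {
  // VInt: 7 payload bits per byte; a set high bit means "another byte
  // follows". The loop condition is an unpredictable branch per byte.
  static int readVInt(IndexInput in) throws IOException {
    byte b = in.readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }
}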
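
Third, disabling the compound format, if we go that route -- here dir and
analyzer just stand in for whatever the writer already uses:

IndexWriter writer = new IndexWriter(dir, analyzer, false,
    IndexWriter.MaxFieldLength.UNLIMITED);
// Write new segments as individual files (.frq, .prx, .fdt, ...) instead
// of packing them into a single .cfs compound file.
writer.setUseCompoundFile(false);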
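
And last, what I mean by padding nullable fields with a sentinel; the
field names and the "NULL" value are made up for illustration:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocBuilder {
  // Add the same fields in the same order for every document, substituting
  // a sentinel when a value is null, so each doc produces the identical
  // field-name -> field-number mapping.
  static Document makeDoc(String title, String author) {
    Document doc = new Document();
    doc.add(new Field("title", title == null ? "NULL" : title,
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("author", author == null ? "NULL" : author,
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }
}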