On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:
We have a fairly large scale system - about 200 million docs and fairly high 
indexing activity - about 300k docs per day with peak ingestion rates of about 
20 docs per sec. I want to work out what a good mergeFactor setting would be by 
testing with different mergeFactor settings. I think the default of 10 might be 
high; I want to try 5 and compare. Unless I know when a merge starts and 
finishes, it would be quite difficult to work out the impact of changing 
mergeFactor. I want to be able to measure how long merges take, run queries 
during the merge activity, and see what the response times are, etc.

With a lot of indexing activity, if you are attempting to avoid large merges, I would think you would want a higher mergeFactor, not a lower one, combined with occasional optimizes during non-peak hours. With a small mergeFactor, you will be merging a lot more often, and you are more likely to encounter merges of already-merged segments, which can be very slow.
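
If you want to script those off-peak optimizes, here is a minimal sketch (Python, standard library only) of hitting a core's update handler with optimize=true; the host and core name are placeholders, not anything from a real setup:

    # Issue an explicit optimize against one Solr core.  Intended to be
    # run from cron during non-peak hours.  The URL is a placeholder.
    import urllib.request

    def optimize_core(base_url):
        # Solr's update handler accepts optimize=true as a URL parameter;
        # waitSearcher=true blocks until the new searcher is in place.
        url = base_url + "/update?optimize=true&waitSearcher=true"
        with urllib.request.urlopen(url) as resp:
            return resp.status

    if __name__ == "__main__":
        print(optimize_core("http://localhost:8983/solr/bigindex"))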

My index is nearing 70 million documents. I've got seven shards - six large indexes with about 11.5 million docs each, and a small index that I try to keep below half a million documents. The small index contains the newest documents, between 3.5 and 7 days' worth. With this setup and the way I manage it, large merges pretty much never happen.
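
To make that layout concrete, this is roughly how the seven cores can be queried as one logical index using Solr's shards parameter; the host and core names below are invented for illustration:

    # Distributed query across the six large cores plus the small one.
    # Host and core names are made up; substitute your own.
    import urllib.parse, urllib.request

    SHARDS = ",".join(
        ["idxhost:8983/solr/large%d" % n for n in range(1, 7)]
        + ["idxhost:8983/solr/incremental"]   # small core, newest docs
    )

    def search(query):
        params = urllib.parse.urlencode(
            {"q": query, "shards": SHARDS, "wt": "json"})
        # Any one core can coordinate the distributed request.
        url = "http://idxhost:8983/solr/large1/select?" + params
        with urllib.request.urlopen(url) as resp:
            return resp.read()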

Once a minute, I do an update cycle. This looks for and applies deletions, reinserts, and new document inserts. New document inserts happen only on the small index, and there are usually a few dozen documents to insert on each update cycle. Deletions and reinserts can happen on any of the seven shards, but there are not usually deletions and reinserts on every update cycle, and the number of reinserts is usually very very small. Once an hour, I optimize the small index, which takes about 30 seconds. Once a day, I optimize one of the large indexes during non-peak hours, so every large index gets optimized once every six days. This takes about 15 minutes, during which deletes and reinserts are not applied, but new document inserts continue to happen.
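
For anyone who wants to picture that cadence, here is a rough scheduling skeleton, not my actual code; the update and optimize functions are empty placeholders for the delete/reinsert/insert logic described above:

    # Rough skeleton of the maintenance cadence: an update cycle every
    # minute, an optimize of the small index every hour, and an optimize
    # of one large index per day.  Function bodies are placeholders.
    import time

    LARGE_CORES = ["large%d" % n for n in range(1, 7)]

    def update_cycle():
        pass   # deletes/reinserts on any core, new inserts on the small core

    def optimize(core):
        pass   # send optimize=true to this core's update handler

    def main():
        minute = 0
        while True:
            update_cycle()
            minute += 1
            if minute % 60 == 0:                  # once an hour
                optimize("incremental")           # takes about 30 seconds
            if minute % (60 * 24) == 0:           # once a day, off-peak
                day = minute // (60 * 24)
                optimize(LARGE_CORES[day % 6])    # each one every 6 days
            time.sleep(60)

    if __name__ == "__main__":
        main()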

My mergeFactor is set to 35. I wanted a large value here, and this particular number has a side effect -- uniformity in segment filenames on the disk during full rebuilds, because Lucene uses a base-36 segment numbering scheme. I usually end up with fewer than 10 segments in the larger indexes, which means they don't do merges. The small index does do merges, but I have never had a problem with those merges going slowly.
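
If I have the naming scheme right, the arithmetic behind that choice looks like this (just an illustration of Lucene's segment names, not output from my index):

    # Lucene names segments _0, _1, ... _9, _a, ... _z, _10, and so on.
    # With a mergeFactor of 35, the 35 flushed segments _0 through _y
    # merge into _z, so each merge generation fills one base-36 "digit
    # row" and the filenames stay a uniform length.
    DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

    def segment_name(n):
        # Convert a segment counter to Lucene's _<base36> style name.
        name = ""
        while True:
            name = DIGITS[n % 36] + name
            n //= 36
            if n == 0:
                return "_" + name

    print([segment_name(i) for i in range(35)])   # _0 through _y
    print(segment_name(35))                       # _z, the merged segment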

Because I do occasionally optimize, I am fairly sure that even when I do have merges, they happen with 35 very small segment files and leave the large initial segment alone. I have not tested this theory, but it seems the most sensible way to do things, and I've found that Lucene/Solr usually does things in a sensible manner. If I am wrong here (I am using 3.5, with its improved merging), I would appreciate knowing.

Thanks,
Shawn
