[ https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487049 ]
Michael McCandless commented on LUCENE-856:
-------------------------------------------

OK, I re-ran the above test (10 MM docs @ ~5,500 bytes plain text each) with autoCommit=false: this time it took 5 hrs 7 minutes, which is 40.7% less time than the autoCommit=true test above (8 hrs 38 minutes). Both of these tests were run with the patch from LUCENE-843.

So this means, if all you need to do is build a massive index with term vector positions & offsets, the fastest way to do so is with the patch from LUCENE-843 and with autoCommit=false on your writer.

Basically LUCENE-843 makes autoCommit=false quite a bit faster for a very large index, assuming you are storing term vectors / stored fields.

Still, I think optimizing segment merging is important because for many uses of Lucene, the "interactivity" (how quickly a searcher sees the recently indexed documents) is very important. For such cases you should open a writer with autoCommit=false and then periodically close & re-open it to publish the indexed documents to the searchers. With that model, segment merging will still be a factor slowing down indexing (though how much of a factor depends on how often you close/open your writers).

> Optimize segment merging
> ------------------------
>
>                 Key: LUCENE-856
>                 URL: https://issues.apache.org/jira/browse/LUCENE-856
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> With LUCENE-843, the time spent indexing documents has been
> substantially reduced and now the time spent merging is a sizable
> portion of indexing time.
>
> I ran a test using the patch for LUCENE-843, building an index of 10
> million docs, each with ~5,500 bytes of plain text, with term vectors
> (positions + offsets) on and with 2 small stored fields per document.
> RAM buffer size was 32 MB. I didn't optimize the index in the end,
> though optimize speed would also improve if we optimize segment
> merging. Index size is 86 GB.
>
> Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
> of which was spent merging. That's 65.6% of the time!
>
> Most of this time is presumably IO which probably can't be reduced
> much unless we improve overall merge policy and experiment with values
> for mergeFactor / buffer size.
>
> These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO
> system is RAID 0 of 4 drives, so these times are probably better than
> the more common case of a single hard drive, which would likely have
> slower IO.
>
> I think there are some simple things we could do to speed up merging:
>
>   * Experiment with buffer sizes -- maybe larger buffers for the
>     IndexInputs used during merging could help? Because at a default
>     mergeFactor of 10, the disk heads must do a lot of seeking back and
>     forth between these 10 files (and then to the 11th file where we
>     are writing).
>
>   * Use byte copying when possible, e.g. if there are no deletions on a
>     segment we can almost (I think?) just copy things like prox
>     postings, stored fields, term vectors, instead of fully parsing to
>     Java objects and then re-serializing them.
>
>   * Experiment with mergeFactor / different merge policies. For
>     example I think LUCENE-854 would reduce time spent merging for a
>     given index size.
>
> This is currently just a place to list ideas for optimizing segment
> merges. I don't plan on working on this until after LUCENE-843.
>
> Note that for "autoCommit=false", this optimization is somewhat less
> important, depending on how often you actually close/open a new
> IndexWriter. In the extreme case, if you open a writer, add 100 MM
> docs, close the writer, then no segment merges happen at all.
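
For anyone who wants to check the percentages quoted in this thread, here is a minimal sketch that re-derives them from the reported wall-clock times. The class and method names (MergeTimings, percentOf, percentTimeSaved) are made up for illustration and are not part of Lucene; the only inputs are the three times reported above.

```java
import java.util.Locale;

// Illustrative helper (hypothetical, not part of Lucene): re-derives the
// percentages quoted in this thread from the reported wall-clock times.
final class MergeTimings {

    /** Converts hours and minutes to total minutes. */
    static int minutes(int hours, int mins) {
        return hours * 60 + mins;
    }

    /** Returns part as a percentage of whole. */
    static double percentOf(int part, int whole) {
        return 100.0 * part / whole;
    }

    /** Returns how much less time 'after' took than 'before', in percent. */
    static double percentTimeSaved(int before, int after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        int totalAutoCommitTrue = minutes(8, 38);   // 518 minutes
        int mergingAutoCommitTrue = minutes(5, 40); // 340 minutes
        int totalAutoCommitFalse = minutes(5, 7);   // 307 minutes

        // Share of the autoCommit=true run that was spent merging.
        System.out.printf(Locale.ROOT, "merge share: %.1f%%%n",
                percentOf(mergingAutoCommitTrue, totalAutoCommitTrue));   // 65.6%

        // Reduction in total time when switching to autoCommit=false.
        System.out.printf(Locale.ROOT, "time saved:  %.1f%%%n",
                percentTimeSaved(totalAutoCommitTrue, totalAutoCommitFalse)); // 40.7%
    }
}
```

Note that the 40.7% figure is a reduction in elapsed time (307 minutes versus 518 minutes), not a throughput ratio.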