Optimize segment merging
------------------------

                 Key: LUCENE-856
                 URL: https://issues.apache.org/jira/browse/LUCENE-856
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.1
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor

With LUCENE-843, the time spent indexing documents has been substantially reduced, and now the time spent merging is a sizable portion of indexing time.

I ran a test using the patch for LUCENE-843, building an index of 10 million docs, each with ~5,500 bytes of plain text, with term vectors (positions + offsets) on and with 2 small stored fields per document. RAM buffer size was 32 MB. I didn't optimize the index in the end, though optimize speed would also improve if we optimize segment merging. Index size is 86 GB.

Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes of which was spent merging. That's 65.6% of the time!

Most of this time is presumably IO, which probably can't be reduced much unless we improve the overall merge policy and experiment with values for mergeFactor / buffer size.

These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO system is RAID 0 of 4 drives, so these times are probably better than in the more common case of a single hard drive, which would likely have slower IO.

I think there are some simple things we could do to speed up merging:

  * Experiment with buffer sizes -- maybe larger buffers for the IndexInputs used during merging could help? At a default mergeFactor of 10, the disk heads must do a lot of seeking back and forth between these 10 files (and then to the 11th file where we are writing).

  * Use byte copying when possible, e.g. if there are no deletions on a segment we can almost (I think?) just copy things like prox postings, stored fields, and term vectors, instead of fully parsing them to Java objects and then re-serializing them.

  * Experiment with mergeFactor / different merge policies. For example, I think LUCENE-854 would reduce the time spent merging for a given index size.

This is currently just a place to list ideas for optimizing segment merges. I don't plan on working on this until after LUCENE-843.

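To illustrate the byte-copying idea above, here is a minimal sketch. It does not use Lucene's real file formats or APIs: the length-prefixed "segment" layout and the names MergeCopySketch, writeSegment, mergeByParsing and mergeByCopying are all made up for illustration. It only shows that when nothing has to be dropped or rewritten, appending the raw bytes yields the same output as decoding every record to a Java object and re-encoding it. Real segment files would likely also need things like document numbers remapped, which this toy format ignores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical toy example; the record layout here is NOT Lucene's format.
class MergeCopySketch {

    /** Writes a toy "segment": each record is an int length followed by that many bytes. */
    static byte[] writeSegment(String... docs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String doc : docs) {
            byte[] payload = doc.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);
            out.write(payload);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Slow path: decode every record into a String object, then re-serialize it. */
    static byte[] mergeByParsing(byte[]... segments) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (byte[] segment : segments) {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment));
            while (in.available() > 0) {
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                String doc = new String(payload, StandardCharsets.UTF_8); // parse to an object
                byte[] encoded = doc.getBytes(StandardCharsets.UTF_8);    // re-serialize
                out.writeInt(encoded.length);
                out.write(encoded);
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Fast path: assumes no deletions, so each segment's raw bytes are appended as-is. */
    static byte[] mergeByCopying(byte[]... segments) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (byte[] segment : segments) {
            bytes.write(segment, 0, segment.length);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] a = writeSegment("alpha", "beta");
        byte[] b = writeSegment("gamma");
        // Both merge strategies should produce identical bytes for deletion-free segments.
        System.out.println(Arrays.equals(mergeByParsing(a, b), mergeByCopying(a, b)));
    }
}
```

The copying path never allocates per-record objects, which is where the savings would presumably come from; whether that matters next to the IO cost is exactly what would need measuring.
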
Note that for "autoCommit=false", this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, close the writer, then no segment merges happen at all.

-- 
This message is automatically generated by JIRA.