Optimize segment merging
------------------------

                 Key: LUCENE-856
                 URL: https://issues.apache.org/jira/browse/LUCENE-856
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.1
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor


With LUCENE-843, the time spent indexing documents has been
substantially reduced and now the time spent merging is a sizable
portion of indexing time.

I ran a test using the patch for LUCENE-843, building an index of 10
million docs, each with ~5,500 bytes of plain text, with term vectors
(positions + offsets) on and with 2 small stored fields per document.
RAM buffer size was 32 MB.  I didn't optimize the index at the end,
though optimize speed would also improve if we optimize segment
merging.  Index size is 86 GB.

Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
of which was spent merging.  That's 65.6% of the time!

Most of this time is presumably IO, which probably can't be reduced
much unless we improve the overall merge policy and experiment with
values for mergeFactor / buffer size.
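
To get a rough feel for how mergeFactor affects the amount of data
rewritten, here is a minimal sketch of a simplified cost model.  It is
not Lucene's actual merge policy: it assumes every flush produces a
segment of size 1, and that whenever mergeFactor segments accumulate
at one level they are merged into a single segment at the next level.
The class and method names are hypothetical, and the model only counts
units written by merges, ignoring seeks, reads and deletions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of level-based merging; not Lucene code.
final class MergeCostModel {

    /**
     * Returns the total number of size-1 "units" written by merges after
     * the given number of flushed segments, each assumed to have size 1.
     */
    static long mergeWrites(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // counts.get(level) = number of segments currently at that level
        List<Integer> counts = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long size = 1;
            increment(counts, level); // newly flushed segment
            // Cascade: a full level is merged into one larger segment.
            while (counts.get(level) == mergeFactor) {
                counts.set(level, 0);
                size *= mergeFactor;  // size of the merged segment
                written += size;      // the merge rewrites all of it
                level++;
                increment(counts, level);
            }
        }
        return written;
    }

    private static void increment(List<Integer> counts, int level) {
        if (level == counts.size()) {
            counts.add(0);
        }
        counts.set(level, counts.get(level) + 1);
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {2, 10, 100}) {
            System.out.println("mergeFactor=" + mergeFactor
                + " units written by merges=" + mergeWrites(100, mergeFactor));
        }
    }
}
```

Under this model, 100 flushes at mergeFactor 10 cause 200 units of
merge writes, versus 100 units at mergeFactor 100.  A real test would
also have to account for the extra seeking that comes with more open
segments.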

These tests were run on a Mac Pro with 2 dual-core Intel CPUs.  The IO
system is RAID 0 across 4 drives, so these times are probably better
than in the more common case of a single hard drive, where IO would
likely be slower.

I think there are some simple things we could do to speed up merging:

  * Experiment with buffer sizes -- maybe larger buffers for the
    IndexInputs used during merging could help?  At the default
    mergeFactor of 10, the disk heads must do a lot of seeking back
    and forth between these 10 files (and then to the 11th file where
    we are writing).

  * Use byte copying when possible, e.g. if there are no deletions on
    a segment we can almost (I think?) just copy things like prox
    postings, stored fields, term vectors, instead of fully parsing
    them to Java objects and then re-serializing them.

  * Experiment with mergeFactor / different merge policies.  For
    example I think LUCENE-854 would reduce time spent merging for a
    given index size.
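
As an illustration of the first two ideas, here is a minimal sketch of
copying a known-length span of raw bytes with a caller-chosen buffer
size, instead of decoding and re-encoding each record.  It uses plain
java.io streams rather than Lucene's IndexInput / IndexOutput, and the
class name, method name and parameters are hypothetical; whether a
given Lucene file can really be copied byte-for-byte depends on the
file format and on the segment having no deletions.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of Lucene.
final class RawCopy {

    /**
     * Copies exactly `length` bytes from `in` to `out` using a buffer of
     * `bufferSize` bytes, and returns the number of bytes copied.
     */
    static long copyBytes(InputStream in, OutputStream out,
                          long length, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be > 0");
        }
        byte[] buffer = new byte[bufferSize];
        long remaining = length;
        while (remaining > 0) {
            // Never ask for more than what is left of the span.
            int chunk = (int) Math.min(buffer.length, remaining);
            int read = in.read(buffer, 0, chunk);
            if (read < 0) {
                throw new EOFException(
                    "input ended with " + remaining + " bytes still to copy");
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return length;
    }
}
```

A larger bufferSize means fewer, bigger reads per source file, which
is the effect the buffer-size experiment would try to measure when
many segment files are being read at once.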

This is currently just a place to list ideas for optimizing segment
merges.  I don't plan on working on this until after LUCENE-843.

Note that for "autoCommit=false", this optimization is somewhat less
important, depending on how often you actually close/open a new
IndexWriter.  In the extreme case, if you open a writer, add 100
million docs, and then close the writer, no segment merges happen at
all.



