[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless (JIRA) Fri, 15 Jun 2007 12:02:47 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take8.patch

Attached latest patch.

I think this patch is ready to commit.  I will let it sit for a while
so people can review it.

We still need to do LUCENE-845 before it can be committed as is.

However one option instead would be to commit this patch, but leave
IndexWriter flushing by doc count by default and then later switch it
to flush by net RAM usage once LUCENE-845 is done.  I like this option
best.

All tests pass (I've re-enabled the disk full tests and fixed error
handling so they now pass) on Windows XP, Debian Linux and OS X.

Summary of the changes in this rev:

  * Finished cleaning up & commenting code

  * Exception handling: if there is a disk full or any other exception
    while adding a document or flushing then the index is rolled back
    to the last commit point.

  * Added more unit tests

  * Removed my profiling tool from the patch (not intended to be
    committed)

  * Fixed a thread safety issue where, if you flush by doc count, a
    flush would sometimes contain more docs than the count you
    requested.  I moved the thread synchronization for determining
    flush time down into DocumentsWriter.

  * Also fixed thread safety of calling flush with one thread while
    other threads are still adding documents.

  * The biggest change is: absorbed all merging logic back into
    IndexWriter.

    Previously in DocumentsWriter I was tracking my own
    flushed/partial segments and merging them on my own (but using
    SegmentMerger).  Removing that makes DocumentsWriter much simpler:
    now its sole purpose is to gather added docs and write a new
    segment.

    This turns out to be a big win:

      - Code is much simpler (no duplication of "merging"
        policy/logic)

      - 21-25% additional performance gain for autoCommit=false case
        when stored fields & vectors are used

      - IndexWriter.close() no longer takes an unexpectedly long time
        to close in autoCommit=false case

    However I had to make a change to the index format to do this.
    The basic idea is to allow multiple segments to share access to
    the "doc store" (stored fields, vectors) index files.

    The change is quite simple: FieldsReader/VectorsReader are now
    told the doc offset that they should start from when seeking in
    the index stream (this info is stored in SegmentInfo).  When
    merging segments we don't merge the "doc store" files when all
    segments are sharing the same ones (big performance gain);
    otherwise we make a private copy of the "doc store" files (i.e.
    as segments normally are on the trunk today).

    The change is fully backwards compatible (I added a test case to
    the backwards compatibility unit test to be sure) and the change
    is only used when autoCommit=false.

    When autoCommit=false, the writer will append stored fields /
    vectors to a single set of files even though it is flushing normal
    segments whenever RAM is full.  These normal segments all refer to
    the single shared set of "doc store" files.  Then when segments
    are merged, the newly merged segment has its own "private" doc
    stores again.  So the sharing only occurs for the "level 0"
    segments.

    I still need to update fileformats doc with this change.
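
To make the doc offset idea concrete, here is a rough sketch in plain
Java.  It is not code from the patch: DocStore, Segment and merge are
hypothetical stand-ins (they are not the real FieldsReader,
VectorsReader or SegmentInfo classes), and it assumes that segments
sharing one store sit next to each other in it, in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SharedDocStoreSketch {

    /** Hypothetical append-only stand-in for the stored fields / vectors files. */
    static final class DocStore {
        private final List<String> docs = new ArrayList<>();

        /** Appends one document and returns its position in the store. */
        int append(String doc) {
            docs.add(doc);
            return docs.size() - 1;
        }

        String read(int position) {
            return docs.get(position);
        }
    }

    /** Hypothetical stand-in for the per-segment info that records the doc offset. */
    static final class Segment {
        final DocStore store;
        final int docStoreOffset; // position of this segment's first doc in the store
        final int docCount;

        Segment(DocStore store, int docStoreOffset, int docCount) {
            this.store = store;
            this.docStoreOffset = docStoreOffset;
            this.docCount = docCount;
        }

        /** Reads a segment-local doc by seeking from the recorded offset. */
        String document(int localDocId) {
            if (localDocId < 0 || localDocId >= docCount) {
                throw new IndexOutOfBoundsException("doc " + localDocId);
            }
            return store.read(docStoreOffset + localDocId);
        }
    }

    /**
     * Merges a non-empty list of segments.  If all of them share one store,
     * the store is reused untouched (assumes the segments are adjacent and
     * in order); otherwise the docs are copied into a private store.
     */
    static Segment merge(List<Segment> segments) {
        DocStore first = segments.get(0).store;
        boolean allShared = true;
        int total = 0;
        for (Segment s : segments) {
            allShared &= (s.store == first);
            total += s.docCount;
        }
        if (allShared) {
            return new Segment(first, segments.get(0).docStoreOffset, total);
        }
        DocStore privateStore = new DocStore();
        for (Segment s : segments) {
            for (int i = 0; i < s.docCount; i++) {
                privateStore.append(s.document(i));
            }
        }
        return new Segment(privateStore, 0, total);
    }

    public static void main(String[] args) {
        DocStore shared = new DocStore();
        for (int i = 0; i < 5; i++) {
            shared.append("doc" + i);
        }
        Segment a = new Segment(shared, 0, 2);
        Segment b = new Segment(shared, 2, 3);
        System.out.println(b.document(0));
        System.out.println(merge(Arrays.asList(a, b)).store == shared);
    }
}
```

The point of the sketch is only that a segment needs nothing more than
a start offset to read from a store it does not own, which is why a
merge can skip the doc store copy when every input shares one store.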


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, 
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
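
For readers following the flush-by-doc-count thread safety fix
mentioned in the update, here is a minimal sketch of the general idea
in plain Java.  It is not the patch's DocumentsWriter: the class
FlushTriggerSketch and all of its methods are hypothetical, and it only
models the counting.  The assumption is that counting a doc and
deciding to flush happen under one lock, so no flush can contain more
docs than the configured limit.

```java
final class FlushTriggerSketch {
    private final int maxBufferedDocs;
    private int bufferedDocs;
    private int flushCount;
    private int largestFlush;

    FlushTriggerSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers one doc; returns true if this call triggered a flush. */
    synchronized boolean addDocument() {
        bufferedDocs++;
        // Count and decision share the same lock, so two threads cannot
        // both slip past the limit before a flush happens.
        if (bufferedDocs >= maxBufferedDocs) {
            flushLocked();
            return true;
        }
        return false;
    }

    /** Explicit flush; safe to call while other threads keep adding docs. */
    synchronized void flush() {
        if (bufferedDocs > 0) {
            flushLocked();
        }
    }

    // Caller must hold the lock.
    private void flushLocked() {
        largestFlush = Math.max(largestFlush, bufferedDocs);
        flushCount++;
        bufferedDocs = 0;
    }

    synchronized int bufferedDocs() { return bufferedDocs; }
    synchronized int flushCount() { return flushCount; }
    synchronized int largestFlush() { return largestFlush; }

    public static void main(String[] args) throws InterruptedException {
        FlushTriggerSketch writer = new FlushTriggerSketch(10);
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 250; i++) {
                    writer.addDocument();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("largest flush: " + writer.largestFlush());
    }
}
```

A RAM-based trigger would have the same shape, with a byte count in
place of bufferedDocs; that variant is not shown here.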

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


