[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-843:
--------------------------------------
Attachment: LUCENE-843.take8.patch
Attached latest patch.
I think this patch is ready to commit. I will let it sit for a while
so people can review it.
We still need to do LUCENE-845 before this patch can be committed as is.
Alternatively, we could commit this patch now but leave IndexWriter
flushing by doc count by default, and later switch the default to
flushing by net RAM usage once LUCENE-845 is done. I like this option
best.
All tests pass (I've re-enabled the disk full tests and fixed error
handling so they now pass) on Windows XP, Debian Linux and OS X.
Summary of the changes in this rev:
* Finished cleaning up & commenting code
* Exception handling: if there is a disk full or any other exception
while adding a document or flushing then the index is rolled back
to the last commit point.
* Added more unit tests
* Removed my profiling tool from the patch (not intended to be
committed)
* Fixed a thread safety issue where, if you flush by doc count, a
flush would sometimes contain more docs than the count you
requested. I moved the thread synchronization for determining
flush time down into DocumentsWriter.
* Also fixed thread safety of calling flush with one thread while
other threads are still adding documents.
* The biggest change is: absorbed all merging logic back into
IndexWriter.
Previously, DocumentsWriter tracked its own flushed/partial
segments and merged them on its own (but using SegmentMerger).
Moving that logic out makes DocumentsWriter much simpler: now its
sole purpose is to gather added docs and write a new segment.
This turns out to be a big win:
- Code is much simpler (no duplication of "merging"
policy/logic)
- 21-25% additional performance gain for autoCommit=false case
when stored fields & vectors are used
- IndexWriter.close() no longer takes an unexpectedly long time to
close in autoCommit=false case
However I had to make a change to the index format to do this.
The basic idea is to allow multiple segments to share access to
the "doc store" (stored fields, vectors) index files.
The change is quite simple: FieldsReader/VectorsReader are now
told the doc offset that they should start from when seeking in
the index stream (this info is stored in SegmentInfo). When
merging segments, we don't merge the "doc store" files if all the
segments share the same ones (a big performance gain); otherwise
we make a private copy of the "doc store" files (i.e., what segments
normally have on the trunk today).
The change is fully backwards compatible (I added a test case to
the backwards compatibility unit test to be sure) and the change
is only used when autoCommit=false.
When autoCommit=false, the writer will append stored fields /
vectors to a single set of files even though it is flushing normal
segments whenever RAM is full. These normal segments all refer to
the single shared set of "doc store" files. Then when segments
are merged, the newly merged segment has its own "private" doc
stores again. So the sharing only occurs for the "level 0"
segments.
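To make the doc offset idea concrete, here is a minimal sketch in plain
Java. It is not code from the patch: SharedDocStoreSketch, DocStore,
SegmentInfoSketch, flush() and merge() are hypothetical names, the "doc
store" is just an in-memory list, and the contiguity check in merge() is
my own assumption about when the shared files can be reused as is.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Lucene code): segments sharing one "doc store". */
class SharedDocStoreSketch {

    /** Stands in for the stored fields / vectors files: one entry per doc. */
    static final class DocStore {
        final List<String> storedFields = new ArrayList<>();
    }

    /** Stands in for per-segment metadata; docStoreOffset is the new piece. */
    static final class SegmentInfoSketch {
        final String name;
        final DocStore docStore;
        final int docStoreOffset; // where this segment's docs start in the store
        final int docCount;

        SegmentInfoSketch(String name, DocStore docStore, int docStoreOffset, int docCount) {
            this.name = name;
            this.docStore = docStore;
            this.docStoreOffset = docStoreOffset;
            this.docCount = docCount;
        }

        /** Reader side: map a segment-local doc ID to a position in the store. */
        String document(int localDocId) {
            if (localDocId < 0 || localDocId >= docCount) {
                throw new IndexOutOfBoundsException("doc " + localDocId + " not in " + name);
            }
            return docStore.storedFields.get(docStoreOffset + localDocId);
        }
    }

    /** Writer side: append docs to the shared store, remember the start offset. */
    static SegmentInfoSketch flush(String name, DocStore shared, List<String> docs) {
        int offset = shared.storedFields.size();
        shared.storedFields.addAll(docs);
        return new SegmentInfoSketch(name, shared, offset, docs.size());
    }

    /** Assumption: sharing is reusable only if segments are adjacent in one store. */
    static boolean sharesOneContiguousStore(List<SegmentInfoSketch> segments) {
        SegmentInfoSketch first = segments.get(0);
        int expectedOffset = first.docStoreOffset;
        for (SegmentInfoSketch s : segments) {
            if (s.docStore != first.docStore || s.docStoreOffset != expectedOffset) {
                return false;
            }
            expectedOffset += s.docCount;
        }
        return true;
    }

    /** Merge: skip the doc store copy when possible, else build a private copy. */
    static SegmentInfoSketch merge(String name, List<SegmentInfoSketch> segments) {
        if (segments.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        if (sharesOneContiguousStore(segments)) {
            int total = 0;
            for (SegmentInfoSketch s : segments) {
                total += s.docCount;
            }
            SegmentInfoSketch first = segments.get(0);
            return new SegmentInfoSketch(name, first.docStore, first.docStoreOffset, total);
        }
        DocStore privateStore = new DocStore();
        for (SegmentInfoSketch s : segments) {
            for (int i = 0; i < s.docCount; i++) {
                privateStore.storedFields.add(s.document(i));
            }
        }
        return new SegmentInfoSketch(name, privateStore, 0, privateStore.storedFields.size());
    }

    public static void main(String[] args) {
        DocStore shared = new DocStore();
        SegmentInfoSketch a = flush("_0", shared, List.of("doc A", "doc B"));
        SegmentInfoSketch b = flush("_1", shared, List.of("doc C"));
        System.out.println(b.name + " starts at offset " + b.docStoreOffset);
        System.out.println("local doc 0 of " + b.name + " = " + b.document(0));
        SegmentInfoSketch merged = merge("_2", List.of(a, b));
        System.out.println("merged reuses shared store: " + (merged.docStore == shared));
    }
}
```

The point of the sketch is only the addressing: each segment keeps
segment-local doc IDs, and the offset recorded at flush time is what lets
several segments read from the same set of files.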
I still need to update the fileformats doc with this change.
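For the two thread safety fixes listed in the summary above, the
underlying idea can be sketched as follows. This is a simplified
illustration, not the patch's DocumentsWriter: FlushTriggerSketch and its
methods are hypothetical, and it holds one lock for the whole add and
flush, which the real code presumably avoids. What it shows is that
buffering a doc and deciding whether it is time to flush happen under the
same lock, so no second thread can add a doc between the check and the
flush.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Lucene code): flush-by-doc-count decided under one lock. */
class FlushTriggerSketch {
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();
    private final List<Integer> flushedSizes = new ArrayList<>();

    FlushTriggerSketch(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffer the doc and make the flush decision while still holding the lock. */
    synchronized void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() == maxBufferedDocs) {
            flushLocked();
        }
    }

    /** Explicit flush; safe to call while other threads keep adding docs. */
    synchronized void flush() {
        if (!buffer.isEmpty()) {
            flushLocked();
        }
    }

    /** Caller must hold the lock. Records the segment size instead of writing files. */
    private void flushLocked() {
        flushedSizes.add(buffer.size());
        buffer.clear();
    }

    /** Returns a copy of the doc count of every flush so far. */
    synchronized List<Integer> flushedSizes() {
        return new ArrayList<>(flushedSizes);
    }

    public static void main(String[] args) throws InterruptedException {
        FlushTriggerSketch writer = new FlushTriggerSketch(10);
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 250; i++) {
                    writer.addDocument("doc");
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        writer.flush();
        // With the check inside the lock, no flush can exceed maxBufferedDocs.
        System.out.println(writer.flushedSizes());
    }
}
```

If the size check were done outside the lock (for example by the caller,
after addDocument returns), two threads could both add a doc before
either one flushes, which is the "more docs than requested" symptom.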
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges. Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.