[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-843:
--------------------------------------
Attachment: LUCENE-843.take3.patch
Another rev of the patch:
* Got thread concurrency working: removed "synchronized" from the entire
call to MultiDocWriter.addDocument and instead synchronize only two
quick steps (init/finish) of addDocument, leaving the real work
(processDocument) unsynchronized.
* Fixed a bug that was failing to delete temp files from the index.
* Reduced memory usage of Posting by inlining positions, start
offset, end offset into a single int array.
* Enabled IndexLineFiles.java (tool I use for local benchmarking) to
run multiple threads
* Other small optimizations
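To illustrate the locking change in the first item above, here is a minimal
sketch of the pattern: two short synchronized steps around an unsynchronized
work step. This is not code from the patch. The class SketchDocWriter, its
fields, and the String-based "document" are hypothetical stand-ins. Only the
step names (init, processDocument, finish) follow the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the patch's MultiDocWriter.
class SketchDocWriter {
    private int nextDocId = 0;                              // shared, guarded by "this"
    private final List<String> finished = new ArrayList<>(); // shared, guarded by "this"

    // Quick synchronized step: reserve a doc ID for this call.
    private synchronized int initDocument() {
        return nextDocId++;
    }

    // The expensive step touches only per-call state, so it takes no lock
    // and many threads can run it at the same time.
    private String processDocument(int docId, String text) {
        return docId + ":" + text.toLowerCase();
    }

    // Quick synchronized step: publish the result into shared state.
    private synchronized void finishDocument(String processed) {
        finished.add(processed);
    }

    public void addDocument(String text) {
        int docId = initDocument();
        String processed = processDocument(docId, text);
        finishDocument(processed);
    }

    public synchronized int numDocs() {
        return finished.size();
    }
}
```

The trade-off in this shape is that documents may finish in a different
order than they started, so anything order-sensitive has to be handled in
the finish step.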
BTW, one nice side effect of this patch is that it cleans up the
mergeSegments method of IndexWriter: the "flush" of added docs &
deletions is now separated out, because it's no longer a merge, from
the "true" mergeSegments, whose only purpose is then to merge disk
segments. Previously mergeSegments was getting rather confusing with
the different cases/combinations of added docs or not, deleted docs or
not, any merges or not.
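On the Posting memory reduction listed earlier: the sketch below shows one
way positions, start offsets and end offsets can share a single int array
instead of three. The layout (three consecutive ints per occurrence), the
growth policy and the class name PackedPosting are assumptions made for
illustration. The patch's actual layout may differ.

```java
import java.util.Arrays;

// Hypothetical sketch; assumes 3 ints per occurrence: position, start, end.
class PackedPosting {
    private static final int STRIDE = 3;
    private int[] data = new int[STRIDE * 4]; // room for 4 occurrences to start
    private int count = 0;                    // number of occurrences stored

    void add(int position, int startOffset, int endOffset) {
        int base = count * STRIDE;
        if (base + STRIDE > data.length) {
            // Grow by doubling; one array to resize instead of three.
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[base] = position;
        data[base + 1] = startOffset;
        data[base + 2] = endOffset;
        count++;
    }

    int position(int i)    { return data[i * STRIDE]; }
    int startOffset(int i) { return data[i * STRIDE + 1]; }
    int endOffset(int i)   { return data[i * STRIDE + 2]; }
    int size()             { return count; }
}
```

Compared with three parallel arrays, this keeps one object header and one
length field per posting, and the three values for an occurrence sit next
to each other in memory.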
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges. Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
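As a rough illustration of flushing by RAM usage rather than by document
count, here is a minimal sketch. It is not IndexWriter: the class
RamBufferedWriter is hypothetical, the unit of setRAMBufferSize is assumed
to be bytes (the description above does not say), and the memory estimate
(2 bytes per char) is a deliberately crude stand-in for real accounting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RAM-triggered flush policy.
class RamBufferedWriter {
    private long ramBufferSizeBytes = 16L * 1024 * 1024; // assumed default
    private long bytesUsed = 0;
    private int flushCount = 0;
    private final List<String> buffer = new ArrayList<>();

    // Assumed to take bytes; the unit is not stated in the issue text.
    void setRAMBufferSize(long bytes) {
        if (bytes <= 0) throw new IllegalArgumentException("bytes must be > 0");
        this.ramBufferSizeBytes = bytes;
    }

    void addDocument(String text) {
        buffer.add(text);
        bytesUsed += 2L * text.length();   // crude estimate: 2 bytes per char
        if (bytesUsed >= ramBufferSizeBytes) {
            flush();                       // triggered by RAM used, not doc count
        }
    }

    void flush() {
        if (buffer.isEmpty()) return;
        // A real writer would write a segment here; the sketch only resets state.
        buffer.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount()   { return flushCount; }
    int bufferedDocs() { return buffer.size(); }
}
```

With this policy, many small documents and a few large ones both flush at
about the same memory footprint, which a fixed document count cannot
guarantee.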