[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492748
 ] 

Michael McCandless commented on LUCENE-843:
-------------------------------------------

> How does this work with pending deletes?
> I assume that if autocommit is false, then you need to wait until the end 
> when you get a real lucene segment to delete the pending terms?

Yes, all of this sits "below" the pending deletes layer, since this
change writes a single segment either when RAM is full
(autoCommit=true) or when the writer is closed (autoCommit=false).
Then the deletes get applied as usual (I haven't changed that part).
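
To make the two flush triggers concrete, here is a rough sketch of the
decision as described above. It is not code from the patch: the class
and method names (FlushTriggerSketch, addDocument, close) are
hypothetical, and the byte accounting is a stand-in for whatever the
writer really measures. It also assumes that close() flushes any
leftover buffered documents in both modes.

```java
// Hypothetical model of "when does a real segment get written?".
// Not Lucene code; names and accounting are assumptions for illustration.
public class FlushTriggerSketch {
    private final boolean autoCommit;
    private final long ramBudgetBytes;
    private long bufferedBytes = 0;
    private int partialFlushes = 0;   // on-disk buffers that are not yet a real segment
    private int segmentsWritten = 0;

    public FlushTriggerSketch(boolean autoCommit, long ramBudgetBytes) {
        this.autoCommit = autoCommit;
        this.ramBudgetBytes = ramBudgetBytes;
    }

    public void addDocument(long docBytes) {
        bufferedBytes += docBytes;
        if (bufferedBytes >= ramBudgetBytes) {
            if (autoCommit) {
                // RAM is full: write one real segment right away.
                writeSegment();
            } else {
                // RAM is full: spill buffers to disk, but defer the real segment.
                partialFlushes++;
                bufferedBytes = 0;
            }
        }
    }

    public void close() {
        // Assumption: anything still pending becomes one segment on close.
        if (bufferedBytes > 0 || partialFlushes > 0) {
            writeSegment();
        }
    }

    private void writeSegment() {
        segmentsWritten++;
        bufferedBytes = 0;
        partialFlushes = 0;
        // Pending deletes would be applied here, after the segment exists.
    }

    public int segmentsWritten() {
        return segmentsWritten;
    }
}
```

In this model the pending deletes never see the partial flushes; they
only ever run against a finished segment, which is why that layer does
not need to change.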

> Also, how has the merge policy (or index invariants) of lucene segments 
> changed?
> If autocommit is off, then you wait until the end to create a big lucene 
> segment.  This new segment may be much larger than segments to its "left".  
> I suppose the idea of merging rightmost segments should just be dropped in 
> favor of merging the smallest adjacent segments?  Sorry if this has already 
> been covered... as I said, I'm trying to follow along at a high level.

This has not been covered yet, and as usual these are excellent
questions, Yonik!

I haven't yet changed anything about merge policy, but you're right
the current invariants won't hold anymore.  In fact they already don't
hold if you "flush by RAM" now (APIs are exposed in 2.1 to let you do
this).  So we need to do something.

I like your idea to relax merge policy (& invariants) to allow
"merging of any adjacent segments" (not just rightmost ones) and then
make the policy merge the smallest ones / most similarly sized ones,
measuring size by net # bytes in the segment.  This would preserve the
"docID monotonicity invariance".
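
As a rough illustration of that idea (not the actual merge policy, and
not code from any patch), the sketch below picks the run of adjacent
segments with the smallest total byte size and merges it in place.
The class AdjacentMergeSketch, its method names, and the fixed merge
width are all assumptions made for the example; the merged size is
approximated as the sum of the inputs. It only minimizes total size
and does not try to match similarly sized segments.

```java
import java.util.Arrays;

// Hypothetical sketch: segments are represented only by their byte sizes,
// listed in docID order (index 0 is the "leftmost" segment).
public class AdjacentMergeSketch {

    /**
     * Returns the start index of the window of `width` adjacent segments
     * with the smallest total size, or -1 if no such window exists.
     * Ties go to the leftmost window.
     */
    public static int pickMergeStart(long[] segmentBytes, int width) {
        if (width < 2 || segmentBytes.length < width) {
            return -1;
        }
        long windowSum = 0;
        for (int i = 0; i < width; i++) {
            windowSum += segmentBytes[i];
        }
        long bestSum = windowSum;
        int bestStart = 0;
        // Slide the window one segment to the right at a time.
        for (int start = 1; start + width <= segmentBytes.length; start++) {
            windowSum += segmentBytes[start + width - 1] - segmentBytes[start - 1];
            if (windowSum < bestSum) {
                bestSum = windowSum;
                bestStart = start;
            }
        }
        return bestStart;
    }

    /**
     * Replaces the `width` segments starting at `start` with one merged
     * segment at the same position, so the relative order of all other
     * segments (and therefore docID order) is unchanged.
     */
    public static long[] mergeAt(long[] segmentBytes, int start, int width) {
        int n = segmentBytes.length;
        long[] merged = new long[n - width + 1];
        System.arraycopy(segmentBytes, 0, merged, 0, start);
        long sum = 0;
        for (int i = start; i < start + width; i++) {
            sum += segmentBytes[i];
        }
        merged[start] = sum;
        System.arraycopy(segmentBytes, start + width, merged, start + 1, n - start - width);
        return merged;
    }

    public static void main(String[] args) {
        long[] sizes = {100, 10, 12, 900, 15};
        int start = pickMergeStart(sizes, 2);
        System.out.println(Arrays.toString(mergeAt(sizes, start, 2)));
        // prints [100, 22, 900, 15]
    }
}
```

Because only neighbours are ever merged and the result stays at the
same position, a large segment flushed late can sit to the right of
small ones without breaking docID ordering.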

If we take that approach then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
