[jira] Commented: (LUCENE-2047) IndexWriter should immediately resolve deleted docs to docID in near-real-time mode

Michael McCandless (JIRA) Tue, 17 Nov 2009 12:11:04 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779093#action_12779093
 ]


Michael McCandless commented on LUCENE-2047:
--------------------------------------------


{quote}
bq. 1: Analyzing hits an exception for a doc, its doc id has already been 
allocated so we mark it for deletion later (on flush?) in BufferedDeletes.

So there's only one use case right now, which is only when
analyzing an individual doc fails. The update doc adds the term
to the BufferedDeletes for later application.
{quote}

No, it adds the docID for later application.  This is the one case
(entirely internal to IW) where we delete by docID in the writer.

{quote}
I think we can resolve the update doc term in the foreground. I'm
wondering if we need a different doc id queue for these? I get
the hunch yes, because the other doc ids need to be applied even
on IO exception, whereas update doc id will not be applied?
{quote}

I think we can use the same queue -- whether they are applied or not
follows exactly the same logic: a successful flush moves all
deletesInRAM over to deletesFlushed, and an aborting exception
clears the deletesInRAM.
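
To make the flush/abort rule concrete, here is a minimal sketch. It is
a hypothetical model, not IndexWriter's actual code: the class
DeleteQueueSketch and its method names are invented for illustration,
and only the names deletesInRAM and deletesFlushed come from the
discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two delete buffers; not Lucene's real classes.
class DeleteQueueSketch {
    // DocIDs deleted while the current RAM buffer is being built.
    private final List<Integer> deletesInRAM = new ArrayList<>();
    // DocIDs whose RAM buffer was flushed successfully; these still get applied.
    private final List<Integer> deletesFlushed = new ArrayList<>();

    void bufferDeleteDocID(int docID) {
        deletesInRAM.add(docID);
    }

    // Successful flush: everything in deletesInRAM moves over to deletesFlushed.
    void onFlushSuccess() {
        deletesFlushed.addAll(deletesInRAM);
        deletesInRAM.clear();
    }

    // Aborting exception: the RAM buffer is discarded, and so are its deletes.
    void onAbort() {
        deletesInRAM.clear();
    }

    // Defensive copies so callers cannot mutate the internal queues.
    List<Integer> inRAM() { return new ArrayList<>(deletesInRAM); }
    List<Integer> flushed() { return new ArrayList<>(deletesFlushed); }
}
```

Because both kinds of docID go through the same two transitions, one
queue is enough in this model.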

{quote}
2: RAM Buffer writing hits an exception, we've had
updates which marked deletes in current segments, however they
haven't been applied yet because they're stored in
BufferedDeletes docids. They're applied on successful flush.

In essence we need to implement number 2?
{quote}

I'm confused -- #2 is already what's implemented in IW, today.

The changes on the table now are to:

  * Immediately resolve deleted (updated) terms/queries -> docIDs

  * Change how we enqueue docIDs to be per-SR, instead.

  * But: you still must buffer Term/Query for the current RAM buffer,
    and on flushing it to a real segment, resolve them to docIDs.

Otherwise we don't need to change what's done today (ie, keep the
deletesInRAM vs deletesFlushed)?
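
A rough sketch of the three changes listed above, under stated
assumptions: SegmentSketch stands in for a pooled SegmentReader,
WriterSketch for the writer, and each document carries a single
term. All class and method names here are hypothetical. Recording the
RAM doc count next to each buffered term (so that only earlier docs
are affected) is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a pooled SegmentReader: one term per doc.
class SegmentSketch {
    final List<String> docTerms;                          // docTerms.get(docID) is that doc's term
    final Set<Integer> pendingDeletes = new TreeSet<>();  // per-segment docID queue

    SegmentSketch(List<String> docTerms) { this.docTerms = docTerms; }

    // Resolve a term to docIDs in this segment, considering only docIDs < upto.
    void deleteByTerm(String term, int upto) {
        int limit = Math.min(upto, docTerms.size());
        for (int docID = 0; docID < limit; docID++) {
            if (docTerms.get(docID).equals(term)) pendingDeletes.add(docID);
        }
    }
}

class WriterSketch {
    final List<SegmentSketch> segments = new ArrayList<>();  // already-flushed segments
    private List<String> ramDocs = new ArrayList<>();         // current RAM buffer
    // Term -> number of RAM docs at the time of the delete.
    private Map<String, Integer> bufferedTerms = new HashMap<>();

    void addDocument(String term) { ramDocs.add(term); }

    void deleteDocuments(String term) {
        // 1) Resolve immediately against every flushed segment.
        for (SegmentSketch s : segments) s.deleteByTerm(term, s.docTerms.size());
        // 2) Still buffer the term for the RAM buffer, which has no reader yet.
        bufferedTerms.put(term, ramDocs.size());
    }

    // Flush: the RAM buffer becomes a segment, then buffered terms resolve against it.
    SegmentSketch flush() {
        SegmentSketch flushed = new SegmentSketch(ramDocs);
        for (Map.Entry<String, Integer> e : bufferedTerms.entrySet()) {
            flushed.deleteByTerm(e.getKey(), e.getValue());
        }
        segments.add(flushed);
        ramDocs = new ArrayList<>();
        bufferedTerms = new HashMap<>();
        return flushed;
    }
}
```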

bq. What is an example of a non-aborting exception?

Anything hit during analysis (ie, TokenStream.incrementToken()), or
anywhere except DocInverterPerField in the indexing chain (eg if we
hit something when writing the stored fields, I don't think it'll
abort).

{quote}
I'm browsing through the applyDeletes call path, I'm tempted to
rework how we're doing this.
{quote}

I think this would be a good improvement -- it would mean we don't
need to ever remapDeletes, right?

The thing is... this has to work in non-NRT mode, too.  Somehow, you
have to buffer such a deleted docID against the next segment to be
flushed (ie the current RAM buffer).  And once it's flushed, we move
the docID from deletesInRAM (stored per-SR) to the SR's actual deleted
docs BV.

We would still keep the deletes partitioned, into those done during
the current RAM segment vs those successfully flushed, right?
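
One possible shape for that per-SR partition, as a sketch only:
java.util.BitSet stands in for the deleted docs BV, and the class
SegmentDeletesSketch and its methods are hypothetical names, not
anything in the patch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical per-segment delete state; not SegmentReader's real fields.
class SegmentDeletesSketch {
    private final BitSet deletedDocs = new BitSet();               // stands in for the deleted docs BV
    private final List<Integer> deletesInRAM = new ArrayList<>();  // pending until the next flush

    // Buffer a delete made while the current RAM buffer is still in flight.
    void bufferDelete(int docID) { deletesInRAM.add(docID); }

    // The RAM buffer flushed successfully: pending deletes become real deletes.
    void commitPending() {
        for (int docID : deletesInRAM) deletedDocs.set(docID);
        deletesInRAM.clear();
    }

    // The RAM buffer was aborted: pending deletes are dropped.
    void dropPending() { deletesInRAM.clear(); }

    boolean isDeleted(int docID) { return deletedDocs.get(docID); }
    int pendingCount() { return deletesInRAM.size(); }
}
```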

bq. I can't find the code that handles aborts.

It's DW's abort() method, and eg in DocInverterPerField we call
DW.setAborting on exception to note that this exception is an aborting
one.
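
For illustration, the difference between the two kinds of exception
could be modelled as below. This is a simplified, hypothetical sketch
rather than DocumentsWriter's real code: DocWriterSketch and
DocProcessor are invented names, and only setAborting and abort echo
the methods mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of aborting vs non-aborting exceptions.
class DocWriterSketch {
    // Stand-in for the indexing chain that processes one document.
    interface DocProcessor {
        void process(String doc, DocWriterSketch dw);
    }

    private final List<String> ramBuffer = new ArrayList<>();
    private final List<Integer> deletesInRAM = new ArrayList<>();
    private boolean aborting;

    // Called by the chain to flag that the in-flight exception must abort.
    void setAborting() { aborting = true; }

    // Discard everything buffered since the last flush.
    void abort() {
        ramBuffer.clear();
        deletesInRAM.clear();
        aborting = false;
    }

    void addDocument(String doc, DocProcessor chain) {
        int docID = ramBuffer.size();  // the docID is allocated up front
        ramBuffer.add(doc);
        try {
            chain.process(doc, this);
        } catch (RuntimeException e) {
            if (aborting) {
                abort();                  // aborting: lose the whole RAM buffer
            } else {
                deletesInRAM.add(docID);  // non-aborting: delete just this doc by docID
            }
            throw e;
        }
    }

    int bufferedDocs() { return ramBuffer.size(); }
    List<Integer> deletesInRAM() { return new ArrayList<>(deletesInRAM); }
}
```

In this model a failure during analysis leaves the other buffered docs
intact and only marks the failed doc's docID for deletion, while a
failure flagged through setAborting discards the whole buffer.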


> IndexWriter should immediately resolve deleted docs to docID in 
> near-real-time mode
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2047
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2047
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2047.patch, LUCENE-2047.patch
>
>
> Spinoff from LUCENE-1526.
> When deleteDocuments(Term) is called, we currently always buffer the
> Term and only later, when it's time to flush deletes, resolve to
> docIDs.  This is necessary because we don't in general hold
> SegmentReaders open.
> But, when IndexWriter is in NRT mode, we pool the readers, and so
> deleting in the foreground is possible.
> It's also beneficial, in that it can reduce the turnaround time when
> reopening a new NRT reader by taking this resolution off the reopen
> path.  And if multiple threads are used to do the deletion, then we
> gain concurrency, vs reopen which is not concurrent when flushing the
> deletes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
