[ https://issues.apache.org/jira/browse/LUCENE-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777912#action_12777912 ]
Michael McCandless commented on LUCENE-2047:
--------------------------------------------

{quote} I'm hoping we can do the deletes unsynced, which will make this patch a net performance gain because we're allowing multiple threads to delete concurrently (whereas today we're performing them synced at flush time, i.e. the current patch is merely shifting the term/query lookup cost from flush to deleteDocument). {quote}

Merely shifting the cost off the critical reopen path is already a good step forward :) But I agree, if we can also allow deletes to happen concurrently, that would be fabulous.

Currently the buffered deleted docIDs in DW are only due to exceptions hit while adding a document. E.g., say for a given doc there was an exception during analysis, after yielding 100 tokens. At this point DW's RAM buffer (postings) has already been changed to hold this docID, and we can't easily undo that. So we instead mark the docID as immediately deleted. These deleted docIDs can then remain in RAM for some time, and get remapped when merges commit.

bq. On a delete by query with many hits, I'm concerned about storing too many doc id Integers in BufferedDeletes.

I agree, that's a risk if we are buffering an Integer for each deleted docID, so we should avoid that. We really just need an efficient IntSet. Or, it's even possible a data structure that does no deduping (e.g. a simple growable int[]) would be fine, since apps would tend not to delete the same docID over and over. Since we flush when deletes use too much RAM, we'd be protected from silly apps... Or maybe start with a growable int[], but cut over to a BV once that's too large? That would protect the worst case of a single delete-by-query matching many docs in the index.

Whatever data structure it is, it should live under BufferedDeletes, so the exception logic that already discards these deletes on hitting an aborting exception will just continue to work.

bq.
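The growable int[] with a cutover to a bit vector could look roughly like this. This is a minimal sketch, not Lucene's actual BufferedDeletes: the class name `DeletedDocIDs` and the cutover threshold (switch once the int[] would cost more bits than a bit vector over the segment) are my assumptions, and `java.util.BitSet` stands in for Lucene's BitVector.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of a buffered-deletes docID container: a growable
// int[] (no deduping) that cuts over to a bit vector once the array would
// use more memory than one bit per doc in the segment.
class DeletedDocIDs {
  private int[] docIDs = new int[16];
  private int count;
  private BitSet bits;       // non-null after cutover
  private final int maxDoc;  // segment size; fixes the bit-vector cost

  DeletedDocIDs(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  void add(int docID) {
    if (bits != null) {
      bits.set(docID);
      return;
    }
    if (count == docIDs.length) {
      // Cut over once the int[] would outweigh a bit vector:
      // 32 bits per buffered entry vs 1 bit per doc in the segment.
      if ((long) docIDs.length * 32 >= maxDoc) {
        bits = new BitSet(maxDoc);
        for (int i = 0; i < count; i++) {
          bits.set(docIDs[i]);
        }
      } else {
        docIDs = Arrays.copyOf(docIDs, docIDs.length * 2);
      }
    }
    if (bits != null) {
      bits.set(docID);
    } else {
      docIDs[count++] = docID;
    }
  }

  boolean isDeleted(int docID) {
    if (bits != null) {
      return bits.get(docID);
    }
    for (int i = 0; i < count; i++) {
      if (docIDs[i] == docID) {
        return true;
      }
    }
    return false;
  }
}
```

Because the array does no deduping, a silly app deleting the same docID repeatedly just grows the buffer until the RAM-based flush (or the cutover) kicks in, which matches the "protected from silly apps" point above.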
Memory is less of a concern with the paged BV from the pending LUCENE-1526 patch

I'm not sure we even want to switch to a paged BV for NRT in general (I just added a comment to that effect), though I guess for just this case (buffering deletes in IW in NRT mode) it could make sense?

{quote} Couldn't we save off a per SR BV for the update doc rollback case, merging the special updated doc BV into the SR's deletes on successful flush, throwing them away on failure? {quote}

I like that approach. It means the normal case (no exception is hit) is very fast (no merging of the two BufferedDeletes on flush, like we do today, and no docID remapping required), and only on exception must we restore the previously saved deletes.

Another alternative would be to redefine the semantics of IW on hitting an error... Right now, if you hit an aborting exception (one that may have corrupted DW's internal RAM postings), you only lose those docs buffered in DW at the time. Any already flushed segments & new deletions within this IW session are preserved, so in theory if you closed your IW, those changes would persist. We could consider relaxing this, such that the entire session of IW is rolled back, to the last commit / when IW was opened, just like we now do with OOME.

{quote} A possible solution is, deleteDocument would synchronously add the delete query/term to a queue per SR and return. Asynchronously (i.e. in background threads) the deletes could be applied. {quote}

I'd prefer not to add further BG threads to IW, i.e., take the app's thread to perform the deletion. If the app wants concurrency for deletes, it can use multiple threads. I believe we only have to sync on the merge committing its deletes, right? So we should make a separate lock for that? And, on commit() we must also wait for all in-flight deletes to finish. Finally, when a new segment is flushed, we should resolve the buffered Term/Query deletions against it, during the flush?
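The per-SR save/restore idea could be sketched like so. All names here are hypothetical (this is not the SegmentReader API); `java.util.BitSet` again stands in for the real deletes bit vector, and the begin/commit/abort methods stand in for wherever updateDocument's flush success/failure is actually detected.

```java
import java.util.BitSet;

// Hypothetical sketch of the per-SegmentReader rollback idea: snapshot
// the reader's deletes before applying an update's deletions, keep the
// merged deletes on successful flush, restore the snapshot on failure.
class SegmentDeletes {
  private BitSet deletes = new BitSet();
  private BitSet saved;  // snapshot, non-null while an update is in flight

  // Begin an updateDocument: save off the current deletes.
  void beginUpdate() {
    saved = (BitSet) deletes.clone();
  }

  void delete(int docID) {
    deletes.set(docID);
  }

  // Flush succeeded: the update's deletes are already merged in,
  // so just throw away the snapshot.
  void commitUpdate() {
    saved = null;
  }

  // Flush failed: discard the update's deletes, restore the snapshot.
  void abortUpdate() {
    deletes = saved;
    saved = null;
  }

  boolean isDeleted(int docID) {
    return deletes.get(docID);
  }
}
```

This matches the comment above: the no-exception path does no merging and no remapping (commit just drops the snapshot); only the exception path pays to restore the previously saved deletes.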
{quote} Also, if the per SR delete queue were implemented, we could expose the callback, and allow users to delete by doc id, edit norms (and in the future, update field caches) for a particular IndexReader. We'd pass the reader via a callback that resembles IndexReaderWarmer, then deletes, norms updates, etc., could be performed like they can be today with a non-readonly IR. {quote}

I'd like to strongly decouple this discussion of extensibility from what we're doing in this issue. I think it's a good idea, but let's handle it separately -- maybe under LUCENE-2026 (refactoring IW, which'd pull out the reader pool). This issue should all be "under the hood" improvements.

> IndexWriter should immediately resolve deleted docs to docID in
> near-real-time mode
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2047
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2047
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2047.patch, LUCENE-2047.patch
>
>
> Spinoff from LUCENE-1526.
>
> When deleteDocuments(Term) is called, we currently always buffer the
> Term and only later, when it's time to flush deletes, resolve to
> docIDs. This is necessary because we don't in general hold
> SegmentReaders open.
>
> But, when IndexWriter is in NRT mode, we pool the readers, and so
> deleting in the foreground is possible.
>
> It's also beneficial, in that it can reduce the turnaround time when
> reopening a new NRT reader by taking this resolution off the reopen
> path. And if multiple threads are used to do the deletion, then we
> gain concurrency, vs reopen which is not concurrent when flushing the
> deletes.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.