[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005862#comment-14005862
 ] 

Robert Muir commented on LUCENE-5693:
-------------------------------------

{quote}
I guess what bothers me here is this apparent precedent that deleted docs are 
in fact required to be present everywhere in a segment. Yes, this is the case 
today, but I think it's an impl detail and should not be required, e.g. 
enforced by CheckIndex, tests asserting that it's the case.
{quote}

That's not the case. I am worried about *bugs, complexities, and slowdowns in 
Lucene itself*. I already mentioned my list of concerns, and I think they are 
all realistic.

To me, the patch is a bit naive.

Perhaps you forgot (or didn't think about) what Sorted/SortedSetDocValuesWriter 
would have to do if it wanted to filter out deleted documents? That would slow 
down flushing a lot, and flush speed presumably matters to the very people who 
are deleting documents in IndexWriter's ramBuffer. Filtering out deleted 
documents here would only *hurt* the user. Better to leave this to merge.
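
To make that concrete, here is a stripped-down model of the kind of per-segment buffering a sorted doc-values writer does; it is only a sketch, the class and method names are mine rather than Lucene's, and it exists just to show the extra passes that filtering deleted docs at flush would force:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Stripped-down model of what a sorted doc-values writer buffers per segment.
 * This is NOT Lucene's SortedDocValuesWriter; all names here are illustrative.
 */
public class SortedDvFlushSketch {

  private final Map<String, Integer> termToTempOrd = new HashMap<>(); // ords in insertion order
  private final List<String> termsByTempOrd = new ArrayList<>();
  private final List<Integer> perDocTempOrd = new ArrayList<>();      // one entry per buffered doc

  public void addValue(String value) {
    Integer ord = termToTempOrd.get(value);
    if (ord == null) {
      ord = termsByTempOrd.size();
      termToTempOrd.put(value, ord);
      termsByTempOrd.add(value);
    }
    perDocTempOrd.add(ord);
  }

  /** Normal flush: sort the terms once, remap per-doc ords, hand everything to the codec. */
  public void flush() {
    String[] sortedTerms = termsByTempOrd.toArray(new String[0]);
    Arrays.sort(sortedTerms);
    int[] tempOrdToFinalOrd = new int[sortedTerms.length];
    for (int finalOrd = 0; finalOrd < sortedTerms.length; finalOrd++) {
      tempOrdToFinalOrd[termToTempOrd.get(sortedTerms[finalOrd])] = finalOrd;
    }
    for (int docId = 0; docId < perDocTempOrd.size(); docId++) {
      int finalOrd = tempOrdToFinalOrd[perDocTempOrd.get(docId)];
      // write finalOrd for docId to the doc-values format...
    }
    // ...plus the sorted term dictionary.
  }

  /**
   * Hypothetical flush that filters deleted docs: on top of the normal sort it
   * needs an extra pass over every buffered doc to find which terms are still
   * referenced, a shrunken term dictionary, and a second ord remap -- the kind
   * of work a merge already does later, off the indexing hot path.
   */
  public void flushFiltered(BitSet liveDocs) {
    boolean[] termStillUsed = new boolean[termsByTempOrd.size()];
    for (int docId = 0; docId < perDocTempOrd.size(); docId++) {
      if (liveDocs.get(docId)) {
        termStillUsed[perDocTempOrd.get(docId)] = true;
      }
    }
    List<String> survivingTerms = new ArrayList<>();
    for (int tempOrd = 0; tempOrd < termStillUsed.length; tempOrd++) {
      if (termStillUsed[tempOrd]) {
        survivingTerms.add(termsByTempOrd.get(tempOrd));
      }
    }
    Collections.sort(survivingTerms);
    // ...then remap every live doc's ord against the shrunken dictionary and
    // write only the live docs.
  }
}
{code}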

And what about stored fields and term vectors? Why wouldn't you put a TODO 
there in your patch? Is it because it's "ok" to have the API and system 
inconsistency there, given that it would be slower to buffer them in RAM?
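
For contrast, here is a toy model of a writer that streams each document straight to disk as it is added, which is roughly how stored fields and term vectors behave. It is not Lucene's StoredFieldsWriter and the names are illustrative; the point is only that by flush time the "deleted" documents' bytes are already on disk, so skipping them would mean either buffering everything in RAM or rewriting the file, i.e. doing the merge's work early:

{code:java}
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.BitSet;

/**
 * Toy model of a writer that streams each document straight to disk as it is
 * added, roughly how stored fields and term vectors behave. Not Lucene code;
 * the names are illustrative.
 */
public class StreamingStoredFieldsSketch implements Closeable {

  private final DataOutputStream out;
  private int numDocs;

  public StreamingStoredFieldsSketch(File file) throws IOException {
    this.out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
  }

  /** Called per document at index time: the bytes hit the file right away. */
  public void writeDocument(String storedValue) throws IOException {
    out.writeInt(numDocs++);
    out.writeUTF(storedValue);
  }

  /**
   * By the time liveDocs is known at flush, the deleted docs are already on
   * disk. Skipping them here would mean either buffering every document in
   * RAM until flush, or rewriting the file -- the work a merge does anyway.
   */
  public void flush(BitSet liveDocs) throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}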

I don't like these implicit exceptions to the rule. If we want to intentionally 
make a mess, there needs to be hard justification for doing such a thing. All 
these little unproven optimizations, API inconsistencies, and exceptional cases 
add up. I think it would be better to only complicate things when it's a big 
win; otherwise the whole codebase will end up looking like IndexWriter.java.

All that being said, as I already stated on this issue, I am fine with 
filtering out the postings as an exception to the rule. I really don't like it 
one bit, but I can compromise on this piece if it really brings big benefits. 
Doing it for the rest of the codec API makes no sense at all.
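
For reference, that one concession amounts to something like the sketch below. It is illustrative only, not the actual FreqProxTermsWriter code: when buffered postings are replayed at flush, docIDs already cleared in liveDocs are simply skipped before they reach the postings format.

{code:java}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Minimal sketch of the one concession: when buffered postings are replayed at
 * flush, docIDs already cleared in liveDocs are skipped before they reach the
 * postings format. Illustrative only; not the actual FreqProxTermsWriter code.
 */
public class FilteredPostingsFlushSketch {

  /** Buffered postings: term -> docIDs that contained it, in docID order. */
  private final Map<String, List<Integer>> buffered = new TreeMap<>();

  public void addPosting(String term, int docId) {
    buffered.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
  }

  /** Replay the buffered postings at flush, dropping "born deleted" docs. */
  public void flush(BitSet liveDocs) {
    for (Map.Entry<String, List<Integer>> entry : buffered.entrySet()) {
      for (int docId : entry.getValue()) {
        if (!liveDocs.get(docId)) {
          continue; // this doc was deleted before the segment was ever flushed
        }
        // hand (entry.getKey(), docId) to the postings format as usual
      }
    }
  }
}
{code}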


> don't write deleted documents on flush
> --------------------------------------
>
>                 Key: LUCENE-5693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5693
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-5693.patch
>
>
> When we flush a new segment, sometimes some documents are "born deleted", 
> e.g. if the app did an IW.deleteDocuments that matched some not-yet-flushed 
> documents.
> We already compute the liveDocs on flush, but then we continue (wastefully) 
> to send those known-deleted documents to all Codec parts.
> I started to implement this on LUCENE-5675 but it was too controversial.
> Also, I expect the number of deleted docs will typically be 0, or small, so 
> not writing "born deleted" docs won't be much of a win for most apps. Still, 
> it seems silly to write them, consuming IO/CPU in the process, only to 
> consume more IO/CPU later for merging to re-delete them.
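
For anyone following along, here is a minimal, self-contained sketch of the scenario described above, written against a recent Lucene API (class names such as ByteBuffersDirectory differ from the 4.x line this issue targets):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

/**
 * Sketch of the "born deleted" scenario: docs 1 and 2 are deleted while still
 * in the RAM buffer, yet today they are still written into the new segment at
 * flush and only purged by a later merge.
 */
public class BornDeletedDemo {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

      for (int i = 0; i < 3; i++) {
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
        doc.add(new StringField("status", i == 0 ? "keep" : "drop", Field.Store.NO));
        writer.addDocument(doc);
      }

      // Delete docs 1 and 2 before any flush: they are "born deleted".
      writer.deleteDocuments(new Term("status", "drop"));
      writer.commit(); // flush: liveDocs already knows about the deletions

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        System.out.println("maxDoc=" + reader.maxDoc()
            + " numDocs=" + reader.numDocs()
            + " numDeletedDocs=" + reader.numDeletedDocs());
      }
    }
  }
}
{code}

maxDoc() stays at 3 after the commit because the two born-deleted documents were still written into the segment's files; only a later merge reclaims that space.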


