[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414871#comment-15414871 ]

Shai Erera commented on LUCENE-7344:
------------------------------------

bq. I don't understand most of what you're saying

To clarify the problem, both for you and in the interest of writing a detailed 
plan of the proposed solution: currently, when a DBQ is processed, it uses a 
LeafReader *without* the NDV updates, and therefore has no knowledge of the 
updated values. This is solved relatively easily in the patch I uploaded, by 
applying the DV updates before the DBQ is processed. That way the DBQ uses a 
LeafReader which is already aware of the updates, and all works well.
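
To make the problem concrete, here is a minimal sketch of the scenario through 
IndexWriter's public API. The {{id}} / {{price}} fields and 
{{queryMatchingPrice()}} are placeholders I made up for illustration; the 
latter stands for whatever query the caller builds to match documents by their 
DV value:

{code:java}
Document doc = new Document();
doc.add(new StringField("id", "1", Field.Store.NO));
doc.add(new NumericDocValuesField("price", 5L));
writer.addDocument(doc);
writer.commit();

// Buffer an NDV update: price 5 -> 10 (held in memory, not flushed yet).
writer.updateNumericDocValue(new Term("id", "1"), "price", 10L);

// A DBQ that should match the *updated* value. Today it resolves against a
// LeafReader that still sees price == 5, so the document is not deleted.
writer.deleteDocuments(queryMatchingPrice(10L));
{code}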

However, the update operations arrive at IndexWriter in a particular order, and 
in our case that can be a mix of DBQs and NDV updates. So if we apply *all* the 
DV updates before any of the DBQs, we'll get incorrect results: the DBQ either 
deletes a document it shouldn't (see the code example above, and also what your 
{{testDeleteFollowedByUpdateOfDeletedValue}} shows), or fails to delete a 
document that it should.

To properly solve this problem, we need to apply the DV updates and DBQs in the 
order they were received (as opposed to applying them in bulk, as the current 
code does). Meaning, if the order of operations is NDVU1, NDVU2, DBQ1, NDVU3, 
DBQ2, DBQ3, NDVU4 (sketched in code after this list), then we need to:
# Apply NDVU1 + NDVU2; this will cause a new LeafReader to be created
# Apply DBQ1; using the already updated LeafReader
# Apply NDVU3; another LeafReader will be created, now reflecting all 3 NDV 
updates
# Apply DBQ2 and DBQ3; using the updated LeafReader from above
# Apply NDVU4; this will cause another LeafReader to be created
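
To illustrate (the {{writer}}, field names and {{query1}}..{{query3}} below are 
made up, not from the patch), that sequence looks like this from the API side:

{code:java}
writer.updateNumericDocValue(new Term("id", "1"), "price", 1L); // NDVU1
writer.updateNumericDocValue(new Term("id", "2"), "price", 2L); // NDVU2
writer.deleteDocuments(query1); // DBQ1: must run against a reader reflecting NDVU1+NDVU2
writer.updateNumericDocValue(new Term("id", "3"), "price", 3L); // NDVU3
writer.deleteDocuments(query2); // DBQ2: must see NDVU1..NDVU3
writer.deleteDocuments(query3); // DBQ3: must see NDVU1..NDVU3
writer.updateNumericDocValue(new Term("id", "4"), "price", 4L); // NDVU4: visible only after the next flush/reopen
{code}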

The adverse effect in this case is that we cause 3 LeafReader reopens, each 
time (due to how NDV updates are currently implemented) writing the full DV 
field to a new stack. If you have many documents, that's going to be very 
expensive. And with a longer sequence of interleaved updates and deletes, it 
only gets worse.

And so here comes the optimization that Mike and I discussed above. Since the 
NDV updates are held in memory until they're applied, we can avoid flushing 
them to disk by creating a LeafReader which reads the original DV field plus 
the in-memory DV updates. Note though: not *all* DV updates, but only the ones 
that are relevant up to this point. So in the case above, that LeafReader will 
view only NDVU1 and NDVU2, and later it will be updated to view NDVU3 as well.
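
To be clear about what I mean by that "view", here is a very rough sketch. This 
is *not* the proposed implementation; it uses the random-access 
{{NumericDocValues.get(int)}} API and a plain {{Map}} in place of the real 
in-memory update structures, purely for illustration:

{code:java}
import java.util.Map;
import org.apache.lucene.index.NumericDocValues;

/** Illustration only: on-disk values overlaid with the updates buffered so far. */
class UpdatedNumericDocValues extends NumericDocValues {
  private final NumericDocValues onDisk;           // values as written in the segment
  private final Map<Integer, Long> pendingUpdates; // only the updates received up to this point

  UpdatedNumericDocValues(NumericDocValues onDisk, Map<Integer, Long> pendingUpdates) {
    this.onDisk = onDisk;
    this.pendingUpdates = pendingUpdates;
  }

  @Override
  public long get(int docID) {
    Long updated = pendingUpdates.get(docID);
    return updated != null ? updated : onDisk.get(docID);
  }
}
{code}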

This is purely an optimization step and has nothing to do with correctness (of 
course, that optimization is tricky and needs to be implemented correctly!). 
Therefore my plan of attack in this case is:

# Have enough tests that try different cases before any of this is implemented. 
For example, Mike proposed above having the LeafReader + DV field "view" use 
docIdUpto. I need to check the code again, but I want to make sure that if 
NDVU2, NDVU3 and NDVU4 (with the interleaved DBQs) all affect the *same* 
document, everything still works (see the sketch after this list).
# Implement the less-efficient approach, i.e. flush the DV updates to disk 
before each DBQ is processed. This ensures that we have a proper solution 
implemented, and we leave the optimization to a later step (either literally a 
later commit, or just a different patch or whatever). I think this is 
complicated enough to start with.
# Improve the solution to avoid flushing DV updates between the DBQs, as 
proposed above.
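
For step 1, the kind of same-document case I have in mind looks roughly like 
this (again, the field names and {{queryMatchingPrice()}} are placeholders):

{code:java}
// All operations target the same document; each DBQ must resolve against the
// value in effect at the moment it was issued.
writer.updateNumericDocValue(new Term("id", "1"), "price", 2L); // NDVU2
writer.deleteDocuments(queryMatchingPrice(1L)); // must NOT match, value is already 2
writer.updateNumericDocValue(new Term("id", "1"), "price", 3L); // NDVU3
writer.deleteDocuments(queryMatchingPrice(2L)); // must NOT match, value is already 3
writer.updateNumericDocValue(new Term("id", "1"), "price", 4L); // NDVU4
// After a flush/commit, doc "1" should still be live with price == 4.
{code}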

bq. testBiasedMixOfRandomUpdates

I briefly reviewed the test, but not thoroughly (I intend to). However, notice 
that committing (hard or soft; i.e. commit or NRT reopen) completely avoids the 
problem, because a commit/NRT reopen already flushes the DV updates. So if 
that's what this test does, I don't think it's going to expose the problem. 
Perhaps with the explanation I wrote above, you can revisit the test and make 
it fail, though.

> Deletion by query of uncommitted docs not working with DV updates
> -----------------------------------------------------------------
>
>                 Key: LUCENE-7344
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7344
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ishan Chattopadhyaya
>         Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.


