[jira] [Commented] (LUCENE-5248) Improve the data structure used in ReaderAndLiveDocs to hold the updates

Michael McCandless (JIRA) Tue, 01 Oct 2013 05:05:43 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782853#comment-13782853
 ]


Michael McCandless commented on LUCENE-5248:
--------------------------------------------

bq. One thing to note though, is if we never buffer updates, it means we always 
write them to disk, even if they are not needed by a Reader.

That's a good point, and I guess it means we are vulnerable to an adversary 
here ... but that adversary isn't really SO bad, right?  It'd do one update, and 
index a bunch of docs, and then we flush and write a new NDV gen 
"unnecessarily".  But the adversary had to do quite a bit of "real" work 
(indexing), so I'm not sure it's really an adversary ...

We could try to be smarter, and carry the buffer if it's "smallish".  This is 
how deletes work: if we are flushing because of too-much-RAM, but deletes are 
using less than 1/2 of the RAM buffer, we just carry them.  We only apply them 
once they are using >= 1/2.
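
To make the threshold concrete, here is a minimal sketch of that "carry if 
smallish" rule. The class and method names are hypothetical and this is not 
the actual IndexWriter flush code; it only encodes the 1/2-of-RAM-buffer test 
described above, which could in principle be reused for buffered updates.

```java
/** Hypothetical sketch of the "carry if smallish" policy; not Lucene's actual code. */
final class CarryPolicy {
    private CarryPolicy() {}

    /**
     * Returns true when the buffered deletes (or updates) should be applied now,
     * i.e. when they use at least half of the RAM buffer. Below that threshold
     * they are simply carried over to the next flush.
     */
    static boolean shouldApply(long bufferedBytes, long ramBufferBytes) {
        // Compare against a double so an odd buffer size is not rounded down.
        return bufferedBytes >= ramBufferBytes / 2.0;
    }
}
```

With a 16 MB RAM buffer, 7 MB of buffered deletes would be carried, while 8 MB 
or more would be applied.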

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5248.patch
>
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, 
> in that order, and termA affects doc=100 and termB doc=2, then the updates 
> are applied in that order, meaning we cannot rely on updates coming in order.
> # Same document may be updated multiple times, either by same term (e.g. 
> several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate on the docs in-order and for each one check if 
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, therefore the 
> structure needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's 
> better if we know all the fields in which it was updated. The updates are 
> applied to the merged ReaderAndLiveDocs (where they are stored in the first 
> structure mentioned above).
> Comments with proposals will follow next.
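
For readers following the description above, the semantics of the two 
structures can be sketched roughly as follows. This is an illustration only: 
the class and method names are hypothetical, and the boxed maps used here do 
not meet the memory-efficiency requirement, which is the reason the issue 
exists. It shows un-ordered writes, last-update-wins, the in-order read that 
falls back to the current value, and the per-document view.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical holder for numeric doc-values updates; names are illustrative, not Lucene's. */
final class FieldUpdatesSketch {
    // field -> (docID -> latest value); the TreeMap keeps docIDs sorted.
    private final Map<String, TreeMap<Integer, Long>> byField = new HashMap<>();

    /** Records an update. Docs may arrive in any order; the last write for a doc wins. */
    void update(String field, int doc, long value) {
        byField.computeIfAbsent(field, f -> new TreeMap<>()).put(doc, value);
    }

    /** Walks docs in order, taking the updated value if present, else the current one. */
    long[] rewrite(String field, long[] currentValues) {
        Map<Integer, Long> updates = byField.getOrDefault(field, new TreeMap<>());
        long[] out = new long[currentValues.length];
        for (int doc = 0; doc < currentValues.length; doc++) {
            Long updated = updates.get(doc);
            out[doc] = updated != null ? updated : currentValues[doc];
        }
        return out;
    }

    /** Inverts to docID -> (field -> value), the shape of the second structure. */
    TreeMap<Integer, Map<String, Long>> byDoc() {
        TreeMap<Integer, Map<String, Long>> result = new TreeMap<>();
        for (Map.Entry<String, TreeMap<Integer, Long>> fieldEntry : byField.entrySet()) {
            for (Map.Entry<Integer, Long> docEntry : fieldEntry.getValue().entrySet()) {
                result.computeIfAbsent(docEntry.getKey(), d -> new HashMap<>())
                      .put(fieldEntry.getKey(), docEntry.getValue());
            }
        }
        return result;
    }
}
```

The sorted per-field map makes both the doc-order read and the inversion 
straightforward, at the cost of one boxed entry per updated document.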



--
This message was sent by Atlassian JIRA
(v6.1#6144)
