[jira] [Commented] (LUCENE-5248) Improve the data structure used in ReaderAndLiveDocs to hold the updates

Shai Erera (JIRA) Mon, 30 Sep 2013 09:08:57 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781966#comment-13781966
 ]

Shai Erera commented on LUCENE-5248:
------------------------------------

I thought about it some more ... perhaps there's a way to keep the updates 
without holding (almost) any RAM. Today, the code follows the delete path, by 
resolving the delete terms to the docIDs they affect. With deletes it's easy, 
there's a low RAM overhead.

I don't see a reason why we need to resolve the updates when we register them 
with RALD (but perhaps I'm overlooking something) as they aren't used 
in-memory. If a Reader needs to see them, we flush them to disk, unlike 
liveDocs which are shared in-memory and not flushed to disk. So perhaps we 
could keep in RALD a Map<String,NumericUpdate[]>, a mapping from a field to all 
numeric updates. When applying them in writeLiveDocs, we will manage multiple 
DocsEnums (one per NumericUpdate.term) and iterate them in doc order, making 
sure to apply the most recent update to the document at the current position. 
So if termA affects docs 1,3,6 and termB 2,3,5,6, we iterate on both and 
position termA on 1 and termB on 2. Since they don't match, we return termA's 
update value for doc1. When both are positioned on doc3, we apply the update 
of termB (as it came last). doc4 is matched by neither term, so it is treated 
as not updated, and so forth.
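
To make the iteration above concrete, here is a rough sketch in plain Java. It 
is not Lucene code: sorted int[] arrays stand in for the DocsEnums, and the 
names UpdateMergeSketch, TermUpdate, Cursor and resolve are made up for 
illustration. It assumes every docID is smaller than maxDoc and that the list 
order is the order in which the updates arrived.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class UpdateMergeSketch {

    /** Hypothetical stand-in for one NumericUpdate: the docIDs its term matches and the new value. */
    static final class TermUpdate {
        final long value;
        final int[] docs; // ascending docIDs, as a postings iterator would return them

        TermUpdate(long value, int... docs) {
            this.value = value;
            this.docs = docs;
        }
    }

    /** Cursor over one update's docIDs; seq records the arrival order of the update. */
    private static final class Cursor {
        final TermUpdate update;
        final int seq;
        int pos = 0;

        Cursor(TermUpdate update, int seq) {
            this.update = update;
            this.seq = seq;
        }

        int doc() { return update.docs[pos]; }

        boolean advance() { return ++pos < update.docs.length; }
    }

    /** Returns one slot per doc: the winning update value, or null if the doc was not updated. */
    static Long[] resolve(List<TermUpdate> updates, int maxDoc) {
        Long[] result = new Long[maxDoc];
        // Cursors ordered by the doc they are currently positioned on.
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int i = 0; i < updates.size(); i++) {
            if (updates.get(i).docs.length > 0) {
                queue.add(new Cursor(updates.get(i), i));
            }
        }
        while (!queue.isEmpty()) {
            int doc = queue.peek().doc();
            Cursor winner = null;
            List<Cursor> atDoc = new ArrayList<>();
            // Collect every cursor positioned on this doc; the latest update wins.
            while (!queue.isEmpty() && queue.peek().doc() == doc) {
                Cursor c = queue.poll();
                if (winner == null || c.seq > winner.seq) {
                    winner = c;
                }
                atDoc.add(c);
            }
            result[doc] = winner.update.value;
            // Move the consumed cursors forward and re-queue those with docs left.
            for (Cursor c : atDoc) {
                if (c.advance()) {
                    queue.add(c);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TermUpdate termA = new TermUpdate(10L, 1, 3, 6);    // arrived first
        TermUpdate termB = new TermUpdate(20L, 2, 3, 5, 6); // arrived last
        System.out.println(Arrays.toString(resolve(Arrays.asList(termA, termB), 7)));
        // prints [null, 10, 20, 20, null, 20, 20]
    }
}
```

Only the cursors are held in memory, one per update, so RAM grows with the 
number of updates rather than with the number of affected documents. A real 
version would pull the docIDs from the segment's postings instead of arrays.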

This is definitely hairy, but consumes much less RAM. Also, in terms of 
performance I don't think that we lose anything, because it's not like we will 
resolve the same updates multiple times (once per-segment, but that's done 
anyway?). It's just hairy code, but I'm not sure it's much hairier than the 
changes that commitMergedDeletes would require under the proposal detailed 
above. For commitMergedDeletes, we won't need to resolve any updates, just 
record them in the merged RALD; they will be resolved (using the merged 
DocsEnum) when it's time to writeLiveDocs that segment?

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, 
> in that order, and termA affects doc=100 and termB doc=2, then the updates 
> are applied in that order, meaning we cannot rely on updates coming in order.
> # Same document may be updated multiple times, either by same term (e.g. 
> several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate on the docs in-order and for each one check if 
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, therefore need to be 
> efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's 
> better if we know all the fields in which it was updated. The updates are 
> applied to the merged ReaderAndLiveDocs (where they are stored in the first 
> structure mentioned above).
> Comments with proposals will follow next.

--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
