[jira] [Updated] (LUCENE-5248) Improve the data structure used in ReaderAndLiveDocs to hold the updates

Shai Erera (JIRA) Tue, 01 Oct 2013 23:06:34 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-5248:
-------------------------------

    Attachment: LUCENE-5248.patch

The patch improves the test and also hacks in a solution -- in 
BufferedDeleteStream.applyDeletes I call RALD.writeLiveDocs to flush all the 
updates. A few comments:

* First, this is just a hack - I need to separate writeLiveDocs from 
writeFieldUpdates, so that flushing the updates doesn't always write liveDocs 
as well.

* Still need to improve the data structure used to hold the resolved updates. I 
think I will create a FieldUpdates interface/abstract class so we can 
experiment with different representations, including optimizations for e.g. 
only one NumericUpdate (where we don't need to materialize anything into memory 
since it's only one Term).

* The test OOMs less often, but it still does sometimes (depending on the test 
params). The reason is that even if we resolve the updates and flush them to 
disk, as long as we resolve them in memory, an innocent-looking update may 
affect millions of documents and consume a large amount of RAM, no matter the 
representation (a more efficient structure means more docs can be updated, but 
still...). I don't think there's a way around it -- if you do very large 
updates (I think that's the edge case, not the common one), you should 
allocate enough RAM. It's just like how IW's RAM buffer doesn't help when you 
index a single very large document.
** With the FieldUpdates abstraction, we could optimize for various cases, 
including a few very large updates, where we can iterate on their DocsEnum in 
parallel and consume less RAM than if we materialize them all into memory.

Next I will work on the FieldUpdates abstraction, separate writeLiveDocs from 
writeFieldUpdates and implement the structure mentioned above (parallel 
compressed arrays for docs, updates and bits=docsWithField).
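
To make the planned structure concrete, here is a minimal sketch of what the 
abstraction could look like. Only the name FieldUpdates and the three parallel 
arrays (docs, updates, docsWithField) come from the comment above; the method 
names, the UpdatesIterator type and the ParallelArrayFieldUpdates class are 
hypothetical, and the arrays are plain rather than compressed to keep the 
example short. It is not the code in the patch.

```java
import java.util.Arrays;

/** Hypothetical abstraction over the resolved updates of a single field. */
abstract class FieldUpdates {

    /** Records an update; a null value means the doc has no value for the field. */
    abstract void add(int doc, Long value);

    /** Iterates updated docs in increasing order; the last update per doc wins. */
    abstract UpdatesIterator iterator();

    interface UpdatesIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
        long value();
        boolean hasValue();
    }
}

/** Assumed implementation: three parallel, uncompressed arrays that grow on demand. */
final class ParallelArrayFieldUpdates extends FieldUpdates {
    private int[] docs = new int[4];
    private long[] values = new long[4];
    private boolean[] docsWithField = new boolean[4];
    private int size;

    @Override
    void add(int doc, Long value) {
        if (size == docs.length) {
            int newLength = size * 2;
            docs = Arrays.copyOf(docs, newLength);
            values = Arrays.copyOf(values, newLength);
            docsWithField = Arrays.copyOf(docsWithField, newLength);
        }
        // Updates are appended in arrival order, so un-ordered writes are cheap.
        docs[size] = doc;
        values[size] = value == null ? 0L : value;
        docsWithField[size] = value != null;
        size++;
    }

    @Override
    UpdatesIterator iterator() {
        // Sort entry indexes by doc. The sort is stable, so entries for the same
        // doc keep their arrival order and the last one is the winner.
        final Integer[] order = new Integer[size];
        for (int i = 0; i < size; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Integer.compare(docs[a], docs[b]));

        return new UpdatesIterator() {
            private int pos = -1; // position in 'order' of the current entry

            @Override
            public int nextDoc() {
                int next = pos + 1;
                if (next >= order.length) {
                    pos = order.length;
                    return NO_MORE_DOCS;
                }
                // Skip earlier updates to the same doc; only the last one counts.
                while (next + 1 < order.length
                        && docs[order[next + 1]] == docs[order[next]]) {
                    next++;
                }
                pos = next;
                return docs[order[pos]];
            }

            @Override
            public long value() {
                return values[order[pos]];
            }

            @Override
            public boolean hasValue() {
                return docsWithField[order[pos]];
            }
        };
    }
}
```

Appending on write and sorting only when the updates are read keeps the write 
path cheap. The boxed Integer[] used for sorting is itself wasteful, and a 
real implementation would sort the primitive arrays in place. Memory still 
grows with the number of updated docs, which is the limitation described 
above.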

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5248.patch, LUCENE-5248.patch
>
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, 
> in that order, and termA affects doc=100 and termB doc=2, then the updates 
> are applied in that order, meaning we cannot rely on updates coming in order.
> # The same document may be updated multiple times, either by the same term 
> (e.g. several calls to IW.updateNDV) or by different terms. The last update 
> wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate on the docs in-order and for each one check if 
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, so the structure 
> needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's 
> better if we know all the fields in which it was updated. The updates are 
> applied to the merged ReaderAndLiveDocs (where they are stored in the first 
> structure mentioned above).
> Comments with proposals will follow next.
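
For illustration, the two structures described in the issue can be sketched 
roughly as follows. The class MapBackedUpdates and its methods are 
hypothetical, not the actual ReaderAndLiveDocs code. Populating both maps on 
every update is a simplification, and a long[] stands in for the current 
DocValues.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder mirroring the two map-based structures in the issue. */
final class MapBackedUpdates {
    // field -> (doc -> value): read sequentially when the updates are written.
    private final Map<String, Map<Integer, Long>> byField = new HashMap<>();
    // doc -> (field -> value): read in doc order when merged updates are applied.
    private final Map<Integer, Map<String, Long>> byDoc = new HashMap<>();

    /** Updates may arrive in any doc order; a later put overwrites an earlier one. */
    void update(String field, int doc, long value) {
        byField.computeIfAbsent(field, f -> new HashMap<>()).put(doc, value);
        byDoc.computeIfAbsent(doc, d -> new HashMap<>()).put(field, value);
    }

    /**
     * Sequential read: visit the docs in order and, for each one, take the
     * updated value if there is one, otherwise keep the current value.
     */
    long[] resolve(String field, long[] currentValues) {
        Map<Integer, Long> updates = byField.get(field);
        long[] result = new long[currentValues.length];
        for (int doc = 0; doc < currentValues.length; doc++) {
            Long updated = updates == null ? null : updates.get(doc);
            result[doc] = updated != null ? updated : currentValues[doc];
        }
        return result;
    }

    /** One-pass access: all fields updated for a doc, or null if there are none. */
    Map<String, Long> fieldsFor(int doc) {
        return byDoc.get(doc);
    }
}
```

Every entry boxes an Integer key and a Long value in addition to the map's own 
per-entry overhead, which is why an update touching millions of documents is 
expensive in this representation.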

--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
