[ 
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782227#comment-13782227
 ] 

Shai Erera commented on LUCENE-5248:
------------------------------------

I discussed that w/ Mike on chat and here's the plan we came to:

* Don't buffer updates in RALD anymore. As Mike wrote above, one of the reasons 
we call applyDeletes is that IW's RAM buffer limit was reached, so buffering 
updates only moves the RAM elsewhere (into RALD), where it's not accounted 
for.
* Instead, BufferedDeletesStream will build the Map<String,FieldUpdates> 
structure as described above and hand it to RALD.writeFieldUpdates.
* RALD.writeFieldUpdates will execute the portion of the code that is currently 
executed in writeLiveDocs.
** If the segment isn't merging ({{isMerging=false}}), the map is discarded and 
can be GC'd.
** Otherwise, it will need to buffer the resolved updates, so they can later be 
applied to the merged segment (a note on that below).
** That's not bad though, as the buffering is only temporary: once the segment 
finishes merging, or the merge is aborted or fails, the buffer is cleared.
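
To make the control flow in the bullets above concrete, here is a minimal sketch. It is not the actual RALD code: the {{Rald}} and {{FieldUpdates}} classes, the {{mergingUpdates}} field and the {{mergeFinishedOrAborted}} method are hypothetical stand-ins, and the write to the Directory is reduced to a counter.

```java
import java.util.HashMap;
import java.util.Map;

final class FieldUpdatesSketch {
    /** Hypothetical stand-in for the resolved updates of one field (docID -> value). */
    static final class FieldUpdates {
        final Map<Integer, Long> docToValue = new HashMap<>();
    }

    /** Hypothetical stand-in for RALD; only the buffering decision is modeled. */
    static final class Rald {
        boolean isMerging;
        // Resolved updates, retained only while a merge is in flight.
        final Map<String, FieldUpdates> mergingUpdates = new HashMap<>();
        int writes;

        void writeFieldUpdates(Map<String, FieldUpdates> updates) {
            writes++; // placeholder for writing the updated values to the Directory
            if (isMerging) {
                // Keep a resolved copy so it can be applied to the merged segment later.
                for (Map.Entry<String, FieldUpdates> e : updates.entrySet()) {
                    mergingUpdates
                        .computeIfAbsent(e.getKey(), k -> new FieldUpdates())
                        .docToValue.putAll(e.getValue().docToValue); // last update wins
                }
            }
            // Not merging: nothing retains 'updates', so the map can be GC'd.
        }

        /** Called when the merge finishes, is aborted, or fails. */
        void mergeFinishedOrAborted() {
            isMerging = false;
            mergingUpdates.clear();
        }
    }

    public static void main(String[] args) {
        Rald rald = new Rald();
        Map<String, FieldUpdates> updates = new HashMap<>();
        FieldUpdates fu = new FieldUpdates();
        fu.docToValue.put(5, 10L);
        updates.put("f", fu);

        rald.writeFieldUpdates(updates);   // not merging: nothing is buffered
        rald.isMerging = true;
        rald.writeFieldUpdates(updates);   // merging: a resolved copy is buffered
        System.out.println("buffered fields: " + rald.mergingUpdates.keySet());
        rald.mergeFinishedOrAborted();     // buffer is cleared again
    }
}
```

The point of the sketch is that RAM is held by RALD only between the start of a merge and its completion or failure.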

We need to buffer the resolved updates in the {{isMerging}} case because the 
raw form keeps a docIDUpto, which may make no sense after merging. For 
example, say an update is applied to two segments: for _0, docIDUpto=MAX_VAL 
(i.e. it's an already existing segment) and for _1 it's 17 (i.e. it's a newly 
flushed segment where updates should be applied up to doc 17). If you use 
SortingMP, docIDUpto=17 and MAX_VAL become irrelevant: the docs can be 
entirely shuffled, and then you no longer know which docs should receive the 
updates. And if you have SortingMP and deletes, it only becomes more 
complicated.
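
The docIDUpto problem can be shown with a small sketch. Everything here is made up for illustration: the helper names {{resolve}} and {{remap}}, the matching docs, and the "sorting merge" that simply reverses doc order. Treating docIDUpto as an exclusive bound is also an assumption of the sketch.

```java
import java.util.Map;
import java.util.TreeMap;

final class DocIdUptoSketch {
    /** Resolves a raw update into explicit docIDs: every matching doc below docIDUpto. */
    static TreeMap<Integer, Long> resolve(int[] matchingDocs, int docIDUpto, long value) {
        TreeMap<Integer, Long> resolved = new TreeMap<>();
        for (int doc : matchingDocs) {
            if (doc < docIDUpto) { // assumption: docIDUpto is exclusive
                resolved.put(doc, value);
            }
        }
        return resolved;
    }

    /** Carries resolved updates to the merged segment through an old->new docID map. */
    static TreeMap<Integer, Long> remap(Map<Integer, Long> resolved, int[] oldToNew) {
        TreeMap<Integer, Long> merged = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : resolved.entrySet()) {
            merged.put(oldToNew[e.getKey()], e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Segment _1 has 20 docs; the update term matches docs 2, 16 and 18,
        // and docIDUpto=17, so only docs 2 and 16 should receive the update.
        TreeMap<Integer, Long> resolved = resolve(new int[] {2, 16, 18}, 17, 42L);

        // Made-up sorting merge that reverses doc order: old doc i becomes 19 - i.
        int[] oldToNew = new int[20];
        for (int i = 0; i < oldToNew.length; i++) {
            oldToNew[i] = 19 - i;
        }

        // Old doc 18 (must NOT be updated) is now doc 1, and old doc 2 (must be
        // updated) is now doc 17, so "docID < 17" no longer selects the right docs.
        System.out.println(remap(resolved, oldToNew));
    }
}
```

Once the updates are resolved to explicit docIDs, they can follow the docs through whatever mapping the merge produces, which the raw term plus docIDUpto form cannot do.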

I think that for now we should buffer the resolved updates, improve the data 
structure used to buffer them, and handle that later.

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, 
> in that order, and termA affects doc=100 and termB doc=2, then the updates 
> are applied in that order, meaning we cannot rely on updates coming in order.
> # Same document may be updated multiple times, either by same term (e.g. 
> several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate on the docs in-order and for each one check if 
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, so the structure 
> needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit a document once (currently) and so if we can, it's 
> better if we know all the fields in which it was updated. The updates are 
> applied to the merged ReaderAndLiveDocs (where they are stored in the first 
> structure mentioned above).
> Comments with proposals will follow next.



