[ https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782136#comment-13782136 ]

Michael McCandless commented on LUCENE-5248:
--------------------------------------------

I think the lack of RAM tracking in RALD is an important thing to fix,
separately from optimizing how RALD uses its RAM.  Especially the
spooky case where an update that consumes a tiny amount of RAM can
resolve to millions of documents, consuming tons of RAM in RALD.
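
Something like the following is the kind of accounting I have in mind
(a rough sketch only; the class and method names here are made up, not
the actual RALD code):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: track approximate bytes used by the buffered
// field -> (docID -> value) updates, so IW's RAM accounting can see it.
class BufferedNumericUpdates {

  private final Map<String,Map<Integer,Long>> updates =
      new HashMap<String,Map<Integer,Long>>();

  // rough per-entry cost: HashMap.Entry + boxed Integer key + boxed Long value
  private static final long BYTES_PER_ENTRY = 48 + 16 + 24;

  private final AtomicLong bytesUsed = new AtomicLong();

  void addUpdate(String field, int doc, long value) {
    Map<Integer,Long> docUpdates = updates.get(field);
    if (docUpdates == null) {
      docUpdates = new HashMap<Integer,Long>();
      updates.put(field, docUpdates);
    }
    if (docUpdates.put(doc, value) == null) {
      // only count brand-new entries; an overwrite reuses the slot
      bytesUsed.addAndGet(BYTES_PER_ENTRY);
    }
  }

  long ramBytesUsed() {
    return bytesUsed.get();
  }
}
{code}

IW could then fold ramBytesUsed() into its flush accounting, similar to
how the buffered delete terms/queries are accounted today.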

Today, there are three reasons why we call
BufferedDeletesStream.applyDeletes, which "resolves" the Term/Query
passed to deleteDocuments/updateNumericDocValue:

  * We've hit IW's RAM buffer

  * We're opening a new NRT reader, and applyAllDeletes=true

  * A merge is kicking off

As things stand now, the first case will resolve the updates and move
them into RALD but not write them to disk, while the other two cases
will write them to disk and clear RALD's maps, I think?  Maybe a simple
fix is to also write to disk in case 1?

But, if the segment is large, we can still have a big spike as we
populate those Maps with millions of docs' worth of updates?
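
Back-of-the-envelope, with my own rough constants (an assumption, not a
measurement of the real maps), the spike could look like:

{code:java}
public class UpdateRamSpike {
  public static void main(String[] args) {
    // Rough estimate only; the per-entry constant is an assumption.
    long docsResolved = 10000000L;        // one term resolving to 10M docs
    long bytesPerEntry = 48 + 16 + 24;    // HashMap.Entry + boxed Integer + boxed Long
    long spikeBytes = docsResolved * bytesPerEntry;
    System.out.println(spikeBytes / (1024 * 1024) + " MB");  // prints roughly 839 MB
  }
}
{code}

So a single update term hitting a big segment can easily mean hundreds
of MB of transient heap for one field.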

bq. I don't see a reason why we need to resolve the updates when we register 
them with RALD 

If we "resolve & move the updates to disk" as a single operation (ie,
fix the first case above), then I think we can just keep the logic in
BD, but have it immediately move the updates to disk, rather than
buffer them up in RALD, except in the "this segment is merging" case?
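
Very roughly, the flow I'm picturing (a pseudocode-level sketch; the
types and method names here are invented stand-ins, not the real IW/BD
API):

{code:java}
import java.io.IOException;

// Sketch only: all names here are illustrative stand-ins.
interface ResolvedUpdates {}          // Term/Query already resolved to docID -> value

interface ReaderAndLiveDocsLike {
  void bufferUpdates(ResolvedUpdates updates);                         // keep in RAM
  void writeFieldUpdates(ResolvedUpdates updates) throws IOException;  // new DV gen on disk
}

class ApplyUpdatesSketch {
  void apply(ReaderAndLiveDocsLike rld, ResolvedUpdates resolved,
             boolean segmentIsMerging) throws IOException {
    if (segmentIsMerging) {
      // a merge is running: buffer so commitMergedDeletes can carry the
      // updates over to the merged segment when the merge finishes
      rld.bufferUpdates(resolved);
    } else {
      // no merge in flight: write straight to disk and keep nothing per-doc in RAM
      rld.writeFieldUpdates(resolved);
    }
  }
}
{code}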

bq. When applying them in writeLiveDocs, we will manage multiple DocsEnums (one 
per NumericUpdate.term) and iterate them in order

I think this is a neat idea, though I would worry about the RAM
required for N DocsEnums where N is possibly quite large...
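
For reference, the kind of ordered walk that needs, sketched over a toy
iterator interface (not the real DocsEnum API); it's the per-enum state
behind each of the N iterators that worries me:

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy sketch of visiting N per-term doc iterators in global docID order.
// DocIdIterator stands in for DocsEnum; a real enum also holds term state
// and read buffers, which is where the RAM for N of them adds up.
class OrderedUpdatesSketch {

  interface DocIdIterator {
    int nextDoc();  // Integer.MAX_VALUE when exhausted
  }

  static void forEachDocInOrder(List<DocIdIterator> enums) {
    // heap entry: {currentDoc, indexIntoEnums}
    PriorityQueue<int[]> pq = new PriorityQueue<int[]>(
        Math.max(1, enums.size()),
        new Comparator<int[]>() {
          public int compare(int[] a, int[] b) {
            return Integer.compare(a[0], b[0]);
          }
        });
    for (int i = 0; i < enums.size(); i++) {
      int doc = enums.get(i).nextDoc();
      if (doc != Integer.MAX_VALUE) {
        pq.add(new int[] { doc, i });
      }
    }
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      int doc = top[0], idx = top[1];
      // ... apply the update that enum 'idx' carries for 'doc'; ties between
      // enums on the same doc need breaking by update order ("last wins") ...
      int next = enums.get(idx).nextDoc();
      if (next != Integer.MAX_VALUE) {
        pq.add(new int[] { next, idx });
      }
    }
  }
}
{code}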


> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and termB, 
> in that order, and termA affects doc=100 and termB doc=2, then the updates 
> are applied in that order, meaning we cannot rely on updates arriving in doc order.
> # The same document may be updated multiple times, either by the same term 
> (e.g. several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate over the docs in order and for each one check 
> whether it was updated; if not, we pull its value from the current DV.
> # A single update may affect several million documents, therefore the 
> structure needs to be efficient w.r.t. memory consumption.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document to all the fields in which it was updated and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we visit each document once (currently), so if we can, it's 
> better to know all the fields in which it was updated. The updates are 
> applied to the merged ReaderAndLiveDocs (where they are stored in the first 
> structure mentioned above).
> Comments with proposals will follow next.
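
Just to make requirement 4 above concrete (my own illustration, not
part of the issue text): one cheaper shape for the first structure is
per-field parallel arrays of docIDs and values, e.g.:

{code:java}
import java.util.Arrays;

// Toy sketch only: a compact per-field buffer of (docID, value) pairs,
// roughly 12 bytes per update instead of a boxed HashMap entry.
class PackedFieldUpdatesSketch {
  private int[] docs = new int[16];
  private long[] values = new long[16];
  private int size;

  // Un-ordered writes and duplicate docs are fine; a sort + "keep last per doc"
  // pass (omitted here) before writing would give the sequential read order.
  void add(int doc, long value) {
    if (size == docs.length) {
      docs = Arrays.copyOf(docs, size * 2);
      values = Arrays.copyOf(values, size * 2);
    }
    docs[size] = doc;
    values[size] = value;
    size++;
  }

  long ramBytesUsed() {
    return 4L * docs.length + 8L * values.length;  // int + long array payloads
  }
}
{code}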


