[ 
https://issues.apache.org/jira/browse/LUCENE-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-5248:
-------------------------------

    Attachment: LUCENE-5248.patch

Patch with testTonsOfUpdates: it's a really nasty test, with which I hit OOM when 
running w/ -Dtests.nightly=true and -Dtests.multiplier=3. It first adds many 
documents (a few tens of thousands; with these parameters, 250K) with several 
update terms (so that each term affects many docs) and a few NDV fields. It then 
applies many numeric updates (with these parameters, 20K), but sets IW's RAM 
buffer to 512 bytes, so we get many flushes.

Because the resolved updates are currently held in RALD, and the test doesn't 
trigger any merge while the updates are applied, they just keep accumulating 
there until RAM is exhausted. I should say that even when running the test with 
fewer docs, update terms and updates, I saw memory keep growing, only not enough 
to hit the heap space limit. But perhaps that means we can use RamUsageEstimator 
to assert that IW's RAM consumption doesn't grow continuously, so that we catch 
this even when an OOM isn't hit?
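
The growth pattern is easy to model outside Lucene. A toy sketch in plain Java 
(not Lucene code; only the Map<String,Map<Integer,Long>> shape matches the RALD 
buffer, everything else here is made up): if resolved updates are only ever 
added to the per-reader buffer and never drained except on merge, the number of 
buffered entries grows linearly with the number of updates applied.

```java
import java.util.HashMap;
import java.util.Map;

public class UpdateBufferGrowth {
  public static void main(String[] args) {
    // field -> (doc -> value): the same shape as the buffer held in RALD
    Map<String, Map<Integer, Long>> buffer = new HashMap<>();
    int numUpdates = 2_000, docsPerUpdate = 50;
    for (int u = 0; u < numUpdates; u++) {
      Map<Integer, Long> docs =
          buffer.computeIfAbsent("f" + (u % 10), f -> new HashMap<>());
      // each update term touches many docs; with no drain step the entries
      // only ever accumulate, until a merge (or an OOM) happens
      for (int d = 0; d < docsPerUpdate; d++) {
        docs.put(u * docsPerUpdate + d, (long) u);
      }
    }
    int entries = 0;
    for (Map<Integer, Long> m : buffer.values()) entries += m.size();
    System.out.println("buffered entries: " + entries);
  }
}
```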

I plan to handle it in two steps:

# Stop buffering updates in RALD except in the isMerging case, but still use 
the Map<String,Map<Integer,Long>>. IW's sizeOf should then remain roughly 
stable, and the test shouldn't OOM.
# Reduce the temporary RAM spike (and the buffering for isMerging) by trying 
Map<String,IntToLongMap> (a map over primitives: no compression, but no object 
allocations) and Map<String,FieldUpdates> (with compression, but more 
complicated code).
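
For step 2, a map over primitives avoids a boxed Integer/Long pair per entry. A 
minimal sketch of what an IntToLongMap could look like (open addressing with 
linear probing over parallel primitive arrays; the class name comes from the 
plan above, but this implementation is an illustrative guess, not the actual 
code):

```java
// Illustrative sketch only: an open-addressed int -> long map with linear
// probing; parallel primitive arrays, so no per-entry object allocations.
public class IntToLongMap {
  private int[] keys;
  private long[] values;
  private boolean[] used;
  private int size;

  IntToLongMap(int capacity) {
    // round up to a power of two so we can mask instead of mod
    int cap = Integer.highestOneBit(Math.max(4, capacity - 1)) << 1;
    keys = new int[cap];
    values = new long[cap];
    used = new boolean[cap];
  }

  void put(int key, long value) {
    if (size * 2 >= keys.length) grow(); // keep load factor under 0.5
    int slot = slotFor(key);
    if (!used[slot]) {
      used[slot] = true;
      keys[slot] = key;
      size++;
    }
    values[slot] = value; // same doc updated again: last update wins
  }

  Long get(int key) { // null when the doc has no buffered update
    int slot = slotFor(key);
    return used[slot] ? values[slot] : null;
  }

  private int slotFor(int key) {
    int mask = keys.length - 1;
    int slot = key & mask;
    while (used[slot] && keys[slot] != key) slot = (slot + 1) & mask;
    return slot;
  }

  private void grow() {
    IntToLongMap bigger = new IntToLongMap(keys.length * 2);
    for (int i = 0; i < keys.length; i++) {
      if (used[i]) bigger.put(keys[i], values[i]);
    }
    keys = bigger.keys;
    values = bigger.values;
    used = bigger.used;
  }

  public static void main(String[] args) {
    IntToLongMap m = new IntToLongMap(4);
    for (int doc = 0; doc < 100; doc++) m.put(doc, doc * 10L);
    m.put(42, 7L); // overwrite: last update wins
    System.out.println(m.get(42) + " " + m.get(99) + " " + m.get(1000));
  }
}
```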

> Improve the data structure used in ReaderAndLiveDocs to hold the updates
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-5248
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5248
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5248.patch
>
>
> Currently ReaderAndLiveDocs holds the updates in two structures:
> +Map<String,Map<Integer,Long>>+
> Holds a mapping from each field, to all docs that were updated and their 
> values. This structure is updated when applyDeletes is called, and needs to 
> satisfy several requirements:
> # Un-ordered writes: if a field "f" is updated by two terms, termA and then 
> termB, where termA affects doc=100 and termB affects doc=2, the updates are 
> applied in term order, so we cannot rely on the doc IDs arriving in order.
> # Same document may be updated multiple times, either by same term (e.g. 
> several calls to IW.updateNDV) or by different terms. Last update wins.
> # Sequential read: when writing the updates to the Directory 
> (fieldsConsumer), we iterate on the docs in-order and for each one check if 
> it's updated and if not, pull its value from the current DV.
> # A single update may affect several million documents, so the structure 
> needs to be memory-efficient.
> +Map<Integer,Map<String,Long>>+
> Holds a mapping from a document, to all the fields that it was updated in and 
> the updated value for each field. This is used by IW.commitMergedDeletes to 
> apply the updates that came in while the segment was merging. The 
> requirements this structure needs to satisfy are:
> # Access in doc order: this is how commitMergedDeletes works.
> # One-pass: we (currently) visit each document once, so if possible it's 
> better to know up front all the fields in which it was updated. The updates 
> are applied to the merged ReaderAndLiveDocs (where they are stored in the 
> first structure mentioned above).
> Comments with proposals will follow next.
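
Taken together, the requirements on the first structure amount to: accept 
writes in arbitrary doc order, let the last write per doc win, and then merge 
with the current DV in one sequential pass. A minimal model in plain Java (not 
Lucene's classes; the currentDV array is a stand-in for the values the 
fieldsConsumer would otherwise pull from the existing doc-values field):

```java
import java.util.HashMap;
import java.util.Map;

public class ResolveUpdatesSketch {
  public static void main(String[] args) {
    // field -> (doc -> value): arbitrary write order, last update wins
    Map<String, Map<Integer, Long>> updates = new HashMap<>();
    Map<Integer, Long> f = updates.computeIfAbsent("f", k -> new HashMap<>());
    f.put(3, 10L); // termA affects doc 3
    f.put(0, 20L); // termB affects doc 0: a lower doc id arrives later
    f.put(3, 30L); // doc 3 updated again: last update wins

    // sequential read: iterate docs in order, prefer the buffered update,
    // otherwise fall back to the value in the current doc-values field
    long[] currentDV = {1L, 1L, 1L, 1L};
    long[] merged = new long[currentDV.length];
    for (int doc = 0; doc < currentDV.length; doc++) {
      merged[doc] = f.getOrDefault(doc, currentDV[doc]);
    }
    System.out.println(java.util.Arrays.toString(merged));
  }
}
```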



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
