[ 
https://issues.apache.org/jira/browse/OAK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nuno Santos resolved OAK-11238.
-------------------------------
    Fix Version/s: 1.72.0
       Resolution: Done

> indexing-job - de-duplicate entries in sorted batches when saving them to disk
> ------------------------------------------------------------------------------
>
>                 Key: OAK-11238
>                 URL: https://issues.apache.org/jira/browse/OAK-11238
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: indexing
>            Reporter: Nuno Santos
>            Priority: Minor
>             Fix For: 1.72.0
>
>
> The pipelined strategy writes several batches of sorted node state entries 
> to disk as intermediate files before the final merge phase.
> In some cases, these batches may contain duplicate entries (after a 
> disconnection from Mongo, or when the streams cross during a parallel 
> download). Because each batch is sorted before it is written to disk, 
> duplicate entries are adjacent, so de-duplicating them is very cheap: before 
> writing an entry, compare it with the previous one. 
> This is a best-effort attempt at avoiding duplicates, but as it is cheap and 
> simple, I believe it is worth implementing.
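
The compare-with-previous check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the actual change: the class name SortedBatchDedup, the method writeDeduplicated, and the "path|json" entry format are all assumptions made for the example, and entries are treated as plain strings.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;

public class SortedBatchDedup {

    // Hypothetical helper (not part of Oak's API): writes an already sorted
    // batch, skipping every entry equal to the one written just before it.
    // Returns the number of entries actually written.
    static int writeDeduplicated(List<String> sortedEntries, PrintWriter out) {
        String previous = null;
        int written = 0;
        for (String entry : sortedEntries) {
            // In a sorted batch, equal entries are adjacent, so one
            // comparison with the previous entry is enough.
            if (entry.equals(previous)) {
                continue;
            }
            out.print(entry);
            out.print('\n');
            previous = entry;
            written++;
        }
        out.flush();
        return written;
    }

    public static void main(String[] args) {
        // Assumed entry format "path|json", used here only for illustration.
        List<String> batch = List.of("/a|{}", "/a|{}", "/a/b|{}", "/c|{}", "/c|{}");
        StringWriter buffer = new StringWriter();
        int written = writeDeduplicated(batch, new PrintWriter(buffer));
        System.out.println(written + " entries written");
        System.out.print(buffer);
    }
}
```

The check needs only one extra reference per batch and one equality comparison per entry. It removes duplicates only within a single batch, which matches the best-effort framing of the issue.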



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
