[
https://issues.apache.org/jira/browse/OAK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nuno Santos resolved OAK-11238.
-------------------------------
Fix Version/s: 1.72.0
Resolution: Done
> indexing-job - de-duplicate entries in sorted batches when saving them to disk
> ------------------------------------------------------------------------------
>
> Key: OAK-11238
> URL: https://issues.apache.org/jira/browse/OAK-11238
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: indexing
> Reporter: Nuno Santos
> Priority: Minor
> Fix For: 1.72.0
>
>
> The pipelined strategy writes several batches of sorted node state entries
> to intermediate files before the final merge phase.
> In some cases, these batches may contain duplicate entries (after a
> disconnection from Mongo, or with parallel download when the streams cross).
> Because each batch is sorted before being written to disk, duplicates are
> adjacent and it is very cheap to remove them: before writing an entry,
> compare it with the previous one and skip it if they are equal.
> This is a best-effort attempt at avoiding duplicates, but as it is cheap and
> simple, I believe it is worth implementing.
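
The check described above could be sketched roughly as follows. This is only an illustration, not the code committed for this issue: the class name SortedBatchWriter, the method writeDeduplicated, and the one-entry-per-line String representation are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical helper for illustration; not a class from Jackrabbit Oak.
final class SortedBatchWriter {
    private SortedBatchWriter() {}

    /**
     * Writes an already-sorted batch, one entry per line, skipping every
     * entry that is equal to the entry immediately before it.
     * Assumes entries are non-null. Returns the number of entries written.
     */
    static int writeDeduplicated(List<String> sortedEntries, Writer out) {
        String previous = null;
        int written = 0;
        try {
            for (String entry : sortedEntries) {
                if (entry.equals(previous)) {
                    continue; // same as the preceding entry: drop it
                }
                out.write(entry);
                out.write('\n');
                previous = entry;
                written++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

Because only adjacent entries are compared, the cost is one extra equality check per entry and no additional memory. A sketch like this only removes duplicates within a single batch; duplicates spread across different batches would still reach the merge phase, which matches the best-effort framing above.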