[jira] [Created] (OAK-11238) indexing-job - de-duplicate entries in sorted batches when saving them to disk

Nuno Santos (Jira) Wed, 30 Oct 2024 07:51:35 -0700

Nuno Santos created OAK-11238:
---------------------------------

             Summary: indexing-job - de-duplicate entries in sorted batches 
when saving them to disk
                 Key: OAK-11238
                 URL: https://issues.apache.org/jira/browse/OAK-11238
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: indexing
            Reporter: Nuno Santos



The pipelined strategy writes several batches of sorted node state entries as 
intermediate files before the final merge phase.
In some cases, these batches may contain duplicate entries (for example, after 
a disconnection from Mongo, or when the streams cross during a parallel 
download). Because each batch is sorted before being written to disk, 
duplicates end up next to each other, so de-duplicating is very cheap: before 
writing an entry, compare it with the previous one and skip it if they are 
equal. This is only a best-effort attempt at avoiding duplicates, but as it is 
cheap and simple, I believe it is worth implementing.
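
The compare-with-previous idea could look roughly like the sketch below. It 
is only an illustration, not the actual Oak code: the class name 
SortedBatchDedup, the method writeDeduplicated, and the representation of each 
entry as a single String line are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical helper, not part of Oak: writes an already-sorted batch and
// drops entries that are identical to the entry written just before them.
final class SortedBatchDedup {

    private SortedBatchDedup() {
    }

    /**
     * Writes the entries one per line, skipping consecutive duplicates.
     * The input is assumed to be sorted, so equal entries are adjacent.
     *
     * @return the number of entries actually written
     */
    static int writeDeduplicated(List<String> sortedEntries, Writer out) {
        String previous = null;
        int written = 0;
        try {
            for (String entry : sortedEntries) {
                // Same as the entry just written: skip it.
                if (entry.equals(previous)) {
                    continue;
                }
                out.write(entry);
                out.write('\n');
                previous = entry;
                written++;
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

The extra cost per entry is one equality check and one retained reference, 
which matches the "cheap and simple" argument above. The sketch only removes 
duplicates inside the batch it is given.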



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
