[PR] OAK-11238 - indexing-job: de-duplicate entries when writing to disk the intermediate sorted files with FFS contents. [jackrabbit-oak]

via GitHub Wed, 30 Oct 2024 08:03:37 -0700


nfsantos opened a new pull request, #1835:
URL: https://github.com/apache/jackrabbit-oak/pull/1835


   The pipelined strategy writes several batches of sorted node state entries 
as intermediate files before the final merge phase.
   In some cases, these batches may contain duplicate entries (after a 
disconnection from Mongo, or with parallel download when the streams cross). As 
the batches are sorted before being written to disk, it is very cheap to 
de-duplicate entries: before writing an entry, compare it with the previous one.
   This is a best-effort attempt at avoiding duplicates, but as it is cheap and 
simple, I believe it is worth implementing.
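
   The comparison with the previous entry can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class name `SortedBatchDedup`, the method `writeDeduplicated`, and the `path|json` line format used in the sample data are assumptions made for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

public class SortedBatchDedup {

    /**
     * Writes already-sorted entries one per line, skipping any entry equal to
     * the one written immediately before it. Returns the number of skipped
     * duplicates. Hypothetical helper, not part of the Oak API.
     */
    public static int writeDeduplicated(List<String> sortedEntries, Writer out) throws IOException {
        BufferedWriter writer = new BufferedWriter(out);
        String previous = null;
        int skipped = 0;
        for (String entry : sortedEntries) {
            // In a sorted batch, equal entries are adjacent, so a single
            // comparison with the last written entry is enough.
            if (entry.equals(previous)) {
                skipped++;
                continue;
            }
            writer.write(entry);
            writer.write('\n');
            previous = entry;
        }
        writer.flush();
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        // Sample batch in an assumed "path|json" format, already sorted.
        List<String> batch = List.of("/a|{}", "/a/b|{}", "/a/b|{}", "/c|{}");
        StringWriter out = new StringWriter();
        int skipped = writeDeduplicated(batch, out);
        System.out.print(out);
        System.out.println("skipped=" + skipped);
    }
}
```

   The check only needs to remember one entry, so it adds one comparison per write. It only catches duplicates that land in the same batch, which fits the best-effort description above: a duplicate whose copies end up in different intermediate files would not be detected at this stage.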


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

