Nuno Santos created OAK-11232:
---------------------------------

             Summary: indexing-job - Simplify download from Mongo logic by 
traversing only by _modified instead of (_modified, _id)
                 Key: OAK-11232
                 URL: https://issues.apache.org/jira/browse/OAK-11232
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: indexing
            Reporter: Nuno Santos


The downloader from Mongo in the indexing job traverses the repository in order 
of the fields (_modified, _id). If the connection to Mongo is lost, this 
allows resuming the download from where it was interrupted without 
downloading any document twice.

However, we can relax the requirement of not downloading duplicate documents, 
because the merge-sort stage of the pipelined strategy discards duplicates. 
Therefore, avoiding downloading duplicates is only a performance optimization, 
which applies only in the case of disconnections from Mongo, so only in the 
relatively rare case of failure.
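
As an illustration of why duplicates are harmless downstream, here is a minimal 
sketch of dropping duplicates from an already sorted sequence in a single pass. 
It is not the actual Oak merge-sort code; the class name, the method name and 
the sample paths are hypothetical, and it assumes duplicate entries compare 
equal and end up adjacent after sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupSketch {
    // Drops adjacent equal entries from an already sorted list. In sorted
    // input, duplicates are always adjacent, so one pass is enough.
    static List<String> dropAdjacentDuplicates(List<String> sorted) {
        List<String> out = new ArrayList<>();
        for (String entry : sorted) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(entry)) {
                out.add(entry);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "/a/b" was downloaded twice, for example after a reconnect.
        List<String> merged = Arrays.asList("/a", "/a/b", "/a/b", "/c");
        System.out.println(dropAdjacentDuplicates(merged)); // prints [/a, /a/b, /c]
    }
}
```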

Alternatively, we could download ordering only by _modified. This greatly 
simplifies the logic of the downloader, as it does not need to track two 
fields, and provides a slight performance boost for the normal case: the 
download thread no longer needs to decode the _id field from the binary buffer 
representing the document, only the _modified field.
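
The difference between the two resume conditions can be sketched with plain 
Java predicates. This is only an illustration, not the downloader's code: the 
class and method names are hypothetical, the sample _id values are made up, and 
String.compareTo stands in for whatever ordering Mongo applies to _id.

```java
public class ResumeFilterSketch {
    // Current approach: resume strictly after the last (_modified, _id) pair,
    // so no document that was already seen is returned again.
    static boolean afterCompoundKey(long modified, String id,
                                    long lastModified, String lastId) {
        return modified > lastModified
                || (modified == lastModified && id.compareTo(lastId) > 0);
    }

    // Proposed approach: resume from the last seen _modified value, inclusive.
    // Only one field is tracked, but documents sharing that value come back again.
    static boolean fromLastModified(long modified, long lastModified) {
        return modified >= lastModified;
    }

    public static void main(String[] args) {
        long lastModified = 100;
        String lastId = "2:/a/b";
        // A document that was already downloaded before the disconnection.
        System.out.println(afterCompoundKey(100, "2:/a/a", lastModified, lastId)); // prints false
        System.out.println(fromLastModified(100, lastModified));                   // prints true
    }
}
```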

In case of failure, we would have to download all documents with the last seen 
_modified value, which would likely include some duplicate documents. But this 
would likely take just a few seconds. Consider that _modified has a resolution 
of 5 seconds, so the number of documents with the same _modified value is 
limited by how much Oak can write to Mongo in a 5-second window. The 
downloader streams the results directly using a Mongo query, so it 
downloads much faster than Oak can write. So the downloader will likely 
take much less than 5 seconds to download all the documents with the same 
_modified value, which is an acceptable overhead in the rare case of 
disconnection from Mongo.
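
The cost of a reconnect can be sketched with a small simulation. It is a toy 
model, not the real downloader: documents are represented only by their 
_modified value and their position in the sorted result, and the class name, 
method name and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ResumeByModifiedSketch {
    // Simulates a download over documents sorted by _modified that is
    // interrupted once after failAfter documents and then resumed with
    // "_modified >= last seen value". Returns the positions of all documents
    // received, including the ones received twice.
    static List<Integer> downloadWithOneDisconnect(long[] modified, int failAfter) {
        List<Integer> received = new ArrayList<>();
        long lastSeenModified = Long.MIN_VALUE;
        // First attempt, cut short by the disconnection.
        for (int i = 0; i < Math.min(failAfter, modified.length); i++) {
            received.add(i);
            lastSeenModified = modified[i];
        }
        // Resume: only the last seen _modified value is needed.
        for (int i = 0; i < modified.length; i++) {
            if (modified[i] >= lastSeenModified) {
                received.add(i);
            }
        }
        return received;
    }

    public static void main(String[] args) {
        long[] modified = {10, 10, 10, 15};
        List<Integer> received = downloadWithOneDisconnect(modified, 2);
        int unique = new HashSet<>(received).size();
        // Two documents with _modified == 10 are downloaded twice.
        System.out.println(received.size() + " received, " + unique + " unique"); // prints 6 received, 4 unique
    }
}
```

In this model the duplicates are bounded by the number of documents that share 
the last seen _modified value, and no document is missed.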

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
