Nuno Santos created OAK-11232:
---------------------------------
Summary: indexing-job - Simplify download from Mongo logic by
traversing only by _modified instead of (_modified, _id)
Key: OAK-11232
URL: https://issues.apache.org/jira/browse/OAK-11232
Project: Jackrabbit Oak
Issue Type: Improvement
Components: indexing
Reporter: Nuno Santos
The downloader from Mongo in the indexing job traverses the repository in order
of the fields (_modified, _id). In case of disconnection from Mongo, this
allows resuming the download from where it was interrupted without
redownloading any document.
However, we can relax the requirement of not downloading duplicate documents,
because the merge-sort stage of the pipelined strategy discards duplicates.
Therefore, avoiding downloading duplicates is only a performance optimization,
which applies only in the case of disconnections from Mongo, so only in the
relatively rare case of failure.
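
To illustrate why duplicates are harmless downstream, here is a minimal sketch
of how a sorted merge can drop repeated entries. It assumes entries are plain
strings and that duplicates end up adjacent once sorted; the class and method
names are hypothetical and this is not Oak's actual merge-sort code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not taken from the Oak code base. */
final class SortedDedup {
    /** Returns the entries of an already sorted list with adjacent repeats removed. */
    static List<String> dropAdjacentDuplicates(List<String> sorted) {
        List<String> out = new ArrayList<>();
        for (String entry : sorted) {
            // In sorted input, equal entries are neighbours, so comparing
            // against the last emitted entry is enough to detect a duplicate.
            if (out.isEmpty() || !entry.equals(out.get(out.size() - 1))) {
                out.add(entry);
            }
        }
        return out;
    }
}
```

Under these assumptions, a document that was downloaded twice costs one extra
comparison in the merge and never reaches the output.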
Alternatively, we could download ordering only by _modified. This greatly
simplifies the logic of the downloader, as it does not need to track two
fields, and provides a slight performance boost for the normal case: the
download thread no longer needs to decode the _id field from the binary buffer
representing the document, only the _modified field.
In case of failure, we would have to download all documents with the last seen
_modified value, which would likely include some duplicate documents. But this
would likely take just a few seconds. Since _modified has a resolution of
5 seconds, the number of documents with the same _modified value is limited
by how much Oak can write to Mongo in a 5-second window. The downloader
streams the results directly from a Mongo query, so it downloads much faster
than Oak can write. The downloader will therefore likely take much less than
5 seconds to download all the documents with the same _modified value, which
is an acceptable overhead in the rare case of disconnection from Mongo.
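
The resume behaviour described above can be sketched as a small simulation.
It uses an in-memory list instead of Mongo, and the names Doc, query and
download are hypothetical; the only state kept across a disconnection is the
last seen _modified value, and the resume query is assumed to be inclusive
(_modified >= lastSeen).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical simulation; no real Mongo driver calls are used. */
final class ModifiedOnlyResume {
    /** Minimal stand-in for a Mongo document: only the two fields discussed. */
    static final class Doc {
        final String id;
        final long modified;
        Doc(String id, long modified) { this.id = id; this.modified = modified; }
    }

    /** Stand-in for a query "_modified >= from", sorted by _modified ascending. */
    static List<Doc> query(List<Doc> collection, long from) {
        return collection.stream()
                .filter(d -> d.modified >= from)
                .sorted(Comparator.comparingLong((Doc d) -> d.modified))
                .collect(Collectors.toList());
    }

    /**
     * Returns the ids in the order they are received, with one simulated
     * disconnection just before index failAt (negative means no failure).
     */
    static List<String> download(List<Doc> collection, int failAt) {
        List<String> received = new ArrayList<>();
        long lastSeen = Long.MIN_VALUE; // the only resume state that is tracked
        List<Doc> cursor = query(collection, lastSeen);
        for (int i = 0; i < cursor.size(); i++) {
            if (i == failAt) {
                // Resume inclusively: documents sharing lastSeen are sent again.
                for (Doc d : query(collection, lastSeen)) {
                    received.add(d.id);
                }
                return received;
            }
            received.add(cursor.get(i).id);
            lastSeen = cursor.get(i).modified;
        }
        return received;
    }
}
```

With documents a, b, c, d, e whose _modified values are 5, 10, 10, 10, 15 and a
disconnection after three documents, the sketch receives a, b, c, b, c, d, e:
only documents sharing the last seen _modified value are repeated, and nothing
is skipped.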
--
This message was sent by Atlassian Jira
(v8.20.10#820010)