Re: Supporting "resumable" operations on a large tree

Thomas Mueller Tue, 21 Feb 2017 03:11:18 -0800

Hi,

For re-indexing, there are two problems actually:


* Indexing can take multiple days, so resume would be nice
* For synchronous indexes, indexing creates a large commit, which is
problematic (especially for MongoDB)

To solve both problems ("kill two birds with one stone"), we could instead
try to split indexing into multiple commits. For example use a "fromPath"
.. "toPath" range, and only re-index part of the repository at a time. See
also
https://issues.apache.org/jira/browse/OAK-5324?focusedCommentId=15837941&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15837941
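
As a rough illustration of the range idea (not Oak code): the sketch
below treats the repository as a sorted set of paths and indexes one
"fromPath" .. "toPath" range per commit, remembering the last committed
path so that a restart continues from there. The class RangeReindexer,
its fields, and the use of plain lexicographic path order are all
assumptions made for this example; any stable total order over paths
would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical sketch: re-index in path-range batches, one commit per batch. */
class RangeReindexer {
    /** All node paths in a stable order; stands in for the repository. */
    private final NavigableSet<String> paths;
    /** Paths indexed so far; stands in for the index content. */
    final List<String> indexed = new ArrayList<>();
    /** Last path covered by a completed commit (null = nothing done yet). */
    String checkpoint;
    /** Number of commits performed so far. */
    int commits;

    RangeReindexer(NavigableSet<String> paths) {
        this.paths = paths;
    }

    /**
     * Indexes the next range (fromPath exclusive .. toPath inclusive) of at
     * most batchSize paths. Returns false once nothing is left to index.
     */
    boolean indexNextBatch(int batchSize) {
        String fromPath = checkpoint;
        NavigableSet<String> remaining =
                fromPath == null ? paths : paths.tailSet(fromPath, false);
        List<String> batch = new ArrayList<>();
        for (String p : remaining) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(p);
        }
        if (batch.isEmpty()) {
            return false;
        }
        String toPath = batch.get(batch.size() - 1);
        // In a real system these three updates would form one small commit,
        // and the checkpoint would be persisted as part of it.
        indexed.addAll(batch);
        checkpoint = toPath;
        commits++;
        return true;
    }

    /** Runs until done; can be called again after an interruption. */
    void run(int batchSize) {
        while (indexNextBatch(batchSize)) {
            // each iteration is one bounded commit
        }
    }
}
```

Because the checkpoint only advances together with a batch, calling
run() again after a failure picks up after the last committed range
instead of starting over, and no single commit grows beyond batchSize
entries.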

Regards,
Thomas



On 20/02/17 13:13, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:

>Hi Team,
>
>In Oak we often perform operations which traverse the tree as
>part of some processing, e.g. commit hooks, sidegrade, indexing
>etc. For a small tree this works fine, and in case of failure the
>processing can be done again from the start.
>
>However, for large operations like reindexing the whole repository for
>some index this poses a problem. For example, consider a Mongo setup
>having 100M+ nodes where we need to provision a new index. This would
>trigger an IndexUpdate which would go through all the nodes in the
>repository (in some depth-first manner) and then build up the index.
>This process can take a long time, say 1-2 days for a Mongo-based setup.
>
>As with any remote setup, such a flow may get interrupted due to some
>network issue or outage on the Mongo/RDB side. In such a case the whole
>traversal is restarted from the beginning.
>
>The same would be the case for any sidegrade operation where we convert
>a big repository from one form to another.
>
>To improve the resiliency of such operations (OAK-2063) we need a way
>to "resume" traversal in a tree from some last known point. For
>operations performed on a sorted list such a "resume" is easy but
>doing that over a tree traversal looks tricky.
>
>Thoughts on what approach can be taken for enabling this?
>
>Maybe if we can expect a stable order in traversal at a given
>revision then we can keep track of paths to a certain depth and then on
>retry skip processing of subtrees until we get to that path
>Chetan Mehrotra
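
For the skip-subtrees idea quoted above, a minimal sketch, assuming
children are iterated in a stable (sorted by name) order at a fixed
revision. SketchNode and ResumableTraversal are hypothetical names for
this example and not Oak classes. Given the last processed path, a
pre-order traversal can skip every sibling subtree that sorts before
the resume path and descend only along that path:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory node; children sorted by name for a stable order. */
class SketchNode {
    final TreeMap<String, SketchNode> children = new TreeMap<>();

    /** Returns the named child, creating it if missing. */
    SketchNode child(String name) {
        return children.computeIfAbsent(name, n -> new SketchNode());
    }
}

class ResumableTraversal {
    /**
     * Pre-order traversal returning the paths still to be processed.
     * lastDone holds the name segments of the last processed path
     * (empty list = only the root was processed); null = start from scratch.
     */
    static List<String> traverse(SketchNode root, List<String> lastDone) {
        List<String> out = new ArrayList<>();
        visit(root, "", lastDone, 0, out);
        return out;
    }

    /** resume is non-null only while this node lies on the resume path. */
    private static void visit(SketchNode node, String path,
            List<String> resume, int depth, List<String> out) {
        if (resume == null) {
            // Not on the resume path, so not processed before the failure.
            out.add(path.isEmpty() ? "/" : path);
        }
        // Next segment of the resume path below this node, if any.
        String next = (resume != null && depth < resume.size())
                ? resume.get(depth) : null;
        // Children sorting before 'next' were fully processed already.
        NavigableMap<String, SketchNode> todo = next == null
                ? node.children : node.children.tailMap(next, true);
        for (Map.Entry<String, SketchNode> e : todo.entrySet()) {
            boolean onResumePath = e.getKey().equals(next);
            visit(e.getValue(), path + "/" + e.getKey(),
                    onResumePath ? resume : null, depth + 1, out);
        }
    }
}
```

Only the nodes along the resume path are touched again; everything to
their left is skipped without being read. This relies on the tree not
changing between attempts, which is why a fixed revision is assumed.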
