[jira] [Commented] (OAK-3436) Prevent missing checkpoint due to unstable topology from causing complete reindexing

Alex Parvulescu (JIRA) Wed, 09 Dec 2015 02:35:28 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048452#comment-15048452
 ]


Alex Parvulescu commented on OAK-3436:
--------------------------------------

I consider the issue fixed. The backport is still pending, as I'd like to 
run a few more rounds of tests first.

> Prevent missing checkpoint due to unstable topology from causing complete 
> reindexing
> ------------------------------------------------------------------------------------
>
>                 Key: OAK-3436
>                 URL: https://issues.apache.org/jira/browse/OAK-3436
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Chetan Mehrotra
>            Assignee: Alex Parvulescu
>              Labels: candidate_oak_1_0, candidate_oak_1_2, resilience
>             Fix For: 1.3.13, 1.0.26, 1.2.10
>
>         Attachments: AsyncIndexUpdateClusterTest.java, OAK-3436-0.patch, 
> OAK-3436-part2-v2.patch, OAK-3436-part2.patch, OAK-3436-tests.patch, 
> OAK-3436-v2.patch
>
>
> The async indexing logic relies on the embedding application to ensure that 
> the async indexing job runs as a singleton in a cluster. For Sling-based apps 
> this depends on Sling Discovery support. It has been observed that when the 
> topology is not stable, different cluster nodes can each consider themselves 
> the leader and execute the async indexing job concurrently.
> This can cause problems, as the cluster nodes might not see the same 
> repository state (due to write skew and eventual consistency), and one node 
> might remove a checkpoint that another node still relies upon. For example, 
> consider a 2-node cluster, N1 and N2, where both nodes perform async indexing.
> # Base state - CP1 is the checkpoint for "async" job
> # N2 starts indexing and changes the checkpoint from CP1 to CP2, removing 
> CP1. For Mongo, the checkpoints are saved in the {{settings}} collection
> # N1 also decides to execute indexing but has not yet seen the latest 
> repository state, so it still thinks that CP1 is the base checkpoint and 
> tries to read it. However, CP1 has already been removed from {{settings}}, 
> which makes N1 think that the checkpoint is missing, and it decides to 
> reindex everything!
> To avoid this the topology must be stable, but at the Oak level we should 
> still handle such a case and avoid doing a full reindex. So we would need a 
> {{MissingCheckpointStrategy}}, similar to the {{MissingIndexEditorStrategy}} 
> done in OAK-2203.
> Possible approaches:
> # A1 - Fail the indexing run if the checkpoint is missing. A checkpoint can 
> go missing for valid or invalid reasons, so we need to determine the valid 
> scenarios in which a checkpoint can go missing
> # A2 - When a checkpoint is created, also store its creation time. When a 
> checkpoint is found to be missing and it is a *recent* checkpoint, fail the 
> run. For example, we would fail the run as long as the missing checkpoint is 
> less than an hour old (for a just-started instance, take startup time into 
> account)
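
Approach A2 above can be sketched roughly as follows. This is a minimal
illustration only, not the code from the attached patches. All names here
(MissingCheckpointSketch, CheckpointRef, Decision, decide) are hypothetical
and are not Oak APIs. It assumes the indexing job records the checkpoint's
creation time next to the checkpoint id it holds, so the age is still known
after the checkpoint itself has disappeared. The startup-time adjustment
mentioned in A2 is not modeled.

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of approach A2; none of these types exist in Oak. */
public class MissingCheckpointSketch {

    /** Possible outcomes of one async indexing run. */
    public enum Decision { INDEX_FROM_CHECKPOINT, FAIL_RUN, REINDEX }

    /** A missing checkpoint younger than this is treated as suspicious (one hour, per A2). */
    public static final long RECENT_MILLIS = Duration.ofHours(1).toMillis();

    /** What the job remembers about its base checkpoint (assumed to be stored by the job itself). */
    public static final class CheckpointRef {
        public final String id;
        public final long createdMillis;

        public CheckpointRef(String id, long createdMillis) {
            this.id = id;
            this.createdMillis = createdMillis;
        }
    }

    /**
     * Decide how to proceed given the base checkpoint this node believes in
     * and the set of checkpoint ids that currently exist in the store.
     */
    public static Decision decide(CheckpointRef base, Set<String> existing, long nowMillis) {
        if (existing.contains(base.id)) {
            return Decision.INDEX_FROM_CHECKPOINT; // normal incremental run
        }
        long ageMillis = nowMillis - base.createdMillis;
        // A recently created checkpoint that has vanished was probably released
        // by a concurrent indexer, so fail this run instead of reindexing.
        return ageMillis < RECENT_MILLIS ? Decision.FAIL_RUN : Decision.REINDEX;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Set<String> store = new HashSet<>();
        store.add("CP2"); // N2 has already moved on to CP2 and removed CP1

        // N1 still believes CP1 (created five minutes ago) is the base checkpoint.
        CheckpointRef staleView = new CheckpointRef("CP1", now - Duration.ofMinutes(5).toMillis());
        System.out.println(decide(staleView, store, now)); // prints FAIL_RUN
    }
}
```

In this sketch a failed run simply leaves the index untouched, on the
assumption that a later run sees the up-to-date checkpoint (CP2) and proceeds
incrementally. A missing checkpoint older than the threshold still falls back
to a full reindex.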



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
