[ https://issues.apache.org/jira/browse/OAK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048452#comment-15048452 ]
Alex Parvulescu commented on OAK-3436: -------------------------------------- I consider the issue fixed, the backport is still pending as I'd like still to have a few rounds of tests running. > Prevent missing checkpoint due to unstable topology from causing complete > reindexing > ------------------------------------------------------------------------------------ > > Key: OAK-3436 > URL: https://issues.apache.org/jira/browse/OAK-3436 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: query > Reporter: Chetan Mehrotra > Assignee: Alex Parvulescu > Labels: candidate_oak_1_0, candidate_oak_1_2, resilience > Fix For: 1.3.13, 1.0.26, 1.2.10 > > Attachments: AsyncIndexUpdateClusterTest.java, OAK-3436-0.patch, > OAK-3436-part2-v2.patch, OAK-3436-part2.patch, OAK-3436-tests.patch, > OAK-3436-v2.patch > > > Async indexing logic relies on embedding application to ensure that async > indexing job is run as a singleton in a cluster. For Sling based apps it > depends on Sling Discovery support. At times it is being seen that if > topology is not stable then different cluster nodes can consider them as > leader and execute the async indexing job concurrently. > This can cause problem as both cluster node might not see same repository > state (due to write skew and eventual consistency) and might remove the > checkpoint which other cluster node is still relying upon. For e.g. consider > a 2 node cluster N1 and N2 where both are performing async indexing. > # Base state - CP1 is the checkpoint for "async" job > # N2 starts indexing and removes changes CP1 to CP2. For Mongo the > checkpoints are saved in {{settings}} collection > # N1 also decides to execute indexing but has yet not seen the latest > repository state so still thinks that CP1 is the base checkpoint and tries to > read it. However CP1 is already removed from {{settings}} and this makes N1 > think that checkpoint is missing and it decides to reindex everything! > To avoid this topology must be stable but at Oak level we should still handle > such a case and avoid doing a full reindexing. So we would need to have a > {{MissingCheckpointStrategy}} similar to {{MissingIndexEditorStrategy}} as > done in OAK-2203 > Possible approaches > # A1 - Fail the indexing run if checkpoint is missing - Checkpoint being > missing can have valid reason and invalid reason. Need to see what are valid > scenarios where a checkpoint can go missing > # A2 - When a checkpoint is created also store the creation time. When a > checkpoint is found to be missing and its a *recent* checkpoint then fail the > run. For e.g. we would fail the run till checkpoint found to be missing is > less than an hour old (for just started take startup time into account) -- This message was sent by Atlassian JIRA (v6.3.4#6332)