[ 
https://issues.apache.org/jira/browse/IGNITE-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin reassigned IGNITE-20425:
----------------------------------------

    Assignee: Alexander Lapin

> Corrupted Raft FSM state after restart
> --------------------------------------
>
>                 Key: IGNITE-20425
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20425
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Assignee: Alexander Lapin
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> According to the protocol, there are several numeric indexes in the Log / FSM:
>  * {{lastLogIndex}} - index of the last entry written to the log.
>  * {{committedIndex}} - index of the last committed log entry. {{committedIndex <= lastLogIndex}}.
>  * {{appliedIndex}} - index of the last log entry processed by the state machine. {{appliedIndex <= committedIndex}}.
> If the committed index is less than the last index, Raft can invoke the "truncate suffix" procedure and delete the uncommitted tail of the log. This is a valid thing to do.
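> For illustration only, here is a minimal sketch of the invariant that makes "truncate suffix" safe (hypothetical names, not Ignite or JRaft code): the truncated suffix must lie strictly above {{appliedIndex}}, otherwise the state machine has already consumed entries that the log is about to drop.
> {code:java}
> // Minimal illustrative sketch, not Ignite/JRaft code; all names are hypothetical.
> final class LogIndexesSketch {
>     long lastLogIndex;   // index of the last entry written to the log
>     long committedIndex; // committedIndex <= lastLogIndex
>     long appliedIndex;   // appliedIndex <= committedIndex (in a healthy state)
> 
>     /** Removes entries [firstIndexToRemove, lastLogIndex]; only safe while they are not applied yet. */
>     void truncateSuffix(long firstIndexToRemove) {
>         assert firstIndexToRemove > committedIndex : "Committed entries must never be truncated";
> 
>         if (firstIndexToRemove <= appliedIndex) {
>             // The corrupted state described below: the FSM has already applied an entry
>             // that is about to disappear from the log.
>             throw new IllegalStateException("FSM is ahead of the truncated log");
>         }
> 
>         lastLogIndex = firstIndexToRemove - 1;
>     }
> }
> {code}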
> Now, imagine the following scenario:
>  * {{lastIndex == 12}}, {{committedIndex == 11}}
>  * Node is restarted.
>  * Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}.
>  * After recovery, we join the group and receive a "truncate suffix" command in order to delete uncommitted entries.
>  * We must delete entry 12, but it has already been applied. The peer is broken.
> The reason is that we don't use the default recovery procedure: {{org.apache.ignite.raft.jraft.core.NodeImpl#init}}.
> Canonical Raft doesn't replay the log before the join is complete.
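> As a rough illustration of the ordering difference (hypothetical stubs, not the actual jraft API):
> {code:java}
> // Illustrative only; every name here is a hypothetical stub, not the jraft API.
> final class RecoveryOrderSketch {
>     /** Current behaviour: eagerly replay the whole local log before joining the group. */
>     static void currentRecovery(RaftNodeSketch node) {
>         node.replayLogUpTo(node.lastLogIndex()); // appliedIndex may now point at uncommitted entries
>         node.joinGroup();                        // a later "truncate suffix" can no longer be honoured safely
>     }
> 
>     /** Canonical ordering (what NodeImpl#init relies on): join first, then apply committed entries only. */
>     static void canonicalRecovery(RaftNodeSketch node) {
>         node.joinGroup();                          // the leader may truncate our uncommitted tail here
>         node.replayLogUpTo(node.committedIndex()); // appliedIndex never gets ahead of committedIndex
>     }
> 
>     /** Hypothetical facade over a raft node, only to make the sketch self-contained. */
>     interface RaftNodeSketch {
>         long lastLogIndex();
>         long committedIndex();
>         void replayLogUpTo(long index);
>         void joinGroup();
>     }
> }
> {code}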
> A down-to-earth scenario that shows this situation in practice:
>  * Start a group with 3 nodes: A, B, and C.
>  * We assume that A is the leader.
>  * Shut down A; leader re-election is triggered.
>  * We assume that B votes for C.
>  * C receives the vote grant from B and proceeds to write the new configuration into its local log.
>  * Shut down B before it writes the same log entry (an easily reproducible race).
>  * Shut down C.
>  * Restart the cluster.
> Resulting states:
>  * A - [1: initial cfg]
>  * B - [1: initial cfg]
>  * C - [1: initial cfg, 2: re-election]
> h3. How to fix
> Option a. Recover the log after join. This is not optimal; it's like performing local recovery after cluster activation in Ignite 2, a behavior we fixed a long time ago.
> Option b. Track the committed index and perform a partial recovery that guarantees safety. We could write the committed index into the log storage periodically, as sketched below.
> Option "b" is better, but maybe there are other ways as well.
> h3. Upd #1
> Most likely we can simply remove the "await log replay" code on Raft node start, because it's no longer needed. It was originally introduced to enable primary-replica direct storage reads, which is now covered properly by
> {code:java}
> /**
>  * Tries to read index from group leader and wait for this index to appear in local storage. Can possibly return a failed future with
>  * a timeout exception, and in this case, the replica would not answer to the placement driver, because the response is useless. The
>  * placement driver should handle this.
>  *
>  * @param expirationTime Lease expiration time.
>  * @return Future that is completed when local storage catches up with the index that is actual for the leader at the moment of the request.
>  */
> private CompletableFuture<Void> waitForActualState(long expirationTime) {
>     LOG.info("Waiting for actual storage state, group=" + groupId());
> 
>     long timeout = expirationTime - currentTimeMillis();
> 
>     if (timeout <= 0) {
>         return failedFuture(new TimeoutException());
>     }
> 
>     return retryOperationUntilSuccess(raftClient::readIndex, e -> currentTimeMillis() > expirationTime, executor)
>             .orTimeout(timeout, TimeUnit.MILLISECONDS)
>             .thenCompose(storageIndexTracker::waitFor);
> }
> {code}
> Something similar applies to RO access: we await the safeTime that has a happens-before (HB) relation with the corresponding storage updates.
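> For completeness, a sketch of that RO path with hypothetical names (not the actual Ignite API): a read-only request at timestamp {{ts}} is served only after the local safeTime reaches {{ts}}, which gives the HB relation with the storage updates the read may observe.
> {code:java}
> // Illustrative sketch only; SafeTimeTrackerSketch is hypothetical, not an Ignite class.
> import java.util.concurrent.CompletableFuture;
> import java.util.function.Supplier;
> 
> final class RoReadSketch {
>     private final SafeTimeTrackerSketch safeTime;
> 
>     RoReadSketch(SafeTimeTrackerSketch safeTime) {
>         this.safeTime = safeTime;
>     }
> 
>     /** Serves an RO read only after every update with a timestamp <= readTimestamp is in local storage. */
>     CompletableFuture<byte[]> readOnly(long readTimestamp, Supplier<byte[]> storageRead) {
>         return safeTime.waitFor(readTimestamp).thenApply(unused -> storageRead.get());
>     }
> 
>     interface SafeTimeTrackerSketch {
>         /** Completes when the local safeTime becomes >= the given timestamp. */
>         CompletableFuture<Void> waitFor(long timestamp);
>     }
> }
> {code}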



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
