[ https://issues.apache.org/jira/browse/IGNITE-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin reassigned IGNITE-20425:
----------------------------------------

    Assignee: Alexander Lapin

> Corrupted Raft FSM state after restart
> --------------------------------------
>
>                 Key: IGNITE-20425
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20425
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Assignee: Alexander Lapin
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> According to the protocol, there are several numeric indexes in the Log / FSM:
> * {{lastLogIndex}} - index of the last logged log entry.
> * {{committedIndex}} - index of the last committed log entry. {{committedIndex <= lastLogIndex}}.
> * {{appliedIndex}} - index of the last log entry processed by the state machine. {{appliedIndex <= committedIndex}}.
> If the committed index is less than the last index, Raft can invoke the "truncate suffix" procedure and delete the uncommitted tail of the log. This is a valid thing to do.
> Now, imagine the following scenario:
> * {{lastIndex == 12}}, {{committedIndex == 11}}
> * Node is restarted.
> * Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}.
> * After recovery, we join the group and receive a "truncate suffix" command to delete the uncommitted entries.
> * We must delete entry 12, but it's already applied. The peer is broken.
> The reason is that we don't use the default recovery procedure: {{org.apache.ignite.raft.jraft.core.NodeImpl#init}}
> Canonical Raft doesn't replay the log before the join is complete.
> A down-to-earth scenario that shows this situation in practice:
> * Start a group with 3 nodes: A, B, and C.
> * We assume that A is the leader.
> * Shut down A; leader re-election is triggered.
> * We assume that B votes for C.
> * C receives the grant from B and proceeds to write the new configuration into its local log.
> * Shut down B before it writes the same log entry (an easily reproducible race).
> * Shut down C.
> * Restart the cluster.
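The invariant violation from the first scenario above ({{lastIndex == 12}}, {{committedIndex == 11}}) can be sketched in a few lines. This is a self-contained illustration, not jraft code; the class and method names ({{RaftLogState}}, {{truncateSuffix}}, {{replayWholeLog}}) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal model of a peer's log and FSM indexes. */
class RaftLogState {
    final List<String> log = new ArrayList<>(); // entry with index i is stored at position i - 1
    long committedIndex; // invariant: committedIndex <= lastLogIndex()
    long appliedIndex;   // invariant: appliedIndex <= committedIndex

    long lastLogIndex() { return log.size(); }

    void append(String entry) { log.add(entry); }

    // Deleting an uncommitted suffix is only safe while the FSM has not applied past it.
    void truncateSuffix(long newLastIndex) {
        if (appliedIndex > newLastIndex) {
            throw new IllegalStateException("FSM already applied entries beyond index " + newLastIndex);
        }
        while (lastLogIndex() > newLastIndex) {
            log.remove(log.size() - 1);
        }
    }

    // Replaying the whole log on restart advances appliedIndex to lastLogIndex,
    // letting it overtake committedIndex -- exactly the bug described above.
    void replayWholeLog() { appliedIndex = lastLogIndex(); }
}

public class CorruptedFsmDemo {
    public static void main(String[] args) {
        RaftLogState peer = new RaftLogState();
        for (int i = 1; i <= 12; i++) {
            peer.append("entry-" + i);
        }
        peer.committedIndex = 11; // lastIndex == 12, committedIndex == 11

        peer.replayWholeLog(); // restart + full replay: appliedIndex == 12

        try {
            peer.truncateSuffix(11); // leader asks to drop uncommitted entry 12
        } catch (IllegalStateException e) {
            System.out.println("peer broken: " + e.getMessage());
        }
    }
}
```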
> Resulting states:
> A - [1: initial cfg]
> B - [1: initial cfg]
> C - [1: initial cfg, 2: re-election]
> h3. How to fix
> Option a. Recover the log after join. This is not optimal; it's like performing local recovery after cluster activation in Ignite 2. We fixed that behavior a long time ago.
> Option b. Somehow track the committed index and perform a partial recovery that guarantees safety. We could write the committed index into the log storage periodically.
> "b" is better, but maybe there are other ways as well.
> h3. Upd #1
> Most likely we can simply remove all that await-log-replay code on Raft node start, because it's no longer needed. It was originally introduced to enable primary replica direct storage reads, which is now covered properly within
> {code:java}
> /**
>  * Tries to read the index from the group leader and waits for this index to appear in the local storage. Can possibly return a failed future
>  * with a timeout exception, and in this case the replica would not answer the placement driver, because the response is useless. The placement
>  * driver should handle this.
>  *
>  * @param expirationTime Lease expiration time.
>  * @return Future that is completed when the local storage catches up with the index that is actual for the leader at the moment of the request.
>  */
> private CompletableFuture<Void> waitForActualState(long expirationTime) {
>     LOG.info("Waiting for actual storage state, group=" + groupId());
>
>     long timeout = expirationTime - currentTimeMillis();
>
>     if (timeout <= 0) {
>         return failedFuture(new TimeoutException());
>     }
>
>     return retryOperationUntilSuccess(raftClient::readIndex, e -> currentTimeMillis() > expirationTime, executor)
>             .orTimeout(timeout, TimeUnit.MILLISECONDS)
>             .thenCompose(storageIndexTracker::waitFor);
> }{code}
> The same applies to RO access: we await the safeTime that has happens-before (HB) relations with the corresponding storage updates.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
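Option "b" from the issue above can be sketched as follows: the committed index is periodically persisted next to the log, and recovery replays only up to it, so the unapplied suffix can still be safely truncated later. This is an illustrative sketch only; {{CheckpointedLog}} and its methods are hypothetical names, not Ignite code, and the checkpoint-on-every-append policy is a simplification:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical log storage that checkpoints the committed index alongside the log. */
class CheckpointedLog {
    final List<String> log = new ArrayList<>();
    long persistedCommittedIndex; // durably stored; in practice written periodically, not on every append

    void append(String entry, long committedIndex) {
        log.add(entry);
        // Simplified policy: checkpoint the committed index with each append.
        persistedCommittedIndex = committedIndex;
    }

    /**
     * Partial recovery: apply only entries known to be committed.
     * The suffix stays unapplied, so a later "truncate suffix" remains safe.
     *
     * @return The applied index after recovery.
     */
    long recover() {
        long appliedIndex = 0;
        for (int i = 0; i < log.size() && i < persistedCommittedIndex; i++) {
            applyToStateMachine(log.get(i));
            appliedIndex = i + 1;
        }
        return appliedIndex;
    }

    void applyToStateMachine(String entry) { /* no-op in this sketch */ }
}
```

With the issue's numbers (12 logged entries, committed index 11), {{recover()}} stops at entry 11, leaving entry 12 unapplied and still removable.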