[ https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18000778#comment-18000778 ]
Tsz-wo Sze commented on RATIS-2315:
-----------------------------------
[~Sammi] and Hao Guo, you are right that Ratis should ignore the exception from
StateMachine.
RATIS-2245 should have fixed it. Ozone in the master branch has been updated to
Ratis 3.2.0. Would you be able to check whether the problem still exists with
Ratis 3.2.0?
> Ignore ExecutionException during take checkpoint check
> ------------------------------------------------------
>
> Key: RATIS-2315
> URL: https://issues.apache.org/jira/browse/RATIS-2315
> Project: Ratis
> Issue Type: Improvement
> Reporter: Sammi Chen
> Priority: Major
>
> When the SCM Raft log is reapplied during restart,
> SCMStateMachine#applyTransaction could execute
> {code:java}
> applyTransactionFuture.completeExceptionally(ex);
> {code}
> for a ContainerStateManagerImpl#addContainer operation once it fails at
> {code:java}
> pipelineManager.addContainerToPipeline(pipelineID, containerID);
> {code}
> The failure message looks like "Cannot add container to
> pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state".
> This didn't crash the SCM when it happened after the SCM had started and was
> running. It also didn't crash every SCM peer in the Raft group. The root cause
> is StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot
> {code:java}
> private void checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures)
>     throws ExecutionException, InterruptedException {
>   // check if need to trigger a snapshot
>   if (shouldTakeSnapshot()) {
>     if (futures.isInitialized()) {
>       JavaUtils.allOf(futures.get()).get();
>     }
>     takeSnapshot();
>   }
> }
> {code}
> When shouldTakeSnapshot() is false, the results of the futures are never
> checked. When shouldTakeSnapshot() is true and one of the futures completed
> exceptionally, checkAndTakeSnapshot throws an ExecutionException, which in
> turn shuts down the Raft server in StateMachineUpdater#run.
> So the behavior differs depending on whether shouldTakeSnapshot() is false or
> true. It would be better to align the two cases. The proposal of this JIRA is
> to ignore the ExecutionException when shouldTakeSnapshot() is true.
> The above problem was reported and co-analyzed by "Hao Guo".
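
For illustration only, the difference described in the quoted issue can be sketched outside Ratis. This is a minimal sketch, not the Ratis implementation and not the RATIS-2245 fix. It assumes the JDK's CompletableFuture.allOf as a stand-in for JavaUtils.allOf, and the class and method names (SnapshotCheckSketch, awaitAll, strictCheckFails, lenientCheckProceeds, sampleFutures) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class SnapshotCheckSketch {
  // Stand-in (assumption) for the quoted JavaUtils.allOf(futures.get()).get() call.
  static void awaitAll(List<CompletableFuture<String>> futures)
      throws ExecutionException, InterruptedException {
    CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).get();
  }

  // Behavior described in the issue: one failed future surfaces as an
  // ExecutionException, which the caller treats as fatal.
  static boolean strictCheckFails(List<CompletableFuture<String>> futures) {
    try {
      awaitAll(futures);
      return false;
    } catch (ExecutionException e) {
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  // Proposed behavior (hypothetical handling): wait for the futures to
  // complete, but ignore a failure so the caller can still take the snapshot.
  static boolean lenientCheckProceeds(List<CompletableFuture<String>> futures) {
    try {
      awaitAll(futures);
    } catch (ExecutionException e) {
      System.err.println("Ignoring failed future: " + e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return true; // the real code would go on to call takeSnapshot() here
  }

  // One successful future and one completed exceptionally, as in the report.
  static List<CompletableFuture<String>> sampleFutures() {
    CompletableFuture<String> ok = CompletableFuture.completedFuture("applied");
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(
        new IllegalStateException("Cannot add container to pipeline in closed state"));
    return Arrays.asList(ok, failed);
  }

  public static void main(String[] args) {
    System.out.println("strict check fails: " + strictCheckFails(sampleFutures()));
    System.out.println("lenient check proceeds: " + lenientCheckProceeds(sampleFutures()));
  }
}
```

In this sketch the lenient variant still waits for every future to complete before proceeding; it only declines to treat a failed future as fatal, which matches the behavior when shouldTakeSnapshot() is false.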
--
This message was sent by Atlassian Jira
(v8.20.10#820010)