[ 
https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18001368#comment-18001368
 ] 

Sammi Chen commented on RATIS-2315:
-----------------------------------

[~szetszwo], if I understand RATIS-2245 correctly, it fixes the problem that the 
state machine doesn't wait for some raft log applies to finish before taking the 
snapshot. But it looks like it doesn't ignore ExecutionException: both 
waitForCommit and checkAndTakeSnapshot can still throw ExecutionException, 
which leads to "server.close();" being called. 

> Ignore ExecutionException during take checkpoint check
> ------------------------------------------------------
>
>                 Key: RATIS-2315
>                 URL: https://issues.apache.org/jira/browse/RATIS-2315
>             Project: Ratis
>          Issue Type: Improvement
>          Components: StateMachine
>            Reporter: Sammi Chen
>            Priority: Major
>
> When SCM reapplies the raft log during restart, 
> SCMStateMachine#applyTransaction could execute 
> {code:java}
> "applyTransactionFuture.completeExceptionally(ex);"
> {code}
> for the ContainerStateManagerImpl#addContainer operation, once it fails at 
> {code:java}
> pipelineManager.addContainerToPipeline(pipelineID, containerID);
> {code}
> The failure message is like "Cannot add container to 
> pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". 
> This didn't crash the SCM if it happens after SCM has started and is running. 
> It also didn't crash every peer of SCM in the raft group. The root cause is 
> StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot
> {code:java}
> private void checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures)
>     throws ExecutionException, InterruptedException {
>   // check if need to trigger a snapshot
>   if (shouldTakeSnapshot()) {
>     if (futures.isInitialized()) {
>       JavaUtils.allOf(futures.get()).get();
>     }
>     takeSnapshot();
>   }
> }
> {code}
> When shouldTakeSnapshot() is false, it doesn't care about the futures' results. 
> When shouldTakeSnapshot() is true, if one of the futures throws an exception, 
> checkAndTakeSnapshot will throw ExecutionException, which in turn shuts down 
> the raft server in StateMachineUpdater#run. 
> So the behavior differs depending on whether shouldTakeSnapshot() is false or 
> true. It would be better to have aligned behavior. The proposal of this JIRA 
> is to ignore the ExecutionException when shouldTakeSnapshot() is true. 
> The above problem was reported by and co-analyzed with "Hao Guo". 
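
To illustrate the difference described above, here is a minimal, standalone sketch. It is not the Ratis code: it assumes the JDK's CompletableFuture.allOf as a stand-in for JavaUtils.allOf, and the class and method names (SnapshotCheckSketch, waitStrict, waitIgnoringFailures) are hypothetical. It only shows how one failed apply future surfaces as an ExecutionException from get(), and what ignoring it could look like.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical sketch; names are illustrative and not part of Ratis.
final class SnapshotCheckSketch {

  // Mirrors the current behavior: if any future completed exceptionally,
  // get() throws ExecutionException and the caller has to deal with it.
  static void waitStrict(List<CompletableFuture<String>> futures)
      throws ExecutionException, InterruptedException {
    // JDK allOf is used here as a stand-in (assumption) for JavaUtils.allOf.
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
  }

  // One possible shape of the proposal: still wait for all futures to
  // complete, but do not propagate an apply failure.
  // Returns true if a failure was ignored.
  static boolean waitIgnoringFailures(List<CompletableFuture<String>> futures)
      throws InterruptedException {
    try {
      waitStrict(futures);
      return false;
    } catch (ExecutionException e) {
      // The cause is the exception passed to completeExceptionally.
      System.out.println("Ignoring apply failure: " + e.getCause().getMessage());
      return true;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(
        new IllegalStateException("Cannot add container to pipeline in closed state"));
    List<CompletableFuture<String>> futures =
        Arrays.asList(CompletableFuture.completedFuture("ok"), failed);

    try {
      waitStrict(futures);
    } catch (ExecutionException e) {
      System.out.println("Strict wait threw: " + e.getCause());
    }
    System.out.println("Failure ignored: " + waitIgnoringFailures(futures));
  }
}
```

Whether the real fix should log the failure, count it in a metric, or skip the snapshot for that round is a design choice this sketch does not settle.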



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
