Sammi Chen created RATIS-2315:
---------------------------------
Summary: Ignore ExecutionException during take checkpoint check
Key: RATIS-2315
URL: https://issues.apache.org/jira/browse/RATIS-2315
Project: Ratis
Issue Type: Improvement
Reporter: Sammi Chen
When the SCM replays the Raft log during restart, SCMStateMachine#applyTransaction could
execute
{code:java}
applyTransactionFuture.completeExceptionally(ex);
{code}
for the ContainerStateManagerImpl#addContainer operation once it fails at
{code:java}
pipelineManager.addContainerToPipeline(pipelineID, containerID);
{code}
The failure message looks like "Cannot add container to
pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state".
This doesn't crash the SCM if it happens after the SCM has started and is running. It
also doesn't crash every peer of the SCM in the Raft group. The root cause is in
StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot:
{code:java}
private void checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures)
    throws ExecutionException, InterruptedException {
  // check if need to trigger a snapshot
  if (shouldTakeSnapshot()) {
    if (futures.isInitialized()) {
      JavaUtils.allOf(futures.get()).get();
    }
    takeSnapshot();
  }
}
{code}
When shouldTakeSnapshot() is false, the results of the futures are never checked.
When shouldTakeSnapshot() is true and one of the futures completed exceptionally,
checkAndTakeSnapshot throws ExecutionException, which in turn shuts down the
Raft server in StateMachineUpdater#run.
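The difference between the two cases can be reproduced outside Ratis. The following is only a minimal sketch: SnapshotGateDemo and outcome are made-up names, CompletableFuture.allOf from the JDK stands in for JavaUtils.allOf, and the MemoizedSupplier#isInitialized check is left out for brevity.
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class SnapshotGateDemo {
  /** Same shape as checkAndTakeSnapshot; returns true if the snapshot step is reached. */
  static boolean checkAndTakeSnapshot(boolean shouldTakeSnapshot,
      List<CompletableFuture<String>> futures)
      throws ExecutionException, InterruptedException {
    if (shouldTakeSnapshot) {
      // Assumption: JDK stand-in for JavaUtils.allOf(futures.get()).get()
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
      return true; // takeSnapshot() would run here
    }
    return false; // the futures are never inspected on this path
  }

  /** Runs the check with one successful and one failed future. */
  static String outcome(boolean shouldTakeSnapshot) throws InterruptedException {
    CompletableFuture<String> ok = CompletableFuture.completedFuture("applied");
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IllegalStateException("pipeline in closed state"));
    try {
      return checkAndTakeSnapshot(shouldTakeSnapshot, Arrays.asList(ok, failed))
          ? "snapshot" : "skipped";
    } catch (ExecutionException e) {
      return "ExecutionException: " + e.getCause().getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(outcome(false)); // failed future goes unnoticed
    System.out.println(outcome(true));  // same failed future surfaces as ExecutionException
  }
}
{code}
With identical futures, only the shouldTakeSnapshot == true path observes the failure.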
So the behavior differs depending on whether shouldTakeSnapshot() returns false
or true. It would be better to align the two cases. The proposal of this JIRA is to
ignore the ExecutionException when shouldTakeSnapshot() is true.
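One possible shape of the proposed change is sketched below. It is not the actual patch: TolerantSnapshotCheck is a made-up class, CompletableFuture.allOf stands in for JavaUtils.allOf, and the sketch assumes the snapshot should still be taken after a failed future.
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class TolerantSnapshotCheck {
  /** Returns true if the snapshot step was reached. */
  static boolean checkAndTakeSnapshot(boolean shouldTakeSnapshot,
      List<CompletableFuture<String>> futures) throws InterruptedException {
    if (!shouldTakeSnapshot) {
      return false;
    }
    try {
      // Still wait for every pending applyTransaction future to complete.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
    } catch (ExecutionException e) {
      // Assumption: log and continue instead of propagating to the caller.
      System.err.println("Ignoring failed applyTransaction future: " + e.getCause());
    }
    return true; // takeSnapshot() would run here
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<String> ok = CompletableFuture.completedFuture("applied");
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IllegalStateException("pipeline in closed state"));
    System.out.println(checkAndTakeSnapshot(true, Arrays.asList(ok, failed)));
  }
}
{code}
InterruptedException is still propagated, so an interrupt of the updater thread is not swallowed along with the apply failure.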