[
https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sammi Chen updated RATIS-2315:
------------------------------
Description:
When SCM Raft re-applies log entries during restart, SCMStateMachine#applyTransaction
could execute
{code:java}
applyTransactionFuture.completeExceptionally(ex);
{code}
for a ContainerStateManagerImpl#addContainer operation if it fails at
{code:java}
pipelineManager.addContainerToPipeline(pipelineID, containerID);
{code}
The failure message looks like "Cannot add container to
pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state".
This didn't crash the SCM when it happened after the SCM had started and was
running. It also didn't crash every SCM peer in the Raft group. The root cause is in
StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot:
{code:java}
private void checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures)
    throws ExecutionException, InterruptedException {
  // check if need to trigger a snapshot
  if (shouldTakeSnapshot()) {
    if (futures.isInitialized()) {
      JavaUtils.allOf(futures.get()).get();
    }
    takeSnapshot();
  }
}
{code}
When shouldTakeSnapshot() is false, the method ignores the results of the futures.
When shouldTakeSnapshot() is true and one of the futures completed exceptionally,
checkAndTakeSnapshot throws an ExecutionException, which in turn shuts down the
Raft server in StateMachineUpdater#run.
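The mechanism can be reproduced with plain JDK futures. The following is a minimal sketch, not Ratis code: it assumes CompletableFuture.allOf as a stand-in for JavaUtils.allOf, and the class and method names (AllOfFailureDemo, waitThrows, sampleFutures) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

class AllOfFailureDemo {
  /** Returns true if waiting on the combined future throws ExecutionException. */
  static boolean waitThrows(List<CompletableFuture<String>> futures) {
    try {
      // Assumed stand-in for JavaUtils.allOf(futures.get()).get().
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
      return false;
    } catch (ExecutionException e) {
      // At least one future was completed exceptionally.
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  /** Builds two "apply" futures; the second one optionally fails. */
  static List<CompletableFuture<String>> sampleFutures(boolean withFailure) {
    CompletableFuture<String> first = CompletableFuture.completedFuture("applied");
    CompletableFuture<String> second = new CompletableFuture<>();
    if (withFailure) {
      second.completeExceptionally(new IllegalStateException("pipeline in closed state"));
    } else {
      second.complete("applied");
    }
    return Arrays.asList(first, second);
  }

  public static void main(String[] args) {
    System.out.println("all succeeded, wait throws: " + waitThrows(sampleFutures(false)));
    System.out.println("one failed, wait throws: " + waitThrows(sampleFutures(true)));
  }
}
```

A single exceptionally completed future is enough to make the combined get() throw, which matches the shutdown path described above.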
So the behavior differs depending on whether shouldTakeSnapshot() is false or true.
It would be better to align the two cases. The proposal of this JIRA is to ignore the
ExecutionException when shouldTakeSnapshot() is true.
The above problem was reported by and co-analyzed with "Hao Guo".
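One possible shape of the proposal is sketched below using only JDK classes. This is an illustration under assumptions, not the actual patch: SnapshotCheckSketch, runOnce and the two counters are hypothetical, a boolean field stands in for shouldTakeSnapshot(), a counter stands in for takeSnapshot(), and CompletableFuture.allOf stands in for JavaUtils.allOf. Whether the snapshot should still be taken after a failed future is also an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

class SnapshotCheckSketch {
  private final boolean shouldTakeSnapshot; // stand-in for shouldTakeSnapshot()
  int snapshotsTaken = 0;                   // stand-in for takeSnapshot()
  int ignoredFailures = 0;

  SnapshotCheckSketch(boolean shouldTakeSnapshot) {
    this.shouldTakeSnapshot = shouldTakeSnapshot;
  }

  void checkAndTakeSnapshot(List<CompletableFuture<String>> futures)
      throws InterruptedException {
    if (shouldTakeSnapshot) {
      try {
        // Still wait for the pending apply futures before snapshotting.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
      } catch (ExecutionException e) {
        // Proposed change: record the failure instead of propagating it.
        ignoredFailures++;
        System.err.println("Ignoring failed apply future: " + e.getCause());
      }
      snapshotsTaken++;
    }
  }

  /** Hypothetical driver: returns {snapshotsTaken, ignoredFailures} after one check. */
  static int[] runOnce(boolean shouldTakeSnapshot, boolean withFailure) {
    CompletableFuture<String> second = new CompletableFuture<>();
    if (withFailure) {
      second.completeExceptionally(new IllegalStateException("pipeline in closed state"));
    } else {
      second.complete("applied");
    }
    SnapshotCheckSketch sketch = new SnapshotCheckSketch(shouldTakeSnapshot);
    try {
      sketch.checkAndTakeSnapshot(
          Arrays.asList(CompletableFuture.completedFuture("applied"), second));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new int[] {sketch.snapshotsTaken, sketch.ignoredFailures};
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(runOnce(true, true)));  // [1, 1]
    System.out.println(Arrays.toString(runOnce(false, true))); // [0, 0]
  }
}
```

With this shape, a failed future no longer propagates out of the snapshot check in either case, so the true and false branches behave consistently.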
> Ignore ExecutionException during take checkpoint check
> ------------------------------------------------------
>
> Key: RATIS-2315
> URL: https://issues.apache.org/jira/browse/RATIS-2315
> Project: Ratis
> Issue Type: Improvement
> Components: StateMachine
> Reporter: Sammi Chen
> Priority: Major