[ https://issues.apache.org/jira/browse/HDDS-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siyao Meng updated HDDS-12150:
------------------------------
Fix Version/s: 2.0.0
Target Version/s: (was: 2.1.0)
Resolution: Fixed
Status: Resolved (was: Patch Available)
> Abnormal container states should not crash the SCM ContainerReportHandler
> thread
> --------------------------------------------------------------------------------
>
> Key: HDDS-12150
> URL: https://issues.apache.org/jira/browse/HDDS-12150
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Affects Versions: 1.4.1
> Reporter: Siyao Meng
> Assignee: Siyao Meng
> Priority: Critical
> Labels: pull-request-available
> Fix For: 2.0.0
>
>
> We observed a case where a full container report containing a single container
> replica in an abnormal state can crash the SCM leader's ContainerReportHandler
> thread.
> The cause is that the Preconditions check below throws a RuntimeException
> (IllegalArgumentException) that is not caught or handled:
> {code:java|title=https://github.com/apache/ozone/blob/69ba680c515a519a2e2fef611efe151aa033d7cd/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java#L339-L340}
> case QUASI_CLOSED:
> /*
> * The container is in QUASI_CLOSED state, this means that at least
> * one of the replica was QUASI_CLOSED.
> *
> * Now replicas can be in any of the following state.
> *
> * 1. OPEN
> * 2. CLOSING
> * 3. QUASI_CLOSED
> * 4. CLOSED
> *
> * If at least one of the replica is in CLOSED state, mark the
> * container as CLOSED.
> *
> */
> if (replica.getState() == State.CLOSED) {
> logger.info("Moving container {} to CLOSED state, datanode {} " +
> "reported CLOSED replica.", containerId, datanode);
> Preconditions.checkArgument(replica.getBlockCommitSequenceId()
> == container.getSequenceId());
> containerManager.updateContainerState(containerId,
> LifeCycleEvent.FORCE_CLOSE);
> }
> break;
> {code}
> As a result, the rest of the container report is left unprocessed, which leads
> to a huge number of MISSING containers in the output of {{ozone admin container
> report}}.
> Those containers are not actually missing, though. The container DB and blocks
> are still on the datanode volumes/disks; SCM only considers them missing because
> their container reports are not being processed.
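To illustrate why one bad replica hides all the others, here is a minimal, self-contained sketch. It is not the Ozone code: the class `AbortedReportSketch`, the local `checkArgument` stand-in, and the `long[]` replica layout are assumptions made only for this example. It shows that an unchecked exception thrown inside a per-replica loop skips every replica that comes after it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: none of these names come from the Ozone code base. */
public class AbortedReportSketch {

    /** Local stand-in for a precondition check: throws IllegalArgumentException when false. */
    static void checkArgument(boolean condition) {
        if (!condition) {
            throw new IllegalArgumentException("precondition failed");
        }
    }

    /**
     * Handles replicas in order and appends each handled container ID to {@code handled}.
     * Each entry is {containerId, replicaBcsId, containerSequenceId}.
     */
    static void processReport(List<long[]> replicas, List<Long> handled) {
        for (long[] r : replicas) {
            checkArgument(r[1] == r[2]); // a single mismatch throws out of the loop
            handled.add(r[0]);
        }
    }

    public static void main(String[] args) {
        List<long[]> report = Arrays.asList(
            new long[] {1L, 5L, 5L},          // consistent replica (hypothetical values)
            new long[] {4071867L, 0L, 208L},  // abnormal replica: bcsId 0 vs. sequence ID 208
            new long[] {3L, 7L, 7L});         // consistent, but never reached
        List<Long> handled = new ArrayList<>();
        try {
            processReport(report, handled);
        } catch (IllegalArgumentException e) {
            // Caught here only so the sketch can print how far processing got.
        }
        System.out.println("handled before failure: " + handled); // handled before failure: [1]
    }
}
```

Container 3 is perfectly healthy in this sketch, yet it is never handled, which mirrors how unprocessed replicas end up being counted as MISSING.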
> Repro (to be added as a test case):
> 1. SCM has container id 4071867 in QUASI_CLOSED state, bcsId = 208
> 2. A full container report from datanode 1 has a replica of container 4071867
> in {color:red}CLOSED state, bcsId = 0{color}
> 3. Without the patch, the container reports that follow the one above are NOT
> processed, because the ContainerReportHandler thread has crashed due to the
> unhandled exception
> 4. With the patch, a warning is logged for the abnormal container
> replica, and the other container reports are still processed.
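The repro steps above can be sketched as a small standalone program. This is a simplified model under stated assumptions, not the actual patch or the SCM API: the types `Container` and `Replica`, the method `processReport`, the second container ID 4071868, and its bcsId of 42 are all hypothetical. Warnings are collected in a list here, where the real handler would presumably log them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the repro; not the actual patch or Ozone API. */
public class AbnormalReplicaReproSketch {
    enum State { QUASI_CLOSED, CLOSED }

    /** SCM-side container record (hypothetical type). */
    static final class Container {
        State state;
        final long sequenceId;
        Container(State state, long sequenceId) {
            this.state = state;
            this.sequenceId = sequenceId;
        }
    }

    /** Replica entry from a datanode's full container report (hypothetical type). */
    static final class Replica {
        final long containerId;
        final State state;
        final long bcsId;
        Replica(long containerId, State state, long bcsId) {
            this.containerId = containerId;
            this.state = state;
            this.bcsId = bcsId;
        }
    }

    /** Applies a report; records a warning instead of throwing on a bcsId mismatch. */
    static List<String> processReport(Map<Long, Container> containers, List<Replica> report) {
        List<String> warnings = new ArrayList<>();
        for (Replica r : report) {
            Container c = containers.get(r.containerId);
            if (c == null || c.state != State.QUASI_CLOSED || r.state != State.CLOSED) {
                continue; // only the QUASI_CLOSED -> CLOSED path is modelled here
            }
            if (r.bcsId != c.sequenceId) {
                warnings.add("Container " + r.containerId + ": CLOSED replica bcsId "
                    + r.bcsId + " does not match container sequence ID " + c.sequenceId);
                continue; // skip the abnormal replica and keep processing the rest
            }
            c.state = State.CLOSED;
        }
        return warnings;
    }

    public static void main(String[] args) {
        Map<Long, Container> containers = new LinkedHashMap<>();
        containers.put(4071867L, new Container(State.QUASI_CLOSED, 208L)); // step 1
        containers.put(4071868L, new Container(State.QUASI_CLOSED, 42L));  // hypothetical neighbour

        List<Replica> report = new ArrayList<>();
        report.add(new Replica(4071867L, State.CLOSED, 0L));  // step 2: abnormal replica
        report.add(new Replica(4071868L, State.CLOSED, 42L)); // normal replica after it

        List<String> warnings = processReport(containers, report);
        System.out.println(warnings.size());                   // 1
        System.out.println(containers.get(4071867L).state);    // QUASI_CLOSED
        System.out.println(containers.get(4071868L).state);    // CLOSED
    }
}
```

In this model the abnormal replica leaves container 4071867 in QUASI_CLOSED and produces one warning, while the replica that follows it is still applied, which is the behaviour step 4 describes.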