Siyao Meng created HDDS-12150:
---------------------------------
Summary: Abnormal container states should not crash the SCM
ContainerReportHandler thread
Key: HDDS-12150
URL: https://issues.apache.org/jira/browse/HDDS-12150
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Affects Versions: 1.4.1
Reporter: Siyao Meng
Assignee: Siyao Meng
We observed a case where a full container report containing a single replica
in an abnormal state can crash the SCM leader's ContainerReportHandler thread.
The cause is that the Preconditions.checkArgument check throws a
RuntimeException (IllegalArgumentException) that isn't caught and handled:
{code:java|title=https://github.com/apache/ozone/blob/69ba680c515a519a2e2fef611efe151aa033d7cd/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java#L339-L340}
    case QUASI_CLOSED:
      /*
       * The container is in QUASI_CLOSED state, this means that at least
       * one of the replica was QUASI_CLOSED.
       *
       * Now replicas can be in any of the following state.
       *
       * 1. OPEN
       * 2. CLOSING
       * 3. QUASI_CLOSED
       * 4. CLOSED
       *
       * If at least one of the replica is in CLOSED state, mark the
       * container as CLOSED.
       *
       */
      if (replica.getState() == State.CLOSED) {
        logger.info("Moving container {} to CLOSED state, datanode {} " +
            "reported CLOSED replica.", containerId, datanode);
        Preconditions.checkArgument(replica.getBlockCommitSequenceId()
            == container.getSequenceId());
        containerManager.updateContainerState(containerId,
            LifeCycleEvent.FORCE_CLOSE);
      }
      break;
{code}
The uncaught exception leaves the rest of the container report unprocessed,
which leads to a huge number of MISSING containers being shown by
{{ozone admin container report}}.
But those containers are not actually missing. The container DB and blocks are
still on the datanode volumes/disks. It's just that those container reports are
not being processed, leading SCM to think they are missing.
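To illustrate why one bad replica hides all the replicas after it, here is a minimal standalone sketch. It does not use the real SCM classes: ReportAbortSketch, Replica, processUnguarded and run are hypothetical names, the local checkArgument only mimics a Guava-style precondition, and container IDs 1 and 2 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReportAbortSketch {

    /** Hypothetical stand-in for one replica entry in a full container report. */
    public static final class Replica {
        public final long containerId;
        public final long replicaBcsId; // sequence ID reported by the datanode
        public final long scmBcsId;     // sequence ID SCM holds for the container

        public Replica(long containerId, long replicaBcsId, long scmBcsId) {
            this.containerId = containerId;
            this.replicaBcsId = replicaBcsId;
            this.scmBcsId = scmBcsId;
        }
    }

    /** Mimics a Guava-style checkArgument: throws on a false condition. */
    public static void checkArgument(boolean condition) {
        if (!condition) {
            throw new IllegalArgumentException("precondition failed");
        }
    }

    /**
     * Walks the report with no per-replica error handling and records each
     * container ID that was processed. The first failed check aborts the loop.
     */
    public static void processUnguarded(List<Replica> report, List<Long> processed) {
        for (Replica r : report) {
            checkArgument(r.replicaBcsId == r.scmBcsId);
            processed.add(r.containerId);
        }
    }

    /** Runs the unguarded loop and returns the IDs processed before it stopped. */
    public static List<Long> run(List<Replica> report) {
        List<Long> processed = new ArrayList<>();
        try {
            processUnguarded(report, processed);
        } catch (IllegalArgumentException e) {
            // In the incident nothing caught this exception. It is caught here
            // only so that the sketch can return what was processed so far.
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Replica> report = Arrays.asList(
            new Replica(1L, 10L, 10L),        // hypothetical healthy replica
            new Replica(4071867L, 0L, 208L),  // abnormal replica from this issue
            new Replica(2L, 20L, 20L));       // hypothetical healthy replica
        System.out.println(run(report));      // prints [1]
    }
}
```

Container 2 is healthy but never reaches SCM's view because the loop stopped at 4071867, which matches the MISSING symptom described above.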
Repro (to be added as a test case):
1. SCM has container id 4071867 in QUASI_CLOSED state, bcsId = 208
2. A full container report from datanode 1 has a replica of container 4071867
in {color:red}CLOSED state, bcsId = 0{color}
3. Without the patch, container reports after the one above are NOT
processed, because the ContainerReportHandler thread crashed on the unhandled
exception
4. With the patch, a warning would be logged for the abnormal container
replica, but the other container reports would still be processed.
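The expected behaviour in step 4 could be sketched roughly as below. This is only one possible shape and not the actual patch: TolerantReportSketch, Replica, Result and process are hypothetical names, the real fix may instead catch the exception around per-replica processing, and container IDs 1 and 2 are made up to stand in for "the other container reports".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TolerantReportSketch {

    /** Hypothetical replica entry: container ID, reported state, reported bcsId. */
    public static final class Replica {
        public final long containerId;
        public final String state;
        public final long bcsId;

        public Replica(long containerId, String state, long bcsId) {
            this.containerId = containerId;
            this.state = state;
            this.bcsId = bcsId;
        }
    }

    /** Hypothetical outcome holder: processed container IDs plus logged warnings. */
    public static final class Result {
        public final List<Long> processed = new ArrayList<>();
        public final List<String> warnings = new ArrayList<>();
    }

    /**
     * Processes every replica in the report. A CLOSED replica of the
     * QUASI_CLOSED container whose bcsId differs from SCM's value is
     * skipped with a warning instead of aborting the whole report.
     */
    public static Result process(List<Replica> report,
                                 long quasiClosedContainerId,
                                 long scmBcsId) {
        Result result = new Result();
        for (Replica r : report) {
            boolean abnormal = r.containerId == quasiClosedContainerId
                && r.state.equals("CLOSED")
                && r.bcsId != scmBcsId;
            if (abnormal) {
                result.warnings.add("Skipping abnormal replica of container "
                    + r.containerId + ": reported bcsId " + r.bcsId
                    + ", SCM has " + scmBcsId);
                continue; // keep going with the rest of the report
            }
            result.processed.add(r.containerId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Replica> report = Arrays.asList(
            new Replica(1L, "CLOSED", 10L),       // hypothetical healthy replica
            new Replica(4071867L, "CLOSED", 0L),  // abnormal replica from step 2
            new Replica(2L, "CLOSED", 20L));      // hypothetical healthy replica
        Result result = process(report, 4071867L, 208L);
        System.out.println(result.processed);       // prints [1, 2]
        System.out.println(result.warnings.size()); // prints 1
    }
}
```

A test case for this issue would assert the same two things: the replicas after the abnormal one are still processed, and exactly one warning is produced for container 4071867.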