[ 
https://issues.apache.org/jira/browse/HDDS-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-12150:
------------------------------
       Fix Version/s: 2.0.0
    Target Version/s:   (was: 2.1.0)
          Resolution: Fixed
              Status: Resolved  (was: Patch Available)

> Abnormal container states should not crash the SCM ContainerReportHandler 
> thread
> --------------------------------------------------------------------------------
>
>                 Key: HDDS-12150
>                 URL: https://issues.apache.org/jira/browse/HDDS-12150
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 1.4.1
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>
> We observed a case where a full container report with a single abnormal 
> container state can crash the SCM leader's ContainerReportHandler thread.
> The cause is that the Preconditions.checkArgument call throws a 
> RuntimeException (IllegalArgumentException) that is not caught or handled:
> {code:java|title=https://github.com/apache/ozone/blob/69ba680c515a519a2e2fef611efe151aa033d7cd/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java#L339-L340}
>     case QUASI_CLOSED:
>       /*
>        * The container is in QUASI_CLOSED state, this means that at least
>        * one of the replica was QUASI_CLOSED.
>        *
>        * Now replicas can be in any of the following state.
>        *
>        * 1. OPEN
>        * 2. CLOSING
>        * 3. QUASI_CLOSED
>        * 4. CLOSED
>        *
>        * If at least one of the replica is in CLOSED state, mark the
>        * container as CLOSED.
>        *
>        */
>       if (replica.getState() == State.CLOSED) {
>         logger.info("Moving container {} to CLOSED state, datanode {} " +
>             "reported CLOSED replica.", containerId, datanode);
>         Preconditions.checkArgument(replica.getBlockCommitSequenceId()
>             == container.getSequenceId());
>         containerManager.updateContainerState(containerId,
>             LifeCycleEvent.FORCE_CLOSE);
>       }
>       break;
> {code}
> This leaves the rest of the container report unprocessed, which leads to a 
> huge number of MISSING containers in the output of {{ozone admin container 
> report}}.
> Those containers are not actually missing: the container DB and blocks 
> are still on the datanode volumes/disks. SCM only considers them missing 
> because their container reports are never processed.
> Repro (to be added as a test case):
> 1. SCM has container id 4071867 in QUASI_CLOSED state, bcsId = 208
> 2. A full container report from datanode 1 has a replica of container 4071867 
> in {color:red}CLOSED state, bcsId = 0{color}
> 3. Without the patch, other container reports after the above would NOT be 
> processed, because the ContainerReportHandler thread crashed on the 
> unhandled exception
> 4. With the patch, a warning would be logged for the abnormal container 
> replica, but the other container reports would still be processed.
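
The repro steps above can be modeled outside Ozone with a small, self-contained sketch. It is not the actual patch and uses none of the real SCM classes: ReportHandlerSketch, Replica, handleReplica, processUnguarded and processGuarded are hypothetical names, and container ids 4071868 and 4071869 with their bcsId values are made up for illustration. Only container 4071867 (SCM bcsId = 208, replica reported CLOSED with bcsId = 0) comes from the report. The sketch assumes the fix amounts to logging a warning for the bad replica and continuing; the real change may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; none of these names are Ozone classes.
final class ReportHandlerSketch {

  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

  /** Hypothetical stand-in for one replica entry in a full container report. */
  static final class Replica {
    final long containerId;
    final State state;
    final long bcsId;

    Replica(long containerId, State state, long bcsId) {
      this.containerId = containerId;
      this.state = state;
      this.bcsId = bcsId;
    }
  }

  /** Models the quoted QUASI_CLOSED branch: a CLOSED replica must match SCM's sequence id. */
  static void handleReplica(Replica replica, Map<Long, Long> scmSequenceIds,
      List<Long> closed) {
    if (replica.state == State.CLOSED) {
      long expected = scmSequenceIds.get(replica.containerId);
      if (replica.bcsId != expected) {
        // Same exception type that Preconditions.checkArgument throws.
        throw new IllegalArgumentException("container " + replica.containerId
            + ": replica bcsId " + replica.bcsId + " != SCM sequence id " + expected);
      }
      closed.add(replica.containerId);
    }
  }

  /** Unguarded loop: the first exception aborts the rest of the report. */
  static void processUnguarded(List<Replica> report,
      Map<Long, Long> scmSequenceIds, List<Long> closed) {
    for (Replica replica : report) {
      handleReplica(replica, scmSequenceIds, closed);
    }
  }

  /** Guarded loop: warn about the abnormal replica, then keep going. */
  static List<Long> processGuarded(List<Replica> report,
      Map<Long, Long> scmSequenceIds) {
    List<Long> closed = new ArrayList<>();
    for (Replica replica : report) {
      try {
        handleReplica(replica, scmSequenceIds, closed);
      } catch (IllegalArgumentException e) {
        System.err.println("WARN: skipping abnormal replica: " + e.getMessage());
      }
    }
    return closed;
  }

  /** SCM-side sequence ids; only 4071867 -> 208 comes from the report. */
  static Map<Long, Long> sampleScmState() {
    Map<Long, Long> ids = new HashMap<>();
    ids.put(4071867L, 208L);
    ids.put(4071868L, 12L);
    ids.put(4071869L, 7L);
    return ids;
  }

  /** Full report whose first entry is the abnormal CLOSED replica with bcsId = 0. */
  static List<Replica> sampleReport() {
    return Arrays.asList(
        new Replica(4071867L, State.CLOSED, 0L),
        new Replica(4071868L, State.CLOSED, 12L),
        new Replica(4071869L, State.CLOSED, 7L));
  }

  public static void main(String[] args) {
    List<Long> closed = new ArrayList<>();
    try {
      processUnguarded(sampleReport(), sampleScmState(), closed);
    } catch (IllegalArgumentException e) {
      System.out.println("unguarded run aborted: " + e.getMessage());
    }
    System.out.println("unguarded closed: " + closed);
    System.out.println("guarded closed: "
        + processGuarded(sampleReport(), sampleScmState()));
  }
}
```

In this model the unguarded run stops at the first replica and closes nothing, while the guarded run skips container 4071867 with a warning and still handles the two healthy replicas. That matches the difference between steps 3 and 4 of the repro.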



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
