Siyao Meng created HDDS-12150:
---------------------------------

             Summary: Abnormal container states should not crash the SCM 
ContainerReportHandler thread
                 Key: HDDS-12150
                 URL: https://issues.apache.org/jira/browse/HDDS-12150
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
    Affects Versions: 1.4.1
            Reporter: Siyao Meng
            Assignee: Siyao Meng


We observed a case where a full container report containing a single container 
in an abnormal state can crash the SCM leader's ContainerReportHandler thread.

The reason is that the Preconditions.checkArgument call throws a 
RuntimeException (IllegalArgumentException) that isn't caught and handled 
properly:

{code:java|title=https://github.com/apache/ozone/blob/69ba680c515a519a2e2fef611efe151aa033d7cd/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java#L339-L340}
    case QUASI_CLOSED:
      /*
       * The container is in QUASI_CLOSED state, this means that at least
       * one of the replica was QUASI_CLOSED.
       *
       * Now replicas can be in any of the following state.
       *
       * 1. OPEN
       * 2. CLOSING
       * 3. QUASI_CLOSED
       * 4. CLOSED
       *
       * If at least one of the replica is in CLOSED state, mark the
       * container as CLOSED.
       *
       */
      if (replica.getState() == State.CLOSED) {
        logger.info("Moving container {} to CLOSED state, datanode {} " +
            "reported CLOSED replica.", containerId, datanode);
        Preconditions.checkArgument(replica.getBlockCommitSequenceId()
            == container.getSequenceId());
        containerManager.updateContainerState(containerId,
            LifeCycleEvent.FORCE_CLOSE);
      }
      break;
{code}

The uncaught exception causes the rest of the container report to be left 
unprocessed. That leads to a huge number of MISSING containers seen in 
{{ozone admin container report}}.

But those containers are not actually missing. The container DB and blocks are 
still on the datanode volumes/disks. It's just that those container reports are 
not being processed, leading SCM to think they are missing.
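
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not Ozone code: {{ReportLoopSketch}}, {{Replica}}, {{processReplica}} and {{processReport}} are hypothetical names, every replica is assumed to be a CLOSED replica of a QUASI_CLOSED container, and container IDs 4071868 and 4071869 with their sequence IDs are made up. It only shows how one strict check that throws stops the loop, and how catching the exception per replica lets the rest of the report through.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of report processing; not Ozone code.
final class ReportLoopSketch {

  /** One replica entry in a full container report (assumed CLOSED). */
  static final class Replica {
    final long containerId;
    final long bcsId;

    Replica(long containerId, long bcsId) {
      this.containerId = containerId;
      this.bcsId = bcsId;
    }
  }

  /** Strict check in the style of Preconditions.checkArgument. */
  static void processReplica(Replica r, Map<Long, Long> scmSequenceIds) {
    long expected = scmSequenceIds.get(r.containerId);
    if (r.bcsId != expected) {
      throw new IllegalArgumentException("container " + r.containerId
          + ": bcsId " + r.bcsId + " != " + expected);
    }
  }

  /**
   * Walks the report and records each processed container ID in processed.
   * With isolate == false the first failure propagates and the remaining
   * replicas are never reached.
   */
  static void processReport(List<Replica> report,
      Map<Long, Long> scmSequenceIds, boolean isolate, List<Long> processed) {
    for (Replica r : report) {
      try {
        processReplica(r, scmSequenceIds);
        processed.add(r.containerId);
      } catch (IllegalArgumentException e) {
        if (!isolate) {
          throw e;
        }
        System.err.println("WARN: skipping abnormal replica: "
            + e.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    // SCM's view: container ID -> sequence ID.
    Map<Long, Long> scm = new HashMap<>();
    scm.put(4071867L, 208L);
    scm.put(4071868L, 17L); // hypothetical container
    scm.put(4071869L, 42L); // hypothetical container

    List<Replica> report = new ArrayList<>();
    report.add(new Replica(4071867L, 0L)); // abnormal: bcsId 0 vs 208
    report.add(new Replica(4071868L, 17L));
    report.add(new Replica(4071869L, 42L));

    List<Long> unguarded = new ArrayList<>();
    try {
      processReport(report, scm, false, unguarded);
    } catch (IllegalArgumentException e) {
      System.out.println("aborted: " + e.getMessage());
    }
    System.out.println("unguarded processed: " + unguarded);

    List<Long> guarded = new ArrayList<>();
    processReport(report, scm, true, guarded);
    System.out.println("guarded processed: " + guarded);
  }
}
```

In the unguarded run nothing after the abnormal replica is recorded, which is the situation in which SCM would consider the healthy containers MISSING. Whether the real fix catches the exception at this level or removes the throwing check is not stated here.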


Repro (to be added as a test case):

1. SCM has container id 4071867 in QUASI_CLOSED state, bcsId = 208
2. A full container report from datanode 1 has a replica of container 4071867 
in {color:red}CLOSED state, bcsId = 0{color}
3. Without the patch, container reports after the one above would NOT be 
processed, because the ContainerReportHandler thread crashed due to the 
unhandled exception
4. With the patch, a warning would be logged for the abnormal container 
replica, but the other container reports would still be processed.
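
One possible shape for the behaviour in step 4, sketched under assumptions: the throwing precondition in the QUASI_CLOSED branch is replaced by an explicit comparison that logs a warning and skips the FORCE_CLOSE transition. {{BcsIdGuardSketch}}, {{ReplicaState}} and {{canForceClose}} are hypothetical stand-ins, not the real Ozone classes or the actual patch, and {{System.err}} stands in for the logger.

```java
// Hypothetical stand-ins for the SCM types; not the real Ozone classes.
final class BcsIdGuardSketch {

  enum ReplicaState { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

  /**
   * Decides whether a QUASI_CLOSED container may be force-closed based on
   * one reported replica. Returns false (and logs a warning) instead of
   * throwing when the replica's bcsId disagrees with the container's
   * sequence ID.
   */
  static boolean canForceClose(long containerId, ReplicaState replicaState,
      long replicaBcsId, long containerSequenceId) {
    if (replicaState != ReplicaState.CLOSED) {
      // Only a CLOSED replica can move the container to CLOSED.
      return false;
    }
    if (replicaBcsId != containerSequenceId) {
      System.err.printf(
          "WARN: container %d reported CLOSED with bcsId %d, "
              + "expected %d; skipping FORCE_CLOSE%n",
          containerId, replicaBcsId, containerSequenceId);
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // Values from the repro: SCM has bcsId 208, the replica reports 0.
    System.out.println(
        canForceClose(4071867L, ReplicaState.CLOSED, 0L, 208L));
    // A matching replica is still allowed to close the container.
    System.out.println(
        canForceClose(4071867L, ReplicaState.CLOSED, 208L, 208L));
  }
}
```

With a guard like this, the caller would invoke the state update only when the method returns true, so an abnormal replica leaves the container in QUASI_CLOSED and the handler moves on to the next entry in the report.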



