[jira] [Commented] (HDDS-8902) Close open container immediately on ICR of unhealthy replica

Stephen O'Donnell (Jira) Wed, 23 Aug 2023 03:39:06 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757954#comment-17757954
 ]


Stephen O'Donnell commented on HDDS-8902:
-----------------------------------------

I think this makes sense. In fact, the code change needed is very small. This 
only applies to containers in the OPEN state, and we really want them to 
transition to CLOSING and get handled as normal. Inside 
AbstractContainerReportHandler, the current code looks like:

{code}
    switch (container.getState()) {
    case OPEN:
      /*
       * If the state of a container is OPEN, datanodes cannot report
       * any other state.
       */
      if (replica.getState() != State.OPEN) {
        logger.warn("Container {} is in OPEN state, but the datanode {} " +
            "reports an {} replica.", containerId,
            datanode, replica.getState());
        // Should we take some action?
      }
      break;
{code}

Notice that someone added a comment thinking about taking action, but never 
followed up.

I think we can simply add a call like:

{code}
        containerManager.updateContainerState(containerId,
            LifeCycleEvent.FINALIZE);
{code}

SCM will then stop allocating to the container, and the replication manager 
(RM) will clean things up by sending any close commands etc. on its next run.
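
To make the placement of that call concrete, here is a minimal, self-contained 
sketch of the OPEN branch with the proposed call added. It is not the actual 
Ozone code: FakeContainerManager, the three enums, and the class name 
OpenContainerReportSketch are hypothetical stand-ins so the example compiles 
on its own, and the OPEN -> CLOSING effect of FINALIZE is modelled here as an 
assumption rather than taken from the real state machine.

```java
import java.util.HashMap;
import java.util.Map;

final class OpenContainerReportSketch {

  // Hypothetical stand-ins for the SCM container, replica and event types.
  enum ContainerState { OPEN, CLOSING, CLOSED }
  enum ReplicaState { OPEN, CLOSING, CLOSED, UNHEALTHY }
  enum LifeCycleEvent { FINALIZE }

  /** Hypothetical in-memory stand-in for SCM's container manager. */
  static final class FakeContainerManager {
    private final Map<Long, ContainerState> states = new HashMap<>();

    void addContainer(long containerId, ContainerState state) {
      states.put(containerId, state);
    }

    ContainerState getState(long containerId) {
      return states.get(containerId);
    }

    // Assumption: FINALIZE moves an OPEN container to CLOSING and is a
    // no-op for a container in any other state.
    void updateContainerState(long containerId, LifeCycleEvent event) {
      if (event == LifeCycleEvent.FINALIZE
          && states.get(containerId) == ContainerState.OPEN) {
        states.put(containerId, ContainerState.CLOSING);
      }
    }
  }

  /** Sketch of the OPEN branch with the proposed FINALIZE call added. */
  static void updateState(FakeContainerManager containerManager,
      long containerId, String datanode, ReplicaState replicaState) {
    // The container is expected to be known to the manager already.
    switch (containerManager.getState(containerId)) {
    case OPEN:
      if (replicaState != ReplicaState.OPEN) {
        System.out.printf("Container %d is in OPEN state, but the datanode "
            + "%s reports an %s replica.%n",
            containerId, datanode, replicaState);
        // Proposed action: finalize the container straight away instead of
        // waiting for the next replication manager run.
        containerManager.updateContainerState(containerId,
            LifeCycleEvent.FINALIZE);
      }
      break;
    default:
      // Other container states are out of scope for this sketch.
      break;
    }
  }

  public static void main(String[] args) {
    FakeContainerManager containerManager = new FakeContainerManager();
    containerManager.addContainer(1L, ContainerState.OPEN);
    updateState(containerManager, 1L, "dn-1", ReplicaState.UNHEALTHY);
    System.out.println("Container 1 is now "
        + containerManager.getState(1L));
  }
}
```

In this sketch a matching OPEN replica report leaves the container untouched; 
only a report of some other replica state triggers the FINALIZE event.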

> Close open container immediately on ICR of unhealthy replica
> ------------------------------------------------------------
>
>                 Key: HDDS-8902
>                 URL: https://issues.apache.org/jira/browse/HDDS-8902
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Ethan Rose
>            Assignee: Siddhant Sangwan
>            Priority: Major
>
> Currently, when a datanode's replica moves from open to unhealthy, it will 
> send an ICR to SCM. This is processed on the next run of the SCM replication 
> manager, which could be up to 5 minutes away by default. In the meantime, SCM 
> will continue to send writes to this open container, which will fail at the 
> datanode. The unhealthy replica could be handled more quickly if the 
> container is closed when the unhealthy ICR is processed in 
> [AbstractContainerReportHandler#updateState|https://github.com/apache/ozone/blob/0fcfe212e18efc620260f733b14db43e63e2ea08/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java#L262]


