[ https://issues.apache.org/jira/browse/HDDS-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elek, Marton updated HDDS-1284:
-------------------------------
    Status: Patch Available  (was: Open)

> Adjust default values of pipeline recovery for more resilient service restart
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-1284
>                 URL: https://issues.apache.org/jira/browse/HDDS-1284
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Critical
>
> As of now we have the following algorithm to handle node failures (a rough 
> sketch follows the list):
> 1. In case of a missing node, the leader of the pipeline or the SCM can 
> detect the missing heartbeats.
> 2. SCM will start to close the pipeline (CLOSING state) and try to close the 
> containers with the remaining nodes in the pipeline.
> 3. After 5 minutes the pipeline will be destroyed (CLOSED) and a new pipeline 
> can be created from the healthy nodes (one node can be part of only one 
> pipeline at a time).
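> A minimal, hypothetical sketch of this flow (illustrative names only, not the 
> actual SCM classes; the two constants stand in for ozone.scm.stale.node.interval 
> and ozone.scm.pipeline.destroy.timeout):
> {code}
> // Sketch of the pipeline recovery flow described above (hypothetical names).
> enum PipelineState { OPEN, CLOSING, CLOSED }
> 
> class PipelineRecoverySketch {
>   // assumed equivalents of the current defaults
>   private static final long STALE_NODE_INTERVAL_MS = 90_000;   // 90s
>   private static final long DESTROY_TIMEOUT_MS = 300_000;      // 5 minutes
> 
>   private PipelineState state = PipelineState.OPEN;
>   private long closingSinceMs;
> 
>   // steps 1-2: a missed heartbeat marks the node stale and moves the
>   // pipeline to CLOSING
>   void checkHeartbeat(long lastHeartbeatMs, long nowMs) {
>     if (state == PipelineState.OPEN
>         && nowMs - lastHeartbeatMs > STALE_NODE_INTERVAL_MS) {
>       state = PipelineState.CLOSING;  // close containers on the remaining nodes
>       closingSinceMs = nowMs;
>     }
>   }
> 
>   // step 3: only after the destroy timeout is the pipeline CLOSED and its
>   // nodes are free to join a new pipeline
>   void checkDestroyTimeout(long nowMs) {
>     if (state == PipelineState.CLOSING
>         && nowMs - closingSinceMs > DESTROY_TIMEOUT_MS) {
>       state = PipelineState.CLOSED;
>     }
>   }
> }
> {code}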
> While this algorithm can work well on a big cluster, it doesn't provide very 
> good usability on small clusters:
> Use case 1:
> Given 3 nodes, in case of a service restart, if the restart takes more than 
> 90s, the pipeline will be moved to the CLOSING state. For the next 5 minutes 
> (ozone.scm.pipeline.destroy.timeout) the pipeline will remain in the CLOSING 
> state. As there are no more nodes and we can't assign the same node to two 
> different pipelines, the cluster will be unavailable for 5 minutes.
> Use case 2:
> Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3 
> racks. Let's stop one rack. As all the pipelines are affected, all the 
> pipelines will be moved to the CLOSING state. We have no free nodes; 
> therefore we need to wait for 5 minutes to write any data to the cluster.
> These problems can be solved in multiple ways:
> 1.) Instead of waiting 5 minutes, destroy the pipeline when all the 
> containers are reported to be closed. (Most of the time this is enough, but 
> some container reports can be missing; see the sketch after this list.)
> 2.) Support multi-raft and open a pipeline as soon as we have enough nodes 
> (even if the nodes already have CLOSING pipelines).
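> A hypothetical sketch of option 1 (illustrative names, not the real SCM API): 
> destroy a CLOSING pipeline as soon as every container on it has been reported 
> closed, keeping the destroy timeout only as a fallback for lost reports.
> {code}
> import java.util.HashSet;
> import java.util.Set;
> 
> class EarlyPipelineDestroySketch {
>   private final Set<Long> containersOnPipeline;   // container IDs on this pipeline
>   private final Set<Long> reportedClosed = new HashSet<>();
>   private boolean destroyed = false;
> 
>   EarlyPipelineDestroySketch(Set<Long> containersOnPipeline) {
>     this.containersOnPipeline = containersOnPipeline;
>   }
> 
>   // called for each container close report received from the datanodes
>   void onContainerCloseReport(long containerId) {
>     reportedClosed.add(containerId);
>     if (!destroyed && reportedClosed.containsAll(containersOnPipeline)) {
>       destroyPipeline();   // free the nodes without waiting for the full timeout
>     }
>   }
> 
>   private void destroyPipeline() {
>     destroyed = true;      // CLOSED: the nodes can join a new pipeline
>   }
> }
> {code}
> If a close report is lost, this condition never fires, which is why the 
> timeout is still needed as a fallback (the missing-report case mentioned above).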
> Both options require more work on the pipeline management side. For 0.4.0 
> we can adjust the following parameters to get a better user experience:
> {code}
>   <property>
>     <name>ozone.scm.pipeline.destroy.timeout</name>
>     <value>60s</value>
>     <tag>OZONE, SCM, PIPELINE</tag>
>     <description>
>       Once a pipeline is closed, SCM should wait for the above configured time
>       before destroying a pipeline.
>     </description>
>   </property>
>   <property>
>     <name>ozone.scm.stale.node.interval</name>
>     <value>90s</value>
>     <tag>OZONE, MANAGEMENT</tag>
>     <description>
>       The interval for stale node flagging. Please
>       see ozone.scm.heartbeat.thread.interval before changing this value.
>     </description>
>   </property>
> {code}
> First of all, we can be more optimistic and mark a node as stale only after 5 
> minutes instead of 90s. 5 minutes should be enough most of the time to recover 
> the nodes.
> Second: we can decrease the time of ozone.scm.pipeline.destroy.timeout. 
> Ideally the close command is sent by the SCM to the datanode with a heartbeat 
> (HB). Between two HBs we have enough time to close all the containers via 
> Ratis. With the next HB, the datanode can report the successful close. (If the 
> containers can't be closed, the SCM can manage the QUASI_CLOSED containers.)
> We need to wait 29 seconds (worst case) for the next HB, and 29+30 seconds 
> for the confirmation. --> 66 seconds seems to be a safe choice (assuming that 
> 6 seconds is enough to process the report about the successful closing).
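> Back-of-the-envelope worst case, assuming a ~30 second heartbeat interval 
> (consistent with the waits above):
> {code}
>   next HB carries the close command        : ~30 s  (worst case)
>   following HB carries the close report    : ~30 s
>   margin for SCM to process the report     :   6 s
>   -------------------------------------------------
>   safe ozone.scm.pipeline.destroy.timeout  : ~66 s
> {code}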



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
