[ https://issues.apache.org/jira/browse/HDDS-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoyu Yao updated HDDS-1284:
-----------------------------
    Target Version/s: 0.5.0  (was: 0.4.0)

> Adjust default values of pipeline recovery for more resilient service restart
> ------------------------------------------------------------------------------
>
>                 Key: HDDS-1284
>                 URL: https://issues.apache.org/jira/browse/HDDS-1284
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.4.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As of now we have the following algorithm to handle node failures:
> 1. In case of a missing node, the leader of the pipeline or the SCM can detect the missing heartbeats.
> 2. SCM starts to close the pipeline (CLOSING state) and tries to close the containers with the remaining nodes of the pipeline.
> 3. After 5 minutes the pipeline is destroyed (CLOSED) and a new pipeline can be created from the healthy nodes (one node can be part of only one pipeline at a time).
> While this algorithm can work well on a big cluster, it doesn't provide very good usability on small clusters:
>
> Use case 1:
> Given 3 nodes, in case of a service restart, if the restart takes more than 90s, the pipeline is moved to the CLOSING state. For the next 5 minutes (ozone.scm.pipeline.destroy.timeout) the container remains in the CLOSING state. As there are no more nodes and we can't assign the same node to two different pipelines, the cluster is unavailable for 5 minutes.
>
> Use case 2:
> Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3 racks. Let's stop one rack. As all the pipelines are affected, all of them are moved to the CLOSING state. We have no free nodes, therefore we need to wait 5 minutes before any data can be written to the cluster.
>
> These problems can be solved in multiple ways:
> 1.) Instead of waiting 5 minutes, destroy the pipeline when all of its containers are reported to be closed. (Most of the time this is enough, but some container reports can be missing.)
> 2.) Support multi-raft and open a new pipeline as soon as we have enough nodes (even if the nodes still belong to CLOSING pipelines).
> Both options require more work on the pipeline management side. For 0.4.0 we can adjust the following parameters to get a better user experience:
> {code}
> <property>
>   <name>ozone.scm.pipeline.destroy.timeout</name>
>   <value>60s</value>
>   <tag>OZONE, SCM, PIPELINE</tag>
>   <description>
>     Once a pipeline is closed, SCM should wait for the above configured time
>     before destroying a pipeline.
>   </description>
> </property>
> <property>
>   <name>ozone.scm.stale.node.interval</name>
>   <value>90s</value>
>   <tag>OZONE, MANAGEMENT</tag>
>   <description>
>     The interval for stale node flagging. Please
>     see ozone.scm.heartbeat.thread.interval before changing this value.
>   </description>
> </property>
> {code}
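> For illustration, a minimal self-contained Java sketch of the CLOSING -> CLOSED transition that ozone.scm.pipeline.destroy.timeout controls (class and method names here are hypothetical, not the actual SCM implementation):
> {code:java}
> // Sketch only: once SCM starts closing a pipeline, its nodes stay
> // unusable for new pipelines until the destroy timeout fires.
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> enum PipelineState { OPEN, CLOSING, CLOSED }
>
> class Pipeline {
>   final String id;
>   volatile PipelineState state = PipelineState.OPEN;
>   Pipeline(String id) { this.id = id; }
> }
>
> class PipelineDestroyScheduler {
>   private final ScheduledExecutorService scheduler =
>       Executors.newSingleThreadScheduledExecutor();
>   private final long destroyTimeoutSec; // ozone.scm.pipeline.destroy.timeout
>
>   PipelineDestroyScheduler(long destroyTimeoutSec) {
>     this.destroyTimeoutSec = destroyTimeoutSec;
>   }
>
>   /** Called when SCM detects a stale node in the pipeline. */
>   void closePipeline(Pipeline p) {
>     p.state = PipelineState.CLOSING; // containers are being closed via Ratis
>     scheduler.schedule(() -> {
>       // After the timeout the pipeline is destroyed and its nodes are
>       // free to join a new pipeline again.
>       p.state = PipelineState.CLOSED;
>     }, destroyTimeoutSec, TimeUnit.SECONDS);
>   }
> }
> {code}
> With the 5-minute default, both use cases above sit in exactly this scheduled delay; lowering the timeout shortens that unavailability window.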
> First of all, we can be more optimistic and mark a node as stale only after 5 minutes instead of 90s. 5 minutes should be enough most of the time for the nodes to recover.
> Second: we can decrease ozone.scm.pipeline.destroy.timeout. Ideally the close command is sent by the SCM to the datanode with a heartbeat. Between two heartbeats we have enough time to close all the containers via Ratis. With the next heartbeat, the datanode can report the successful close. (If the containers cannot be closed, the SCM can manage the QUASI_CLOSED containers.)
> We need to wait 29 seconds (worst case) for the next heartbeat, and 29+30 seconds for the confirmation. --> 66 seconds seems to be a safe choice (assuming that 6 seconds is enough to process the report about the successful closing).
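> To make the arithmetic concrete, a small back-of-the-envelope calculation (the 30-second heartbeat interval and the 6-second report-processing margin are the assumptions stated above, not measured values):
> {code:java}
> public class DestroyTimeoutEstimate {
>   public static void main(String[] args) {
>     final int heartbeatIntervalSec = 30; // assumed datanode heartbeat period
>     final int reportProcessingSec = 6;   // assumed SCM time to process the close report
>
>     // Worst case: the close command just misses a heartbeat and is only
>     // delivered with the next one, up to 29s later.
>     int waitForCloseCommandSec = heartbeatIntervalSec - 1;
>
>     // The datanode closes its containers and reports back with the
>     // heartbeat after that: 29s + 30s.
>     int waitForConfirmationSec = waitForCloseCommandSec + heartbeatIntervalSec;
>
>     // 59s + 6s = 65s, which the issue rounds up to a 66s safe choice.
>     System.out.println((waitForConfirmationSec + reportProcessingSec) + "s");
>   }
> }
> {code}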