[ https://issues.apache.org/jira/browse/HDFS-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated HDFS-7521:
--------------------------
    Attachment: HDFS-7521.patch

> Refactor DN state management
> ----------------------------
>
>                 Key: HDFS-7521
>                 URL: https://issues.apache.org/jira/browse/HDFS-7521
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>         Attachments: DNStateMachines.png, HDFS-7521.patch
>
>
> There are two aspects w.r.t. DN state management in NN.
> * State machine management within active NN
> NN maintains the state of each data node, i.e. whether it is running or
> being decommissioned. But the state machine isn’t well defined. We have dealt
> with some corner-case bugs in this area. It would be useful to refactor
> the code to use a clear state machine definition that defines events, available
> states and actions for state transitions. This has the following benefits.
> ** Make it easy to define correctness of DN state management. Currently some
> of the state transitions aren't defined in the code. For example, if admins
> remove a node from the include host file while the node is being decommissioned,
> it will be transitioned to DEAD and DECOMM_IN_PROGRESS. That might not be the
> intention. If we have a state machine definition, we can identify this case.
> ** Make it easy to add new states for DN later. For example, people have
> discussed a new “maintenance” state for DN to support the scenario where admins
> need to take the machine/rack down for 30 minutes for repair.
> We can refactor DN with a clear state machine definition based on YARN's
> state-related components.
> * State machine consistency between active and standby NN
> Another dimension of state machine management is consistency across NN pairs.
> We have dealt with bugs due to different live nodes between active NN and
> standby NN. The current design is to have each NN manage its own state based on
> the events it receives. For example, DNs will send heartbeats to both NNs;
> admins will issue decommission commands to both NNs. An alternative design
> approach could be to have ZK manage the state.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
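
As a rough illustration of the "clear state machine definition" idea in the description above, here is a minimal Java sketch. It is not taken from the attached patch and does not use YARN's state machine classes. The class name DnStateMachine, the state and event names, and the transition table itself are all assumptions made up for this example, and transition actions are left out for brevity. The point it tries to show is that once transitions are declared in one table, an event arriving in a state with no declared transition is rejected loudly instead of silently producing a combined state.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch only; not the HDFS-7521 patch and not YARN's API. */
final class DnStateMachine {
    /** Illustrative states; the names are assumptions for this example. */
    enum State { LIVE, DECOMMISSION_IN_PROGRESS, DECOMMISSIONED, DEAD }

    /** Illustrative events; the names are assumptions for this example. */
    enum Event {
        START_DECOMMISSION, STOP_DECOMMISSION, DECOMMISSION_COMPLETE,
        HEARTBEAT_EXPIRED, REGISTER, REMOVED_FROM_INCLUDE_FILE
    }

    /** Single place where every allowed (state, event) -> state is declared. */
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<>(State.class);

    static {
        // The table below is an assumption, not HDFS's actual behavior.
        allow(State.LIVE, Event.START_DECOMMISSION, State.DECOMMISSION_IN_PROGRESS);
        allow(State.LIVE, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.LIVE, Event.REMOVED_FROM_INCLUDE_FILE, State.DEAD);
        allow(State.DECOMMISSION_IN_PROGRESS, Event.DECOMMISSION_COMPLETE, State.DECOMMISSIONED);
        allow(State.DECOMMISSION_IN_PROGRESS, Event.STOP_DECOMMISSION, State.LIVE);
        allow(State.DECOMMISSION_IN_PROGRESS, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.DEAD, Event.REGISTER, State.LIVE);
        // Deliberately no entry for (DECOMMISSION_IN_PROGRESS,
        // REMOVED_FROM_INCLUDE_FILE): the undefined case surfaces as an error.
    }

    private static void allow(State from, Event on, State to) {
        TRANSITIONS
            .computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
            .put(on, to);
    }

    private State state = State.LIVE;

    State getState() {
        return state;
    }

    /** Applies the event, or throws if no transition is declared for it. */
    State handle(Event event) {
        Map<Event, State> byEvent = TRANSITIONS.get(state);
        State next = (byEvent == null) ? null : byEvent.get(event);
        if (next == null) {
            throw new IllegalStateException(
                "Undefined transition: " + state + " on " + event);
        }
        state = next;
        return next;
    }

    public static void main(String[] args) {
        DnStateMachine dn = new DnStateMachine();
        dn.handle(Event.START_DECOMMISSION);
        System.out.println(dn.getState());
        try {
            dn.handle(Event.REMOVED_FROM_INCLUDE_FILE);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Adding something like the "maintenance" state mentioned above would then amount to one new enum constant plus a few allow(...) lines, which is the kind of extensibility the description argues for.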