[ 
https://issues.apache.org/jira/browse/HDFS-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7521:
--------------------------
    Attachment: HDFS-7521.patch

> Refactor DN state management
> ----------------------------
>
>                 Key: HDFS-7521
>                 URL: https://issues.apache.org/jira/browse/HDFS-7521
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>         Attachments: DNStateMachines.png, HDFS-7521.patch
>
>
> There are two aspects to DN state management in the NN.
> * State machine management within active NN
> The NN maintains the state of each data node, i.e. whether it is running or 
> being decommissioned, but the state machine isn’t well defined. We have dealt 
> with several corner-case bugs in this area. It would be useful to refactor 
> the code to use a clear state machine definition that defines the events, 
> available states, and actions for state transitions. This has two benefits.
> ** It makes it easy to define the correctness of DN state management. 
> Currently, some of the state transitions aren't defined in the code. For 
> example, if admins remove a node from the include host file while the node is 
> being decommissioned, it will be transitioned to DEAD and DECOMM_IN_PROGRESS. 
> That might not be the intention. With a state machine definition, we can 
> identify such cases.
> ** It makes it easy to add new DN states later. For example, people have 
> discussed a new “maintenance” state for DNs to support the scenario where 
> admins need to take a machine or rack down for 30 minutes for repair.
> We can refactor DN state management with a clear state machine definition 
> based on YARN's state-related components.
> * State machine consistency between active and standby NN
> Another dimension of state machine management is consistency across NN pairs. 
> We have dealt with bugs caused by the active NN and the standby NN seeing 
> different sets of live nodes. The current design is to have each NN manage its 
> own state based on the events it receives. For example, DNs send heartbeats to 
> both NNs, and admins issue decommission commands to both NNs. An alternative 
> design approach could be to have ZK manage the state.
> Thoughts?
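
To make the "clear state machine definition" idea above concrete, here is a minimal, self-contained sketch of an explicit transition table. It is not taken from the attached patch and does not use YARN's state machine classes; the class name, the state and event names (other than DEAD and DECOMM_IN_PROGRESS, which appear in the description), and the set of allowed transitions are all assumptions chosen for illustration.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: an explicit DN transition table. Not the HDFS-7521 patch.
final class DnStateMachineSketch {

    // Assumed state names; only DEAD and DECOMM_IN_PROGRESS come from the issue text.
    enum State { LIVE, DECOMM_IN_PROGRESS, DECOMMISSIONED, DEAD }

    // Assumed event names, invented for this example.
    enum Event {
        DECOMMISSION_REQUESTED,
        DECOMMISSION_COMPLETED,
        HEARTBEAT_EXPIRED,
        HEARTBEAT_RECEIVED,
        REMOVED_FROM_INCLUDE_FILE
    }

    // (current state, event) -> next state. Anything absent is undefined.
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS
            .computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
            .put(on, to);
    }

    static {
        // Illustrative transitions only; the real set would need to be agreed on.
        allow(State.LIVE, Event.DECOMMISSION_REQUESTED, State.DECOMM_IN_PROGRESS);
        allow(State.LIVE, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.LIVE, Event.REMOVED_FROM_INCLUDE_FILE, State.DEAD);
        allow(State.DECOMM_IN_PROGRESS, Event.DECOMMISSION_COMPLETED, State.DECOMMISSIONED);
        allow(State.DECOMM_IN_PROGRESS, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.DEAD, Event.HEARTBEAT_RECEIVED, State.LIVE);
        // Deliberately no entry for (DECOMM_IN_PROGRESS, REMOVED_FROM_INCLUDE_FILE):
        // the table forces that case to be decided explicitly instead of
        // falling out of scattered conditionals.
    }

    // Returns the next state, or empty if the transition is not defined.
    static Optional<State> next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        if (row == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(row.get(event));
    }

    public static void main(String[] args) {
        System.out.println(next(State.LIVE, Event.DECOMMISSION_REQUESTED));
        System.out.println(next(State.DECOMM_IN_PROGRESS, Event.REMOVED_FROM_INCLUDE_FILE));
    }
}
```

With a table like this, the include-file case mentioned in the description surfaces as a missing entry that a caller has to handle, and adding a state such as "maintenance" amounts to adding an enum constant plus its rows.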



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
