Ming Ma created HDFS-7400:
-----------------------------

             Summary: More reliable namenode health check to detect OS/HW issues
                 Key: HDFS-7400
                 URL: https://issues.apache.org/jira/browse/HDFS-7400
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


We had this scenario on an active NN machine.

* A bug in the disk array controller firmware caused the disks to stop working.
* ZKFC and NN still considered the node healthy; communication between ZKFC 
and ZK, as well as between ZKFC and NN, was fine.
* The machine could be pinged.
* The machine could not be reached via ssh.

As a result, no client or DN could use the NN, yet ZKFC and NN still 
considered the node healthy.

The question is how we can have ZKFC and NN detect such OS/HW-specific issues 
quickly. Some ideas we discussed briefly:

* Have other machines help decide whether the NN is actually healthy. Then 
you have to figure out how to keep that decision accurate in the case of 
network issues, etc.
* Run an OS/HW health check script external to ZKFC/NN on the same machine. 
If it detects disk or other issues, it can, for example, reboot the machine.
* Run an OS/HW health check script inside ZKFC/NN. For example, NN's 
HAServiceProtocol#monitorHealth can be modified to call such a health check 
script.
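
To make the third idea concrete, here is a minimal sketch of an in-process 
disk check. It is not existing Hadoop code: the class DiskHealthProbe, the 
method isDiskHealthy and the timeout parameter are hypothetical names chosen 
for illustration, and wiring it into HAServiceProtocol#monitorHealth is not 
shown. It assumes a stuck disk shows up as a write or fsync that never 
returns, so the probe runs on a separate thread and is bounded by a timeout.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Hadoop: probes a directory by writing
// and syncing a tiny file, and treats a slow or failed write as unhealthy.
final class DiskHealthProbe {

    // Returns true only if the probe write completes within timeoutMillis.
    static boolean isDiskHealthy(Path dir, long timeoutMillis) {
        // Daemon thread: a probe stuck in blocked I/O must not keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-health-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Boolean> result = executor.submit(() -> writeAndSync(dir));
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // I/O did not finish in time: assume the disk is hung
        } catch (ExecutionException e) {
            return false; // the write or sync failed with an I/O error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow(); // does not wait for a stuck probe thread
        }
    }

    // Writes one byte to a temporary file in dir and forces it to the device.
    private static boolean writeAndSync(Path dir) throws IOException {
        Path probe = Files.createTempFile(dir, "health-probe", ".tmp");
        try (FileChannel ch = FileChannel.open(probe, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(new byte[] {1}));
            ch.force(true);
            return true;
        } finally {
            Files.deleteIfExists(probe);
        }
    }
}
```

A caller would pass a directory on the disk array in question and a timeout 
well below the failover budget. A check like this only covers the disk 
symptom described above; other OS/HW failures would need their own probes.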

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
