[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated HDFS-7400:
-----------------------------------
    Status: Open  (was: Patch Available)

> More reliable namenode health check to detect OS/HW issues
> ----------------------------------------------------------
>
>                 Key: HDFS-7400
>                 URL: https://issues.apache.org/jira/browse/HDFS-7400
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7400.patch
>
>
> We hit the following scenario on an active NN machine:
> * A bug in the disk array controller firmware caused the disks to stop 
> working.
> * ZKFC and NN still considered the node healthy; communication between ZKFC 
> and ZK, as well as between ZKFC and NN, was fine.
> * The machine could be pinged.
> * The machine could not be reached over SSH.
> So no clients or DNs could use the NN, yet ZKFC and NN still considered the 
> node healthy.
> The question is how ZKFC and NN can detect such OS/HW-specific issues 
> quickly. Some ideas we discussed briefly:
> * Have other machines help decide whether the NN is actually healthy. The 
> difficulty is keeping that decision accurate in the presence of network 
> issues, etc.
> * Run an OS/HW health check script external to ZKFC/NN on the same machine. 
> If it detects disk or other issues, it can, for example, reboot the machine.
> * Run an OS/HW health check script inside ZKFC/NN. For example, NN's 
> HAServiceProtocol#monitorHealth can be modified to call such a health check 
> script.
> Thoughts?
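
As a rough illustration of the script-based ideas in the last two bullets, here is a minimal sketch of a disk probe. It is not taken from HDFS-7400.patch; the names write_probe and disk_is_healthy and the 5-second default timeout are hypothetical. The point it tries to capture is that a probe against a dead disk may block indefinitely rather than fail, so the check has to treat "did not finish in time" as unhealthy instead of waiting for an error.

```python
import os
import tempfile
import threading


def write_probe(directory):
    """Write and fsync a small temp file in `directory`; raises OSError on failure."""
    fd, path = tempfile.mkstemp(prefix=".nn_health_", dir=directory)
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # push the write through to the device, not just the page cache
    finally:
        os.close(fd)
        os.unlink(path)


def disk_is_healthy(directory, timeout_sec=5.0, probe=write_probe):
    """Return True only if probe(directory) completes without OSError in time."""
    result = {}

    def run():
        try:
            probe(directory)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: a probe stuck on a hung disk must not block the caller
    # or keep the interpreter alive at exit.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_sec)
    # No entry in `result` means the probe is still blocked: report unhealthy.
    return result.get("ok", False)
```

A wrapper script could exit non-zero when disk_is_healthy returns False for any of the directories it is asked to watch. How that exit status would be consumed (an external watchdog, or a call made from monitorHealth) is left open here, as it is in the description above.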



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
