[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer updated HDFS-7400:
-----------------------------------
    Labels: BB2015-05-TBR  (was: )

> More reliable namenode health check to detect OS/HW issues
> ----------------------------------------------------------
>
>                 Key: HDFS-7400
>                 URL: https://issues.apache.org/jira/browse/HDFS-7400
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7400.patch
>
>
> We had this scenario on an active NN machine.
> * The disk array controller firmware has a bug, so the disks stop working.
> * ZKFC and NN still consider the node healthy; communication between ZKFC
> and ZK, as well as between ZKFC and NN, is fine.
> * The machine can be pinged.
> * The machine can't be reached over SSH.
> So no client or DN can use the NN, yet ZKFC and NN still consider the
> node healthy.
> The question is how ZKFC and NN can detect such OS/HW-specific issues
> quickly. Some ideas we discussed briefly:
> * Have other machines help decide whether the NN is actually healthy.
> Then you have to figure out how to keep that decision accurate in the
> case of network issues, etc.
> * Run an OS/HW health check script external to ZKFC/NN on the same machine.
> If it detects disk or other issues, it can, for example, reboot the machine.
> * Run an OS/HW health check script inside ZKFC/NN. For example, NN's
> HAServiceProtocol#monitorHealth can be modified to call such a health check
> script.
> Thoughts?
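
To make the "OS/HW health check script" idea concrete, here is a minimal sketch of a disk probe such a script might run. It is not taken from HDFS-7400.patch, and every name in it (disk_write_probe, failing_directories, the timeout value) is hypothetical. It assumes that a failed disk shows up as a small synchronous write that either errors out or hangs, so the write is done in a child process and the caller only waits a bounded time for it.

```python
import subprocess
import sys
import tempfile

# Code run in the child process: create a small file in the target
# directory, force it to disk, then remove it.
_PROBE_CODE = (
    "import os, sys, tempfile\n"
    "fd, path = tempfile.mkstemp(dir=sys.argv[1])\n"
    "try:\n"
    "    os.write(fd, b'probe')\n"
    "    os.fsync(fd)\n"
    "finally:\n"
    "    os.close(fd)\n"
    "    os.unlink(path)\n"
)


def disk_write_probe(directory, timeout_s=5.0):
    """Return True if a small fsync'ed write in `directory` finishes in time.

    Hypothetical helper. The write runs in a child process so that an I/O
    call stuck on a dead disk cannot block the caller.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE_CODE, directory],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Do not wait after kill(): a child blocked in uninterruptible
        # I/O may never exit, and waiting would hang the health check too.
        proc.kill()
        return False


def failing_directories(directories=None, timeout_s=5.0):
    """Return the directories whose write probe failed or timed out."""
    if directories is None:
        # Assumption for illustration only; a real check would probe the
        # NN storage directories instead.
        directories = [tempfile.gettempdir()]
    return [d for d in directories if not disk_write_probe(d, timeout_s)]
```

A wrapper script could exit non-zero whenever failing_directories(...) returns a non-empty list. Whether that result triggers a reboot (the external-script idea) or is reported through monitorHealth (the in-process idea) is the open design question in this issue.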