[ https://issues.apache.org/jira/browse/HDFS-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eli Collins updated HDFS-2182:
------------------------------

    Attachment: hdfs-2182-1.patch

Here's a patch that illustrates a potential fix. DX#run throws all exceptions by wrapping them in a RuntimeException; this way they are plumbed up to DXS#run, which then inspects the cause and just logs if it's an IOE, and shuts down otherwise.

> Exceptions in DataXceiver#run can result in a zombie datanode
> --------------------------------------------------------------
>
>                 Key: HDFS-2182
>                 URL: https://issues.apache.org/jira/browse/HDFS-2182
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>         Attachments: hdfs-2182-1.patch
>
>
> DataXceiver#run currently swallows all exceptions. It should instead plumb
> them up to DataXceiverServer#run so it can decide whether the exception
> should be tolerated or the daemon should exit. An IOE should be tolerated
> (because it's likely just an issue with a particular thread, or an
> intermittent failure), as it is today, but e.g. a j.l.Error should not.
> This came up in the following bug I'm seeing on a test cluster: if there's e.g.
> a NoClassDefFoundError thrown in DataXceiver#run (because the host jars were
> replaced out from underneath it, it ran out of descriptors, etc.) we'll end
> up with a datanode that is alive but always fails because it can't create any
> DataXceiver threads. In this case the datanode should shut itself down rather
> than continue to run.
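
For readers without the attachment, here is a rough sketch of the wrap-and-inspect idea described above. It is not the code from hdfs-2182-1.patch: the class, the Work interface, and the Action enum are made-up names, and the worker is called synchronously for simplicity, whereas the real DataXceiver runs in its own thread.

```java
import java.io.IOException;

// Illustrative only: all names here are hypothetical, not Hadoop classes.
class XceiverSketch {
    enum Action { CONTINUE, SHUTDOWN }

    // A unit of work that may throw checked exceptions.
    interface Work { void run() throws Exception; }

    // Stand-in for DataXceiver#run: rather than swallowing what is thrown,
    // wrap it in a RuntimeException so the caller can inspect the cause.
    static void runWorker(Work work) {
        try {
            work.run();
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Stand-in for the decision in DataXceiverServer#run: an IOException is
    // tolerated (the real code would log it), anything else means shut down.
    static Action classify(RuntimeException e) {
        return (e.getCause() instanceof IOException) ? Action.CONTINUE : Action.SHUTDOWN;
    }

    static Action serve(Work work) {
        try {
            runWorker(work);
            return Action.CONTINUE;
        } catch (RuntimeException e) {
            return classify(e);
        }
    }

    public static void main(String[] args) {
        // An IOE is likely a per-connection problem, so the server keeps going.
        System.out.println(serve(() -> { throw new IOException("connection reset"); }));
        // An Error such as NoClassDefFoundError would leave a zombie, so shut down.
        System.out.println(serve(() -> { throw new NoClassDefFoundError("Missing"); }));
    }
}
```

Running main prints CONTINUE and then SHUTDOWN. Note that unchecked exceptions raised by the worker (e.g. a NullPointerException) also end up as SHUTDOWN in this sketch; whether that is the desired policy is a separate question.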