[ https://issues.apache.org/jira/browse/HDFS-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069220#comment-13069220 ]

Eli Collins commented on HDFS-2182:
-----------------------------------

Ah, yea, DataXceiverServer is calling start(), not run(), so an exception thrown 
in DataXceiver#run never reaches it. DataXceiver has a reference to the datanode, 
so it can just set shouldRun to false in the case of a non-IOE. Much simpler. 
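
A minimal Java sketch of both points, under stated assumptions: Node and Worker are hypothetical stand-ins for the datanode and DataXceiver (not Hadoop's real classes), and the volatile shouldRun field only mimics the flag mentioned above. It shows that an exception thrown in run() on a thread launched with start() never reaches the caller, and that a worker holding a reference to its node can clear the flag on a non-IOE instead.

```java
import java.io.IOException;

// Hypothetical sketch: Node and Worker are illustrative stand-ins for the
// datanode and DataXceiver, not Hadoop's actual classes or fields.
public class ShouldRunSketch {

    /** Stand-in for the datanode; a server loop would poll shouldRun. */
    static class Node {
        volatile boolean shouldRun = true;
    }

    /** Stand-in for a per-connection worker that holds a reference to its node. */
    static class Worker implements Runnable {
        private final Node node;
        private final Throwable failure; // simulated failure, demo only

        Worker(Node node, Throwable failure) {
            this.node = node;
            this.failure = failure;
        }

        @Override
        public void run() {
            try {
                serve();
            } catch (IOException e) {
                // Tolerated: probably specific to this connection or transient.
            } catch (Throwable t) {
                // Not an IOException (e.g. NoClassDefFoundError): stop the node.
                node.shouldRun = false;
            }
        }

        private void serve() throws Throwable {
            if (failure != null) {
                throw failure;
            }
        }
    }

    /** Shows that a throw inside run() never reaches the caller of start(). */
    static boolean exceptionReachesCaller() throws InterruptedException {
        Thread t = new Thread(() -> { throw new IllegalStateException("boom"); });
        // Silence the default handler, which would print the stack trace.
        t.setUncaughtExceptionHandler((thread, e) -> { });
        try {
            t.start();
            t.join();
        } catch (RuntimeException e) {
            return true; // never reached: the exception stays on thread t
        }
        return false;
    }

    /** Runs one worker on its own thread and reports the node's flag afterwards. */
    static boolean shouldRunAfter(Throwable failure) throws InterruptedException {
        Node node = new Node();
        Thread t = new Thread(new Worker(node, failure));
        t.start();
        t.join(); // join() makes the worker's write to shouldRun visible here
        return node.shouldRun;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(exceptionReachesCaller());
        System.out.println(shouldRunAfter(new IOException("connection reset")));
        System.out.println(shouldRunAfter(new NoClassDefFoundError("SomeClass")));
    }
}
```

Running main prints false, true, false. A real server loop would also need to check the flag and exit once it is cleared; that part is left out of this sketch.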

> Exceptions in DataXceiver#run can result in a zombie datanode 
> --------------------------------------------------------------
>
>                 Key: HDFS-2182
>                 URL: https://issues.apache.org/jira/browse/HDFS-2182
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>         Attachments: hdfs-2182-1.patch
>
>
> DataXceiver#run currently swallows all exceptions; it should instead plumb 
> them up to DataXceiverServer#run so it can decide whether the exception 
> should be tolerated or the daemon should exit. An IOE should be tolerated 
> (because it's likely just an issue with a particular thread, or an 
> intermittent failure), as it is today, but e.g. java.lang.Error should not. 
> This came up in the following bug I'm seeing on a test cluster: if there's, e.g., 
> a NoClassDefFoundError thrown in DataXceiver#run (because the host jars were 
> replaced out from underneath it, it ran out of descriptors, etc.), we'll end 
> up with a datanode that is alive but always fails because it can't create any 
> DataXceiver threads. In this case the datanode should shut itself down rather 
> than continue to run.
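
The tolerate-or-exit rule in the description comes down to a type check. The hypothetical isTolerated helper below is one way to express it, assuming that only IOException and its subclasses are tolerated (the "non-IOE" rule from the comment). It also shows why the NoClassDefFoundError case lands on the fatal side: that class extends LinkageError, which extends java.lang.Error.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical helper illustrating the policy in the issue description:
// IOExceptions are tolerated; anything else (notably java.lang.Error) is fatal.
public class ToleratePolicy {

    /** Returns true if a worker may swallow t and keep serving. */
    static boolean isTolerated(Throwable t) {
        return t instanceof IOException;
    }

    public static void main(String[] args) {
        Throwable[] samples = {
            new IOException("connection reset"),
            new SocketTimeoutException("read timed out"), // an IOException subclass
            new NoClassDefFoundError("some/Class"),       // LinkageError, an Error
            new OutOfMemoryError(),                       // also an Error
        };
        for (Throwable t : samples) {
            String kind = (t instanceof Error) ? "Error" : "Exception";
            System.out.println(t.getClass().getSimpleName() + " (" + kind + "): "
                    + (isTolerated(t) ? "tolerate" : "shut down"));
        }
    }
}
```

Each sample's class, whether it is an Error, and the resulting decision are printed; only the two IOExceptions come out as "tolerate".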

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
