[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723500#comment-14723500
 ] 

Kihwal Lee commented on HDFS-8995:
----------------------------------

[~daryn] did further analysis:

{panel}
It's a bug during re-registration. The DN is supposed to create a registration 
object which contains the 0.0.0.0 addr, pass it to the NN which updates the 
addr and returns it, then the DN saves the updated registration for future 
calls.

The problem is the DN saves off the initial registration with 0.0.0.0 before it 
receives the NN's response. When the DN encounters an exception contacting the 
NN, it is left with the invalid registration containing 0.0.0.0.

The fix is not saving the registration until the NN updates it. There are also 
a couple of places where the DN isn't updating all references to a new 
registration.
{panel}
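
To make the ordering problem concrete, here is a minimal sketch of the buggy and fixed save order. It is not the actual DataNode code: `SketchRegistration`, `SketchNameNode`, `SketchDataNode`, and the address `10.0.0.5:50010` are hypothetical names invented for illustration, and the real patch may be structured differently.

```java
import java.io.IOException;

/** Hypothetical stand-in for a DN registration (not Hadoop's DatanodeRegistration). */
final class SketchRegistration {
    final String addr;
    SketchRegistration(String addr) { this.addr = addr; }
}

/** Hypothetical stand-in for the NN side of the registration RPC. */
interface SketchNameNode {
    /** Returns a copy of the registration with the address filled in by the NN. */
    SketchRegistration register(SketchRegistration initial) throws IOException;
}

/** Hypothetical DN that keeps one saved registration for future calls. */
final class SketchDataNode {
    private volatile SketchRegistration saved;

    /** Buggy order: the 0.0.0.0 registration is saved before the NN responds. */
    void registerBuggy(SketchNameNode nn) throws IOException {
        SketchRegistration initial = new SketchRegistration("0.0.0.0");
        saved = initial;              // saved too early
        saved = nn.register(initial); // if this throws, 0.0.0.0 stays saved
    }

    /** Fixed order: only the NN-updated registration is ever saved. */
    void registerFixed(SketchNameNode nn) throws IOException {
        SketchRegistration initial = new SketchRegistration("0.0.0.0");
        SketchRegistration updated = nn.register(initial); // may throw
        saved = updated;              // reached only on success
    }

    SketchRegistration saved() { return saved; }
}

final class RegistrationSketchDemo {
    public static void main(String[] args) throws IOException {
        SketchNameNode healthy = r -> new SketchRegistration("10.0.0.5:50010");
        SketchNameNode unreachable = r -> { throw new IOException("NN unreachable"); };

        SketchDataNode dn = new SketchDataNode();
        dn.registerFixed(healthy);
        try {
            dn.registerBuggy(unreachable);
        } catch (IOException e) {
            // The DN is now left holding the invalid 0.0.0.0 registration.
            System.out.println("after failed buggy re-registration: " + dn.saved().addr);
        }
    }
}
```

With `registerFixed`, a failed re-registration leaves the previously saved registration in place, so later calls do not carry the 0.0.0.0 address.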

> Flaw in registration bookkeeping can make DN die on reconnect
> -------------------------------------------------------------
>
>                 Key: HDFS-8995
>                 URL: https://issues.apache.org/jira/browse/HDFS-8995
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> Normally, datanodes re-register with the namenode when it has been 
> unreachable for longer than the heartbeat expiration and then becomes 
> reachable again. A datanode keeps retrying its last RPC call, such as an 
> incremental block report or heartbeat, and when the call finally gets 
> through, the namenode tells the datanode to re-register.
> We have observed that some datanodes stay dead in such scenarios. Further 
> investigation has revealed that they were told to shut down by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
