[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961540#comment-14961540
 ] 

Jitendra Nath Pandey commented on HDFS-9239:
--------------------------------------------

bq. .. Well before node liveness is affected by inundation of IBRs and FBRs, 
the namenode performance will degrade to unacceptable level...

  Yes, indeed. But if datanodes are marked as dead in that situation, that 
completely destabilizes the system. At that point, even if we kill certain 
offending jobs, it takes a while before the NN can come back to an acceptable 
service level. This proposal should help prevent datanodes from being declared 
dead, so that the NN can recover sooner once the overload has passed.
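
  To make the point concrete, here is a minimal sketch of liveness tracking 
in which either a full heartbeat or a lightweight lifeline message refreshes 
a datanode's last-contact time. The class and method names (LivenessTracker, 
onHeartbeat, onLifeline, isDead) are hypothetical and are not taken from the 
HDFS code base or the design document; time is passed in explicitly so the 
behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; not the HDFS implementation. */
public class LivenessTracker {
    private final long expiryMillis;
    private final Map<String, Long> lastContact = new HashMap<>();

    public LivenessTracker(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Full heartbeat: a real NN would also process reports and commands here. */
    public void onHeartbeat(String nodeId, long nowMillis) {
        lastContact.put(nodeId, nowMillis);
    }

    /** Lifeline: assumed to do nothing except refresh the last-contact time. */
    public void onLifeline(String nodeId, long nowMillis) {
        lastContact.put(nodeId, nowMillis);
    }

    /** A node is considered dead if it is unknown or silent for too long. */
    public boolean isDead(String nodeId, long nowMillis) {
        Long last = lastContact.get(nodeId);
        return last == null || nowMillis - last > expiryMillis;
    }
}
```

  If heartbeats are stuck behind other calls but a lifeline still gets 
through, isDead() keeps returning false for that node, which is the outcome 
we want during an overload.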

  I think ZKFC health checks should also be separated into a different queue 
or port so that they are not blocked by other messages in the NN's call queue. 
A failover triggered only because the NN is busy is not very helpful: the other 
NN also gets busy, and we end up seeing active-standby flip-flop between the 
namenodes.
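
  As a rough illustration of the separate-queue idea (not how the NN's RPC 
server is actually built), the sketch below gives health checks their own 
single-threaded queue. All names (SeparateQueueSketch, submitGeneral, 
submitHealthCheck) are made up for this example. The demo method blocks the 
general queue and shows that a health check is still answered.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; not the NameNode RPC server. */
public class SeparateQueueSketch {
    // One handler thread per queue keeps the example small.
    private final ExecutorService generalQueue = Executors.newSingleThreadExecutor();
    private final ExecutorService healthQueue = Executors.newSingleThreadExecutor();

    /** Ordinary calls (client operations, block reports, ...) go here. */
    public Future<?> submitGeneral(Runnable call) {
        return generalQueue.submit(call);
    }

    /** Health checks never wait behind calls in the general queue. */
    public Future<String> submitHealthCheck() {
        return healthQueue.submit(() -> "HEALTHY");
    }

    public void shutdown() {
        generalQueue.shutdownNow();
        healthQueue.shutdownNow();
    }

    /** Returns true if a health check completes while the general queue is stuck. */
    public static boolean healthCheckAnsweredWhileBusy() {
        SeparateQueueSketch nn = new SeparateQueueSketch();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Simulate an overloaded call queue: the only general handler blocks.
            nn.submitGeneral(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            String status = nn.submitHealthCheck().get(5, TimeUnit.SECONDS);
            return "HEALTHY".equals(status);
        } catch (Exception e) {
            return false;
        } finally {
            release.countDown();
            nn.shutdown();
        }
    }
}
```

  With a single shared queue, the same health check would sit behind the 
blocked call and time out, which is the kind of situation that can lead to an 
unnecessary failover.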

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
