[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156133#comment-17156133
 ] 

Uma Maheswara Rao G commented on HDFS-15067:
--------------------------------------------

Hi [~surendrasingh], interesting proposal. I have a few questions, though:
- Let's say a DN has not had any work for some time and it starts skipping 
heartbeats. If the NN assigns replication work to this node while it is 
skipping, that work will just stay in the NN-side DatanodeDescriptor. Since no 
heartbeats are received, the DN will not pick up that work from the NN, right? 
So the assigned replication can be delayed? Am I missing something?
- We also report xceiver counts (and a lot of other metrics) in heartbeats, 
which are used when choosing good nodes, etc. I am wondering whether the NN's 
approximation of these values would drift too far from the actual ones?
- I saw in your proposal that there is at least one heartbeat per stale 
interval. I feel relying on one heartbeat may be risky, as it can be delayed or 
can fail due to network fluctuations. So there is a risk that you wrongly 
declare that node as stale?
- Has this shown some benefit in your cluster? I mean in response time, etc.
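
To make the first and third questions concrete, here is a minimal toy model in Java. It is not Hadoop code: the class and method names are hypothetical, heartbeats are assumed to fire at exact multiples of the interval with no jitter, and the 30 sec stale interval used in main is an assumed value for illustration. It shows how long assigned work can wait for the next heartbeat, and how many consecutive heartbeats can be lost before the gap exceeds the stale interval.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical toy model, not Hadoop code: the NN queues work per DN and
// hands it over only in the reply to that DN's heartbeat.
class HeartbeatDelaySketch {

    /** Stand-in for the per-DN pending work kept on the NN side. */
    static final class PendingWork {
        private final Queue<String> commands = new ArrayDeque<>();

        void assign(String command) {
            commands.add(command);
        }

        /** Work leaves the NN only when a heartbeat arrives. */
        List<String> drainOnHeartbeat() {
            List<String> delivered = new ArrayList<>(commands);
            commands.clear();
            return delivered;
        }
    }

    /**
     * Seconds between work being assigned and the next heartbeat, assuming
     * heartbeats fire at 0, interval, 2 * interval, ... with no jitter.
     */
    static long deliveryDelaySec(long assignedAtSec, long heartbeatIntervalSec) {
        long remainder = assignedAtSec % heartbeatIntervalSec;
        return remainder == 0 ? 0 : heartbeatIntervalSec - remainder;
    }

    /**
     * Consecutive lost heartbeats that still keep the gap between received
     * heartbeats within the stale interval: k losses give a gap of
     * (k + 1) * interval, so k = staleInterval / interval - 1.
     */
    static long lostHeartbeatsTolerated(long staleIntervalSec, long heartbeatIntervalSec) {
        return Math.max(0, staleIntervalSec / heartbeatIntervalSec - 1);
    }

    public static void main(String[] args) {
        PendingWork work = new PendingWork();
        work.assign("replicate blk_1"); // hypothetical command label
        long assignedAt = 1;            // one second after a heartbeat

        System.out.println("wait at 3s interval: " + deliveryDelaySec(assignedAt, 3) + "s");
        System.out.println("wait at 6s interval: " + deliveryDelaySec(assignedAt, 6) + "s");
        System.out.println("delivered on heartbeat: " + work.drainOnHeartbeat());

        long staleSec = 30; // assumed stale interval, for illustration only
        System.out.println("losses tolerated at 3s: " + lostHeartbeatsTolerated(staleSec, 3));
        System.out.println("losses tolerated at 30s: " + lostHeartbeatsTolerated(staleSec, staleSec));
    }
}
```

Under these assumptions, sending only one heartbeat per stale interval leaves no room for a lost or late heartbeat, which is the risk raised in the third question.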


> Optimize heartbeat for large cluster
> ------------------------------------
>
>                 Key: HDFS-15067
>                 URL: https://issues.apache.org/jira/browse/HDFS-15067
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>    Affects Versions: 3.1.1
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Major
>         Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> HDFS-15067.03.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends some time processing heartbeats. For 
> example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 sec. This impacts client response time. This heartbeat can be 
> optimized: a DN can start skipping every other heartbeat if no 
> work (write/replication/delete) has been allocated to it for a long time, so 
> that it sends a heartbeat every 6 sec. Once the DN starts getting work from 
> the NN, it can resume sending heartbeats normally.
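
The load figures in the description can be checked with a small calculation. The sketch below is plain arithmetic in Java with hypothetical names, not Hadoop code, and it assumes every DN sends exactly one heartbeat RPC per interval. It compares the average heartbeat RPC rate at the 3 sec and 6 sec intervals for a 10K-node cluster.

```java
// Hypothetical helper, not Hadoop code: average heartbeat RPC rate seen by
// the NN, assuming each DN sends exactly one heartbeat per interval.
class HeartbeatLoadSketch {

    static double heartbeatRpcsPerSecond(int dataNodes, double intervalSec) {
        return dataNodes / intervalSec;
    }

    public static void main(String[] args) {
        int dataNodes = 10_000; // cluster size from the issue description

        double normal = heartbeatRpcsPerSecond(dataNodes, 3.0);
        double relaxed = heartbeatRpcsPerSecond(dataNodes, 6.0);

        System.out.printf("3s interval: %.0f RPC/s%n", normal);
        System.out.printf("6s interval: %.0f RPC/s%n", relaxed);
        System.out.printf("reduction if every DN is idle: %.0f%%%n",
                100.0 * (normal - relaxed) / normal);
    }
}
```

The halving is an upper bound: it applies only if every DN is idle, and DNs that are receiving work keep the normal interval.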



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

