[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012687#comment-17012687
 ] 

Ayush Saxena edited comment on HDFS-15067 at 1/10/20 11:57 AM:
---------------------------------------------------------------

Thanks [~surendrasingh] for the design. Had a quick look.
* I guess the datanode will not be receiving any work from the standby 
namenode, so the heartbeat interval for the standby will always be at the 
configured max. So I think on failover we should reset the counter to the 
start; otherwise the new ACTIVE will receive its first few heartbeats at the 
max interval.
* On a Connection Exception, or any other connection issue with the namenode, 
the counter should be reset as well. I think there is some re-registration 
check and lease logic tied to the first heartbeat.
* The default value is 3, and an invalid value falls back to 
{{StaleInterval - 1 HeartBeat}}. Both seem quite extreme, the former on the 
low side and the latter on the high side. I think we can use a percentage of 
the stale interval instead, maybe 40% or 50%.
* Just an opinion: the standby and observer will reach the max skip interval 
anyway, so maybe we can move them directly to the max value after the first 
heartbeat rather than growing exponentially.
* nit: when the specified value is overridden, there should be a warn log 
stating that the specified value is more than the stale interval, using 
default of..
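
To make the points above concrete, here is a minimal sketch of how such a counter could behave. It is not the code in HDFS-15067.01.patch; the class and method names are hypothetical, and the 50% cap is just one of the values suggested above. The 3 s heartbeat and 30 s stale interval used in {{main}} are the HDFS defaults.

```java
/** Hypothetical sketch of the suggested backoff; not taken from the patch. */
class HeartbeatBackoffSketch {
    // Upper bound on the exponent so that base << shift cannot overflow a long.
    private static final int MAX_SHIFT = 20;

    private final long baseIntervalMs;
    private final long maxIntervalMs;
    private int idleHeartbeats = 0;

    HeartbeatBackoffSketch(long baseIntervalMs, long staleIntervalMs, long configuredMaxMs) {
        this.baseIntervalMs = baseIntervalMs;
        this.maxIntervalMs = resolveMaxIntervalMs(configuredMaxMs, baseIntervalMs, staleIntervalMs);
    }

    /** Accepts the configured max if it is sane; otherwise warns and uses 50% of the stale interval. */
    static long resolveMaxIntervalMs(long configuredMaxMs, long baseMs, long staleMs) {
        if (configuredMaxMs >= baseMs && configuredMaxMs < staleMs) {
            return configuredMaxMs;
        }
        long fallback = Math.max(baseMs, staleMs / 2);
        System.err.println("WARN: max heartbeat interval " + configuredMaxMs
            + " ms is outside [" + baseMs + ", " + staleMs + ") ms; using " + fallback + " ms");
        return fallback;
    }

    /** Interval doubles with every idle heartbeat and is capped at the max. */
    long nextIntervalMs() {
        return Math.min(maxIntervalMs, baseIntervalMs << idleHeartbeats);
    }

    /** Call after each heartbeat response; gotWork means the namenode sent commands. */
    void onHeartbeatResponse(boolean gotWork) {
        if (gotWork) {
            reset();
        } else {
            idleHeartbeats = Math.min(idleHeartbeats + 1, MAX_SHIFT);
        }
    }

    /** New ACTIVE should see heartbeats at the normal interval right away. */
    void onFailover() { reset(); }

    /** Connection problems: start over, since re-registration may follow. */
    void onConnectionError() { reset(); }

    /** Standby/observer: skip the exponential ramp and go straight to the max. */
    void jumpToMax() { idleHeartbeats = MAX_SHIFT; }

    private void reset() { idleHeartbeats = 0; }

    public static void main(String[] args) {
        HeartbeatBackoffSketch hb = new HeartbeatBackoffSketch(3000L, 30000L, 15000L);
        for (int i = 0; i < 5; i++) {
            System.out.println("next heartbeat in " + hb.nextIntervalMs() + " ms");
            hb.onHeartbeatResponse(false); // no work allocated
        }
        hb.onFailover();
        System.out.println("after failover: " + hb.nextIntervalMs() + " ms");
    }
}
```

With these numbers the interval goes 3000, 6000, 12000 and then stays at 15000 ms until a reset brings it back to 3000 ms.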

Will try to check the code further in a couple of days!



> Optimize heartbeat for large cluster
> ------------------------------------
>
>                 Key: HDFS-15067
>                 URL: https://issues.apache.org/jira/browse/HDFS-15067
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>    Affects Versions: 3.1.1
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Major
>         Attachments: HDFS-15067.01.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends some time processing heartbeats. For 
> example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 sec. This impacts client response time. The heartbeat can be 
> optimized: a DN can start skipping every other heartbeat if no 
> work (write/replication/delete) has been allocated to it for a long time, so 
> it sends a heartbeat every 6 sec. Once the DN starts getting work from the 
> NN, it can go back to sending heartbeats normally.
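
As a rough check of the numbers in the description, the heartbeat RPC rate is simply the datanode count divided by the heartbeat interval. The sketch below is only back-of-the-envelope arithmetic; the class and method names are hypothetical, and it assumes every datanode is idle and has moved to the longer interval.

```java
/** Hypothetical estimate of heartbeat RPC load on the Namenode. */
class HeartbeatLoadEstimate {
    /** Average heartbeat RPCs per second for a cluster heartbeating at a fixed interval. */
    static double heartbeatRpcsPerSecond(int datanodes, double intervalSeconds) {
        return datanodes / intervalSeconds;
    }

    public static void main(String[] args) {
        int datanodes = 10_000;
        System.out.printf("3 sec interval: %.0f RPCs/sec%n", heartbeatRpcsPerSecond(datanodes, 3.0));
        System.out.printf("6 sec interval: %.0f RPCs/sec%n", heartbeatRpcsPerSecond(datanodes, 6.0));
    }
}
```

Doubling the interval for idle datanodes halves the heartbeat load, from roughly 3,333 to roughly 1,667 RPCs per second in the 10K-node example.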



