[
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012687#comment-17012687
]
Ayush Saxena edited comment on HDFS-15067 at 1/10/20 11:57 AM:
---------------------------------------------------------------
Thanks [~surendrasingh] for the design. I had a quick look.
* I guess the datanode will not receive any work from the standby namenode, so
its heartbeat interval towards the standby will always be the max configured.
So, in case of failover, I think we should reset the counter to the start;
otherwise the new ACTIVE will receive its first few heartbeats at the max
interval.
* In case of a connection exception, or any other connection issue with the
namenode, the counter should also be reset; I think there is some
re-registration check and lease logic tied to the first heartbeat.
* For the default value, the number is 3 in the defaults, and in case of an
invalid value it shoots up to {{StaleInterval - 1 HeartBeat}}. Both seem to be
at the extremes, the first at the lower end and the latter at the higher end.
I think we can keep something as a percentage of the stale interval, maybe 40%
or 50% of it.
* Just an opinion: the standby and observer will in any case reach the max skip
interval, so maybe we can move them directly to the max value after the first
heartbeat rather than growing exponentially.
* nit: when the specified value gets changed, there should be a warn log
stating that the specified value is more than the stale interval, using
default of..
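
To make the reset and "jump straight to max" points above concrete, here is a
minimal sketch of the skip counter. It is not taken from HDFS-15067.01.patch:
the class name, method names, and the doubling policy are all assumptions made
for illustration only.

```java
/**
 * Hypothetical sketch of a per-namenode heartbeat backoff on the datanode side.
 * All names and the doubling policy are assumptions, not the patch's code.
 */
final class HeartbeatBackoff {
    enum NameNodeState { ACTIVE, STANDBY, OBSERVER }

    private final long baseIntervalMs; // normal heartbeat interval, e.g. 3000 ms
    private final long maxIntervalMs;  // upper bound for the stretched interval
    private int idleHeartbeats = 0;    // consecutive heartbeats that brought no work

    HeartbeatBackoff(long baseIntervalMs, long maxIntervalMs) {
        if (baseIntervalMs <= 0 || maxIntervalMs < baseIntervalMs) {
            throw new IllegalArgumentException("need 0 < base <= max");
        }
        this.baseIntervalMs = baseIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Record the outcome of one heartbeat to a namenode in the given state. */
    void onHeartbeatResponse(NameNodeState state, boolean receivedWork) {
        if (receivedWork) {
            idleHeartbeats = 0; // work arrived: go back to the normal interval
        } else if (state != NameNodeState.ACTIVE) {
            // Standby/observer will end up at max anyway, so jump there directly.
            idleHeartbeats = Integer.MAX_VALUE;
        } else if (idleHeartbeats < Integer.MAX_VALUE) {
            idleHeartbeats++;
        }
    }

    /** Call on failover, connection errors, or re-registration. */
    void reset() {
        idleHeartbeats = 0;
    }

    /** Interval to wait before the next heartbeat: doubles per idle beat, capped. */
    long nextIntervalMs() {
        long interval = baseIntervalMs;
        for (int i = 0; i < idleHeartbeats && interval < maxIntervalMs; i++) {
            // Guard against overflow before doubling.
            interval = (interval > maxIntervalMs / 2) ? maxIntervalMs : interval * 2;
        }
        return interval;
    }
}
```

With a 3 s base and a 30 s cap, an idle ACTIVE connection would go 3 s, 6 s,
12 s, 24 s, 30 s, while a standby would go to 30 s after its first empty
response; calling reset() on failover brings either one back to 3 s.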
Will try checking the code further in a couple of days!
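
For the default-value and warn-log points, something along these lines might
work. This is only a sketch under assumptions: the class, the method, the 50%
fraction, and the validity range are hypothetical, and java.util.logging
stands in for whatever logger the datanode code actually uses.

```java
import java.util.logging.Logger;

/** Hypothetical helper: picks the max skip interval from the configured value. */
final class MaxSkipIntervalResolver {
    private static final Logger LOG =
        Logger.getLogger(MaxSkipIntervalResolver.class.getName());

    // Assumed default: half of the stale interval (the comment suggests 40% or 50%).
    static final double DEFAULT_FRACTION_OF_STALE = 0.5;

    /**
     * Returns the configured value if it lies in [heartbeatIntervalMs, staleIntervalMs),
     * otherwise logs a warning and falls back to a fraction of the stale interval.
     */
    static long resolve(long configuredMs, long heartbeatIntervalMs, long staleIntervalMs) {
        long fallback = Math.max(heartbeatIntervalMs,
            (long) (staleIntervalMs * DEFAULT_FRACTION_OF_STALE));
        if (configuredMs < heartbeatIntervalMs || configuredMs >= staleIntervalMs) {
            LOG.warning("Specified max skip interval " + configuredMs
                + " ms is outside [" + heartbeatIntervalMs + ", " + staleIntervalMs
                + ") ms; using default of " + fallback + " ms");
            return fallback;
        }
        return configuredMs;
    }
}
```

With a 3 s heartbeat and a 30 s stale interval, a configured value of 40 s
would be rejected with a warning and replaced by 15 s, while 12 s would be
accepted as is.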
> Optimize heartbeat for large cluster
> ------------------------------------
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode
> Affects Versions: 3.1.1
> Reporter: Surendra Singh Lilhore
> Assignee: Surendra Singh Lilhore
> Priority: Major
> Attachments: HDFS-15067.01.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends some time processing heartbeats. For
> example, in a 10K-node cluster the namenode processes 10K heartbeat RPCs
> every 3 sec. This impacts the client response time. The heartbeat can be
> optimized. A DN can start skipping one heartbeat if no
> work (write/replication/delete) has been allocated to it for a long time,
> i.e. it can start sending a heartbeat every 6 sec. Once the DN starts getting
> work from the NN, it can go back to sending heartbeats normally.
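
As a rough check of the numbers in the description, the heartbeat RPC rate is
just the node count divided by the interval. The helper below is a hypothetical
illustration, and it assumes the best case where every DN is idle and skipping.

```java
/** Hypothetical helper for back-of-the-envelope heartbeat load estimates. */
final class HeartbeatLoad {
    /** Average heartbeat RPCs per second the namenode sees from dataNodes nodes. */
    static double rpcsPerSecond(int dataNodes, double heartbeatIntervalSeconds) {
        if (heartbeatIntervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return dataNodes / heartbeatIntervalSeconds;
    }
}
```

For 10K nodes this gives about 3,333 RPCs per second at a 3 sec interval and
about 1,667 per second at 6 sec, so skipping one heartbeat on idle DNs would
at most halve the heartbeat load.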