[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316340#comment-17316340
 ] 

chaosju commented on YARN-10475:
--------------------------------

Why adaptive heartbeat?
 * Regular heartbeats can overload the RM.
 * If the RM is overloaded, things get worse over time as events queue up.
 * Work efficiency drops because important events at the NM/AM have to wait for 
the next heartbeat before the RM learns of their status.
 * Not every heartbeat from a node or AM is important. If a node is running 
full, its heartbeats are not useful for application scheduling.
 * The RM should be able to control the heartbeats sent to itself.

How adaptive heartbeat?

1. Throttle heartbeat:
 * The HB interval is based on scheduler load (LIGHT, NORMAL, BUSY, HEAVY).
 * Statistics on the various scheduler events (processing time vs. wait 
time in queue) are collected.
 * The RM indicates the next HB interval to the NM and AM to throttle the heartbeat.

2. Event-based heartbeat:
 * Send an out-of-band heartbeat for urgent requests, such as new resource 
requests or container completions, before the heartbeat interval indicated by 
the RM has elapsed.
 * The RM can notify the AM when containers have been allocated, so that the AM 
does not have to wait for the scheduled heartbeat to get resources.
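
A minimal sketch of how the two mechanisms above might fit together, in Java. Everything in it is hypothetical: the class and method names, the wait/processing ratio thresholds, the interval values and the minimum gap for out-of-band heartbeats are illustrative assumptions, not taken from the YARN code base or from the referenced slides.

```java
/** Hypothetical sketch; none of these names exist in YARN. */
final class AdaptiveHeartbeatSketch {

    /** Load levels as listed above; how they are derived is assumed here. */
    enum SchedulerLoad { LIGHT, NORMAL, BUSY, HEAVY }

    /**
     * Classifies scheduler load from the ratio of average queue wait time
     * to average processing time of scheduler events (assumed thresholds).
     */
    static SchedulerLoad classify(double avgWaitMs, double avgProcessMs) {
        // Floor the divisor at 1 ms so an idle scheduler does not divide by zero.
        double ratio = avgWaitMs / Math.max(avgProcessMs, 1.0);
        if (ratio < 1.0) return SchedulerLoad.LIGHT;
        if (ratio < 5.0) return SchedulerLoad.NORMAL;
        if (ratio < 20.0) return SchedulerLoad.BUSY;
        return SchedulerLoad.HEAVY;
    }

    /** Next heartbeat interval the RM would hand back to an NM or AM (assumed values). */
    static long nextIntervalMs(SchedulerLoad load) {
        switch (load) {
            case LIGHT:  return 1000L;
            case NORMAL: return 2000L;
            case BUSY:   return 4000L;
            case HEAVY:  return 8000L;
            default: throw new IllegalArgumentException("unknown load: " + load);
        }
    }

    /**
     * Event-based heartbeat: send ahead of schedule only when there is an
     * urgent event and the last heartbeat is at least minGapMs old, so that
     * out-of-band heartbeats cannot themselves flood the RM.
     */
    static boolean shouldSendOutOfBand(boolean hasUrgentEvent, long nowMs,
                                       long lastSentMs, long minGapMs) {
        return hasUrgentEvent && (nowMs - lastSentMs) >= minGapMs;
    }
}
```

In this sketch the RM side would call classify() on its collected event statistics and return nextIntervalMs() in each heartbeat response, while the NM/AM side would check shouldSendOutOfBand() whenever a new resource request or container completion arrives.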

 
Reference: [https://www.slideshare.net/vsaxenavarun/venturing-into-large-hadoop-clusters]

 

I think this feature should also take the RM's load into account.

[~Jim_Brennan]

> Scale RM-NM heartbeat interval based on node utilization
> --------------------------------------------------------
>
>                 Key: YARN-10475
>                 URL: https://issues.apache.org/jira/browse/YARN-10475
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 2.10.1, 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Minor
>             Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
>         Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node CPU 
> utilization compared to overall cluster CPU utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

