[ https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316340#comment-17316340 ]
chaosju commented on YARN-10475:
--------------------------------

Why adaptive heartbeat?
* Regular heartbeats can overload the RM.
* If the RM is overloaded, things get worse over time as events queue up.
* Work efficiency drops, because important events at the NM/AM must wait for the next heartbeat before the RM learns of their status.
* Not every heartbeat from a node or AM is important. If a node is running full, its heartbeats are not useful for application scheduling.
* The RM should be able to control the heartbeats sent to itself.

How adaptive heartbeat?

1. Throttle heartbeat:
* The HB interval is based on scheduler load (LIGHT, NORMAL, BUSY, HEAVY).
* Statistics on the various scheduler events (processing time vs. wait time in queue) are collected.
* The RM indicates the next HB interval to the NM and AM to throttle the heartbeat.

2. Event-based heartbeat:
* Send an out-of-band heartbeat for urgent requests, such as new resource requests or container completions, before the heartbeat interval indicated by the RM has elapsed.
* The RM can notify the AM when containers have been allocated, so that the AM does not have to wait for the scheduled heartbeat to get resources.

Reference: [https://www.slideshare.net/vsaxenavarun/venturing-into-large-hadoop-clusters]

I think this feature should also take the RM's load into account.
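To make the "throttle heartbeat" idea concrete, here is a minimal sketch in Python. It is not taken from any patch on this issue. The four load levels come from the list above, but the wait-to-processing ratio thresholds, the interval multipliers, and the function names `classify_load` and `next_heartbeat_interval_ms` are all assumptions chosen for illustration.

```python
from enum import Enum


class SchedulerLoad(Enum):
    LIGHT = "LIGHT"
    NORMAL = "NORMAL"
    BUSY = "BUSY"
    HEAVY = "HEAVY"


# Hypothetical multipliers applied to the base heartbeat interval.
# A busier scheduler asks NMs and AMs to heartbeat less often.
INTERVAL_MULTIPLIER = {
    SchedulerLoad.LIGHT: 0.5,
    SchedulerLoad.NORMAL: 1.0,
    SchedulerLoad.BUSY: 2.0,
    SchedulerLoad.HEAVY: 4.0,
}


def classify_load(avg_wait_ms: float, avg_processing_ms: float) -> SchedulerLoad:
    """Classify scheduler load from event statistics.

    Uses the ratio of queue wait time to processing time.
    The thresholds below are assumed values, not measured ones.
    """
    if avg_processing_ms <= 0:
        raise ValueError("avg_processing_ms must be positive")
    ratio = avg_wait_ms / avg_processing_ms
    if ratio < 1.0:
        return SchedulerLoad.LIGHT
    if ratio < 5.0:
        return SchedulerLoad.NORMAL
    if ratio < 20.0:
        return SchedulerLoad.BUSY
    return SchedulerLoad.HEAVY


def next_heartbeat_interval_ms(base_ms: int, load: SchedulerLoad) -> int:
    """Interval the RM would return in its heartbeat response."""
    return int(base_ms * INTERVAL_MULTIPLIER[load])
```

For example, `next_heartbeat_interval_ms(1000, classify_load(30.0, 1.0))` classifies the scheduler as HEAVY and stretches a 1000 ms base interval to 4000 ms.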
[~Jim_Brennan]

> Scale RM-NM heartbeat interval based on node utilization
> --------------------------------------------------------
>
>                 Key: YARN-10475
>                 URL: https://issues.apache.org/jira/browse/YARN-10475
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 2.10.1, 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Minor
>             Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
>         Attachments: YARN-10475-branch-3.2.003.patch,
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch,
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node CPU
> utilization compared to overall cluster CPU utilization. If a node is
> over-utilized compared to the rest of the cluster, its heartbeat interval
> slows down. If it is under-utilized compared to the rest of the cluster,
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for
> several years. It was developed by [~nroberts], based on the observation
> that larger, faster nodes on our cluster were under-utilized compared to
> smaller, slower nodes.
> This feature is dependent on [YARN-10450], which added cluster-wide
> utilization metrics.
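
The utilization-based scaling in the issue description can be illustrated with a short Python sketch. This is not the formula used in the attached patches. The linear scaling by the node-to-cluster utilization ratio, the default bounds, and the name `scaled_heartbeat_interval_ms` are assumptions made only to show the direction of the adjustment.

```python
def scaled_heartbeat_interval_ms(
    node_cpu_util: float,
    cluster_cpu_util: float,
    base_ms: int = 1000,   # assumed default interval
    min_ms: int = 500,     # assumed lower bound
    max_ms: int = 2000,    # assumed upper bound
) -> int:
    """Scale a node's heartbeat interval by its relative CPU utilization.

    A node busier than the cluster average gets a longer interval
    (it heartbeats less often); a less busy node gets a shorter one.
    """
    if cluster_cpu_util <= 0:
        # No cluster-wide signal available: fall back to the base interval.
        return base_ms
    ratio = node_cpu_util / cluster_cpu_util
    interval = round(base_ms * ratio)
    # Clamp so that no node heartbeats too rarely or too often.
    return int(min(max_ms, max(min_ms, interval)))
```

With these assumed defaults, a node at 25% CPU in a cluster averaging 50% would heartbeat every 500 ms, while a node at 100% CPU in a cluster averaging 25% would be capped at 2000 ms.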