[ https://issues.apache.org/jira/browse/YARN-11411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YARN-11411:
----------------------------------
    Labels: pull-request-available  (was: )

> [Umbrella] Build Concurrent Yarn Scheduler
> ------------------------------------------
>
>                 Key: YARN-11411
>                 URL: https://issues.apache.org/jira/browse/YARN-11411
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Krishan Goyal
>            Assignee: Krishan Goyal
>            Priority: Major
>              Labels: pull-request-available
>
> We operate multiple YARN clusters, each capped at ~10k nodes, which is the
> scalability limit of a single cluster. We expect multiple benefits from
> fewer, larger clusters (better elasticity, operational simplicity, larger
> queues).
> Thus, we want to scale a single YARN cluster as far as possible in terms of
> the number of nodes heartbeating to the cluster (and proportionally increase
> the container allocation rate) without degrading the overall quantiles
> (p50 / p75 / p95) of container allocation delay.
> The scalability limit of a YARN cluster is primarily driven by the RM’s
> processing of node heartbeats and container allocation. The CPU usage of our
> RM is < 10%, and the RM is primarily bottlenecked on the global queue and
> user read/write locks taken for container allocation.
> By removing these locks (through a very naive & incorrect implementation), we
> were able to scale the RM to 25k nodes (with a proportional increase in
> container allocs/sec) at an average RM CPU utilization of 20% (so there is
> still room to use more CPU and scale up further).
> This primarily requires
> # Async scheduling to decouple scheduling from node heartbeats (existing
> feature)
> # Removing global write locks in the scheduler path (primarily those that
> maintain queues and users)
> # A multi-threaded event queue dispatcher to process events in parallel
> Additionally, we probably need to scale RPC handling, DT management,
> preemption flows, Timeline server, and RM HA failover.
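
The third item in the list above (a multi-threaded event queue dispatcher) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the design proposed in YARN-11411 and not the existing YARN dispatcher API: the class name `ShardedDispatcher` and its methods are hypothetical, and the idea of routing events by a key (for example a node or application id) so that events for the same key stay ordered is an assumption the issue text does not make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a multi-threaded event dispatcher.
 * Events are routed to one of N single-threaded workers by key hash, so
 * events with the same key run in submission order while events with
 * different keys may run in parallel. Not part of Hadoop YARN.
 */
final class ShardedDispatcher {
    private final List<ExecutorService> workers = new ArrayList<>();

    ShardedDispatcher(int numWorkers) {
        if (numWorkers <= 0) {
            throw new IllegalArgumentException("numWorkers must be positive");
        }
        for (int i = 0; i < numWorkers; i++) {
            // A single-threaded executor drains its queue in FIFO order.
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Maps a key to a worker index; floorMod keeps the result non-negative. */
    int shardFor(Object key) {
        return Math.floorMod(key.hashCode(), workers.size());
    }

    /** Queues the handler on the worker that owns this key. */
    void dispatch(Object key, Runnable handler) {
        workers.get(shardFor(key)).execute(handler);
    }

    /** Stops accepting events and waits for queued ones; false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMillis) {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            for (ExecutorService w : workers) {
                long remaining = Math.max(deadline - System.nanoTime(), 0L);
                if (!w.awaitTermination(remaining, TimeUnit.NANOSECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would invoke something like `dispatcher.dispatch(nodeId, () -> handleHeartbeat(nodeId))`, where `handleHeartbeat` is likewise a placeholder. Sharding by key is one possible way to get parallelism without giving up per-key ordering; whether that ordering guarantee is sufficient for the RM's event handlers is not something the issue description settles.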
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org