[ https://issues.apache.org/jira/browse/YARN-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205259#comment-16205259 ]

Wangda Tan commented on YARN-7327:
----------------------------------

[~CraigI], 

If you want to try the latest async scheduling in the Capacity Scheduler, you 
don't need to change application code. It is a global scheduler config in 
capacity-scheduler.xml:

{code}
  <property>
    <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
    <value>true</value>
  </property>
{code} 

In addition, YARN 2.9.0/3.0.0 supports specifying multiple scheduling threads 
(the default is 1) to allocate containers:

{code}
  <property>
    <name>yarn.scheduler.capacity.schedule-asynchronously.maximum-threads</name>
    <value>4</value>
  </property>
{code} 
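
Putting the two together, a minimal capacity-scheduler.xml fragment looks like 
the sketch below (the thread count of 4 is illustrative; tune it for your 
cluster):

{code}
  <!-- Enable asynchronous container allocation (decoupled from node
       heartbeats). -->
  <property>
    <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
    <value>true</value>
  </property>

  <!-- Number of async scheduling threads; 4 is an example, the default is 1. -->
  <property>
    <name>yarn.scheduler.capacity.schedule-asynchronously.maximum-threads</name>
    <value>4</value>
  </property>
{code}

One caveat: I'd assume these properties are read when the scheduler 
initializes, so plan on restarting the ResourceManager after changing them 
rather than relying on yarn rmadmin -refreshQueues.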

From the test report 
(https://issues.apache.org/jira/secure/attachment/12831662/YARN-5139-Concurrent-scheduling-performance-report.pdf), 
the multi-thread + async approach can improve scheduler throughput (and 
shorten allocation delays) significantly.

Please let me know how it goes on your side; I'm happy to answer any questions 
you have.

> CapacityScheduler: Allocate containers asynchronously by default
> ----------------------------------------------------------------
>
>                 Key: YARN-7327
>                 URL: https://issues.apache.org/jira/browse/YARN-7327
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Craig Ingram
>            Priority: Trivial
>         Attachments: yarn-async-scheduling.png
>
>
> I was recently doing some research into Spark on YARN's startup time and 
> observed slow, synchronous allocation of containers/executors. I am testing 
> on a 4-node bare-metal cluster with 48 cores and 128 GB of memory per node. 
> YARN was only allocating about 3 containers per second. Moreover, when 
> starting 3 Spark applications at the same time, each requesting 44 
> containers, the first application would get all 44 requested containers, 
> then the next application would start getting containers, and so on.
>  
> From looking at the code, it appears this is by design. There is an 
> undocumented configuration variable that will enable asynchronous allocation 
> of containers. I'm sure I'm missing something, but why is this not the 
> default? Is there a bug or race condition in this code path? I've done some 
> testing with it, and it has been working and is significantly faster.
>  
> Here's the config:
> `yarn.scheduler.capacity.schedule-asynchronously.enable`
>  
> Any help understanding this would be appreciated.
>  
> Thanks,
> Craig
>  
> If you're curious about the performance difference with this setting, here 
> are the results:
>  
> The following tool was used for the benchmarks:
> https://github.com/SparkTC/spark-bench
> h2. Async scheduler research
> The goal of this test is to determine whether running Spark on YARN with 
> async scheduling of containers reduces the amount of time required for an 
> application to receive all of its requested resources. This setting should 
> also reduce the overall runtime of short-lived applications, stages, or 
> notebook paragraphs, and could prove crucial to achieving optimal 
> performance when sharing resources on a cluster with dynamic allocation 
> (dynalloc) enabled.
> h3. Test Setup
> The following property must be toggled in 
> /etc/hadoop/conf/capacity-scheduler.xml (directly or through Ambari) 
> between runs:  
> `yarn.scheduler.capacity.schedule-asynchronously.enable=true|false`
> The conf files request executor counts of (see the sketch after this list):  
> * 2
> * 20
> * 50
> * 100
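> For reference, each executor count above maps to a fixed-size resource 
> request; in plain Spark conf terms the equivalent would look roughly like 
> this (illustrative only; the actual spark-bench conf syntax differs):
> {code}
> # Request a fixed number of executors for a run (example value).
> spark.executor.instances=50
> {code}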
> The apps are submitted to the default queue on each cluster, which caps at 
> 48 cores on dynalloc and 72 cores on baremetal. The default queue was 
> expanded for the last two tests on baremetal so it could potentially take 
> advantage of all 144 cores (see the queue-config sketch below).
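> Expanding the default queue amounts to raising its capacity in 
> capacity-scheduler.xml, roughly along these lines (the percentages are 
> illustrative, not the exact values used):
> {code}
>   <!-- Illustrative only: let the default queue use the whole cluster. -->
>   <property>
>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>     <value>100</value>
>   </property>
> {code}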
> h3. Test Environments
> h4. dynalloc
> 4 VMs in Fyre (1 master, 3 workers)
> 8 CPUs/16 GB per node
> model name    : QEMU Virtual CPU version 2.5+  
> h4. baremetal
> 4 baremetal instances in Fyre (1 master, 3 workers)
> 48 CPUs/128 GB per node
> model name    : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz  
> h3. Using spark-bench with timedsleep workload (sync scheduling)
> h4. dynalloc
> ||requested containers||avg (s)||stdev (s)||
> |2 | 23.814900 | 1.110725|
> |20 | 29.770250 | 0.830528|
> |50 | 44.486600 | 0.593516|
> |100 | 44.337700 | 0.490139|
> h4. baremetal - 2 queues splitting cluster 72 cores each
> ||requested containers||avg (s)||stdev (s)||
> |2 | 14.827000 | 0.292290|
> |20 | 19.613150 | 0.155421|
> |50 | 30.768400 | 0.083400|
> |100 | 40.931850 | 0.092160|
> h4. baremetal - 1 queue to rule them all - 144 cores
> ||requested containers||avg (s)||stdev (s)||
> |2 | 14.833050 | 0.334061|
> |20 | 19.575000 | 0.212836|
> |50 | 30.765350 | 0.111035|
> |100 | 41.763300 | 0.182700|
> h3. Using spark-bench with timedsleep workload (async scheduling)
> h4. dynalloc
> ||requested containers||avg (s)||stdev (s)||
> |2 | 22.575150 | 0.574296|
> |20 | 26.904150 | 1.244602|
> |50 | 44.721800 | 0.655388|
> |100 | 44.570000 | 0.514540|
> h5. 2nd run  
> ||requested containers||avg (s)||stdev (s)||
> |2 | 22.441200 | 0.715875|
> |20 | 26.683400 | 0.583762|
> |50 | 44.227250 | 0.512568|
> |100 | 44.238750 | 0.329712|
> h4. baremetal - 2 queues splitting cluster 72 cores each
> ||requested containers||avg (s)||stdev (s)||
> |2 | 12.902350 | 0.125505|
> |20 | 13.830600 | 0.169598|
> |50 | 16.738050 | 0.265091|
> |100 | 40.654500 | 0.111417|
> h4. baremetal - 1 queue to rule them all - 144 cores
> ||requested containers||avg (s)||stdev (s)||
> |2 | 12.987150 | 0.118169|
> |20 | 13.837150 | 0.145871|
> |50 | 16.816300 | 0.253437|
> |100 | 23.113450 | 0.320744|


