[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhankun Tang updated YARN-8821:
-------------------------------
    Attachment: YARN-8821-trunk.007.patch

> GPU hierarchy/topology scheduling support
> -----------------------------------------
>
>                 Key: YARN-8821
>                 URL: https://issues.apache.org/jira/browse/YARN-8821
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8821-trunk.001.patch, YARN-8821-trunk.002.patch,
> YARN-8821-trunk.003.patch, YARN-8821-trunk.004.patch,
> YARN-8821-trunk.005.patch, YARN-8821-trunk.006.patch,
> YARN-8821-trunk.007.patch
>
>
> h2. Background
> GPU topology affects performance. There has been a discussion in YARN-7481, but we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework that lets a plugin supply a custom scheduler. Based on that framework, the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as follows:
> *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" to build a hash map whose keys are pairs of GPUs and whose values are the communication cost between the two GPUs of each pair. The map looks like \{"0 - 1"=> 2, "0 - 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ that caches every combination of GPUs together with the cost of that combination. The cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
> 3=>{[0,1,2]=>10,..},
> 4=>{[0,1,2,3]=>18}}
> {code}
> The key of the outer map is the number of GPUs. Its value is an inner map whose key is a combination of that many GPUs and whose value is the calculated communication cost of that combination. The cost of a combination is the sum of the costs of all distinct GPU pairs in it.
> For instance, the total cost of GPUs [0,1,2] is the sum of the costs of "0 - 1", "0 - 2" and "1 - 2". Each pair cost can be looked up in the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on topology, we provide two policies, which a container can select through the environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or "SPREAD". "PACK" prefers faster GPU-GPU communication. "SPREAD" prefers faster CPU-GPU communication (since the GPUs are not using the same bus to the CPU). The key difference between the two policies is the sort order of the inner map in the cost table. For instance, let's assume 2 GPUs are wanted. costTable.get(2) would return a map containing all combinations of two GPUs and their costs. If the policy is "PACK", we sort the map by cost in ascending order, so the first entry is the pair of GPUs with the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending order and take the first entry, which has the highest GPU-GPU cost and therefore the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> We have done an initial analysis of the topology scheduling algorithm (using the PACK policy) based on performance tests on an AWS EC2 instance with 8 GPU cards (P3).
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU combination gets a *5% to 185%* *performance gain* across the test cases, which vary factors including CNN model, batch size, GPU subset, etc.
> 2. The "inception3" and "resnet50" networks do not seem to be topology sensitive. Topology scheduling can potentially get only *about 10%* speedup for them.
> 3. Our current version of the topology scheduling algorithm can achieve a *3% to 140%* *performance gain, and the algorithm's allocations match the fastest GPUs needed by "vgg16"*.
> For "alexnet", although the fastest GPU subset is not the one the algorithm allocates, that subset ranks among the first 5 of the algorithm's candidates and has the same cost as the one picked by the algorithm. We may improve this by selecting a random combination among the first 5 candidates, since they have the same cost.
>
> In summary, the GPU topology scheduling algorithm is effective and can potentially get a 5% to 185% performance gain after more optimization.
> *This means up to about a 3X speedup compared to a random GPU scheduling algorithm in a specific scenario*.
>
> The spreadsheets are here for your reference.
>
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]
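
For readers who want to follow the three steps of the description concretely, here is a minimal sketch in Java. It is not the code from the attached patches. The class and method names (GpuTopologySketch, examplePairCosts, buildCostTable, allocate) and the pair costs for a 4-GPU node are hypothetical, chosen only so that the totals agree with the example values quoted in the description (2, 10 and 18). Parsing of "nvidia-smi topo -m" is left out, and the step 1 map is assumed to be already built. How ties between equal-cost combinations are broken is also an assumption: the sketch simply keeps enumeration order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GpuTopologySketch {

  // Step 1 result, assumed to be given: key "a - b" (a < b) -> pair cost.
  // The values below are hypothetical, not measured on real hardware.
  static Map<String, Integer> examplePairCosts() {
    Map<String, Integer> costs = new HashMap<>();
    costs.put("0 - 1", 2);
    costs.put("0 - 2", 4);
    costs.put("0 - 3", 4);
    costs.put("1 - 2", 4);
    costs.put("1 - 3", 3);
    costs.put("2 - 3", 1);
    return costs;
  }

  // Cost of a combination: sum of the costs of all distinct pairs in it.
  static int combinationCost(List<Integer> gpus, Map<String, Integer> pairCosts) {
    int cost = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        cost += pairCosts.get(gpus.get(i) + " - " + gpus.get(j));
      }
    }
    return cost;
  }

  // Step 2: cost table = GPU count -> (combination -> summed pair cost).
  static Map<Integer, Map<List<Integer>, Integer>> buildCostTable(
      int gpuCount, Map<String, Integer> pairCosts) {
    Map<Integer, Map<List<Integer>, Integer>> table = new HashMap<>();
    for (int size = 2; size <= gpuCount; size++) {
      Map<List<Integer>, Integer> inner = new LinkedHashMap<>();
      combine(0, gpuCount, size, new ArrayList<>(), pairCosts, inner);
      table.put(size, inner);
    }
    return table;
  }

  // Enumerates every combination of `size` GPU indices in ascending order.
  private static void combine(int start, int gpuCount, int size,
      List<Integer> current, Map<String, Integer> pairCosts,
      Map<List<Integer>, Integer> out) {
    if (current.size() == size) {
      out.put(new ArrayList<>(current), combinationCost(current, pairCosts));
      return;
    }
    for (int gpu = start; gpu < gpuCount; gpu++) {
      current.add(gpu);
      combine(gpu + 1, gpuCount, size, current, pairCosts, out);
      current.remove(current.size() - 1);
    }
  }

  // Step 3: PACK sorts ascending by cost, SPREAD descending; take the first.
  static List<Integer> allocate(Map<Integer, Map<List<Integer>, Integer>> table,
      int count, String policy) {
    List<Map.Entry<List<Integer>, Integer>> candidates =
        new ArrayList<>(table.get(count).entrySet());
    Comparator<Map.Entry<List<Integer>, Integer>> byCost =
        Map.Entry.comparingByValue();
    candidates.sort("SPREAD".equals(policy) ? byCost.reversed() : byCost);
    return candidates.get(0).getKey();
  }

  public static void main(String[] args) {
    Map<Integer, Map<List<Integer>, Integer>> table =
        buildCostTable(4, examplePairCosts());
    System.out.println(table.get(3));                 // all 3-GPU combinations
    System.out.println(allocate(table, 2, "PACK"));   // cheapest pair
    System.out.println(allocate(table, 3, "SPREAD")); // most expensive triple
  }
}
```

With these hypothetical costs, allocate(table, 2, "PACK") picks [2, 3] (cost 1) and allocate(table, 3, "SPREAD") picks [0, 1, 2] (cost 10). Because the sort is stable, equal-cost candidates keep their enumeration order, which is where a random choice among equal-cost candidates (as suggested for "alexnet") could be plugged in.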