Kent Yao created SPARK-29257:
--------------------------------

             Summary: All Task attempts scheduled to the same executor 
inevitably access the same bad disk
                 Key: SPARK-29257
                 URL: https://issues.apache.org/jira/browse/SPARK-29257
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.4.4, 2.3.4
            Reporter: Kent Yao


We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 local disks of 8 TB each for storage and shuffle. Sometimes one or more disks go bad during computation. Sometimes this causes a job-level failure, sometimes it does not.

The following picture shows one failed job in which 4 task attempts were all scheduled to the same node and all failed with almost the same exception while writing the temporary index file to the same bad disk:

!image-2019-09-26-15-35-29-342.png!

 

This is caused by two factors:
 # As we can see in the figure, the node holds the input data and therefore offers the best data locality for reading from HDFS. With the default spark.locality.wait (3s) in effect, there is a high probability that those attempts will all be scheduled to this node.
 # The index file or data file name for a particular shuffle map task is fixed: it is formed from the shuffle id, the map id, and the noop reduce id, which is always 0. The root local dir is picked by taking the fixed file name's non-negative hash code modulo the number of disks, so the chosen disk is also fixed. Even with 12 disks in total and only one of them broken, once the broken disk has been picked, every following attempt of this task will inevitably pick it again (see the sketch after this list).
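
A minimal, self-contained sketch of the second point, assuming a node with 12 local dirs and an illustrative shuffle id/map id; it mirrors the idea behind DiskBlockManager.getFile and Utils.nonNegativeHash (non-negative hash of the fixed file name modulo the number of local dirs) rather than reproducing Spark's actual code:

{code:scala}
// Illustrative sketch only, not the real Spark implementation.
object FixedDiskPickSketch {

  // Make the file name's hash code non-negative (same idea as Utils.nonNegativeHash).
  def nonNegativeHash(name: String): Int = {
    val h = name.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  // The root local dir is picked by hash(fileName) % number of local dirs,
  // so a fixed file name always maps to the same dir on a given node.
  def pickLocalDir(fileName: String, localDirs: Array[String]): String =
    localDirs(nonNegativeHash(fileName) % localDirs.length)

  def main(args: Array[String]): Unit = {
    // Hypothetical node with 12 local disks configured as spark.local.dir.
    val localDirs = (1 to 12).map(i => s"/data$i/spark/local").toArray

    // The index file name of a shuffle map task is fixed: shuffle id,
    // map id, and the noop reduce id (always 0).
    val indexFileName = "shuffle_3_42_0.index"

    // Every retry of this task on the same node resolves to the same dir,
    // so if that disk is bad, all attempts fail in the same way.
    (1 to 4).foreach { attempt =>
      println(s"attempt $attempt -> ${pickLocalDir(indexFileName, localDirs)}")
    }
  }
}
{code}

Because the file name never changes across attempts, the modulo result never changes either, which is exactly why every retry lands on the same (possibly broken) disk.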

 

 



