[ 
https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-29257:
-----------------------------
    Description: 
We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 8T * 12 
local disks for storage and shuffle. Occasionally, one or more disks go bad 
during computation. Sometimes this causes job-level failure, sometimes it 
does not.

The following picture shows one failed job in which all 4 task attempts were 
delivered to the same node and failed with almost the same exception while 
writing the index temporary file to the same bad disk.

 

This is caused by two things:
 # As we can see in the figure, this node offers the best data locality for 
reading the input from HDFS. With the default spark.locality.wait (3s) in 
effect, there is a high probability that all of these attempts will be 
scheduled to this node.
 # The index file and data file names for a particular shuffle map task are 
fixed: they are formed from the shuffle id, the map id, and the noop reduce 
id, which is always 0. The root local dir is picked by taking the fixed file 
name's non-negative hash code % the number of disks, so this value is also 
fixed. Even with 12 disks in total and only one of them broken, once the 
broken disk is picked, every subsequent attempt of this task will inevitably 
pick the broken one.
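
The deterministic mapping in point 2 can be sketched as follows (a minimal 
Python model, not Spark's actual Scala source; the pick_disk helper and the 
shuffle_<shuffleId>_<mapId>_0.index naming are assumptions modeled on Spark's 
shuffle block ids and non-negative-mod dir selection):

```python
def java_string_hashcode(s: str) -> int:
    """Java String.hashCode semantics (signed 32-bit), as applied to file names."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def non_negative_mod(x: int, mod: int) -> int:
    """Map a possibly-negative hash into [0, mod). Python's % already does
    this for mod > 0; written out to mirror the Java-side helper."""
    return x % mod


def pick_disk(shuffle_id: int, map_id: int, num_disks: int = 12) -> int:
    # Hypothetical index-file name: shuffle id, map id, and the noop reduce
    # id, which is always 0 -- so the name is identical on every attempt.
    name = f"shuffle_{shuffle_id}_{map_id}_0.index"
    return non_negative_mod(java_string_hashcode(name), num_disks)


# Every retry of the same map task computes the same disk index, so if that
# disk is the broken one, all 4 attempts hit it:
attempts = [pick_disk(3, 42) for _ in range(4)]
```

Because nothing attempt-specific enters the file name, retries cannot rotate 
to a healthy disk on the same node.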

 

 

!image-2019-09-26-16-44-48-554.png!



> All Task attempts scheduled to the same executor inevitably access the same 
> bad disk
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-29257
>                 URL: https://issues.apache.org/jira/browse/SPARK-29257
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: Kent Yao
>            Priority: Major
>         Attachments: image-2019-09-26-16-44-48-554.png
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
