[ https://issues.apache.org/jira/browse/SPARK-26712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liupengcheng updated SPARK-26712:
---------------------------------
    Description: 
Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is enabled. However, the recovery file lives under a single fixed directory, which becomes unavailable if that disk is broken. So if the NM restarts (for example, after being killed), the shuffle service cannot start even though there are still executors on the node.

This may eventually cause job failures (if the node or its executors are not blacklisted), or at least waste resources, since shuffle fetches from this node will always fail. For long-running Spark applications the problem is more serious.

So I think we should support multiple directories (multiple disks) for this recovery file, and switch to a good directory when the disk holding the current directory is broken.

  was:
Currently, `ExecutorShuffleInfo` can be recovered from file if NM recovery enabled, however, the recovery file is under a fixed directory, which may be unavailable if Disk broken. So if a NM restart happen(may be caused by kill or some reason), the shuffle service can not start even if there are executors on the node. This may finally cause job failures(if node or executors on it not blacklisted), or at least, it will cause resource waste.(shuffle from this node always failed.) For long running spark applications, this problem may be more serious. So I think we should support multi directories(multi disk) for this recovery. and change to good directory and when the disk of current directory is broken.


> Disk broken causing YarnShuffleService not available
> ----------------------------------------------------
>
>                 Key: SPARK-26712
>                 URL: https://issues.apache.org/jira/browse/SPARK-26712
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.1.0, 2.4.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is enabled. However, the recovery file lives under a single fixed directory, which becomes unavailable if that disk is broken. So if the NM restarts (for example, after being killed), the shuffle service cannot start even though there are still executors on the node.
> This may eventually cause job failures (if the node or its executors are not blacklisted), or at least waste resources, since shuffle fetches from this node will always fail.
> For long-running Spark applications the problem is more serious.
> So I think we should support multiple directories (multiple disks) for this recovery file, and switch to a good directory when the disk holding the current directory is broken.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
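
To illustrate the proposal above, here is a minimal sketch, assuming the shuffle service is handed a list of candidate recovery directories (one per local disk): it probes each candidate and picks the first healthy one, so a broken disk no longer blocks recovery. The class name, paths, and probing strategy below are assumptions made for this example only; this is not the existing YarnShuffleService code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper (not part of Spark): picks a usable recovery directory
 * out of several candidates, e.g. one per local disk, so the shuffle service
 * can still recover ExecutorShuffleInfo when one disk is broken.
 */
public class RecoveryDirSelector {

  private final List<Path> candidates;

  public RecoveryDirSelector(List<Path> candidates) {
    this.candidates = candidates;
  }

  /** Returns the first candidate directory that can be created and written to. */
  public Path selectHealthyDir() throws IOException {
    for (Path dir : candidates) {
      if (isUsable(dir)) {
        return dir;
      }
    }
    throw new IOException("No healthy recovery directory among " + candidates);
  }

  /** A directory is usable if it can be created and a probe file can be written in it. */
  private static boolean isUsable(Path dir) {
    try {
      Files.createDirectories(dir);
      Path probe = Files.createTempFile(dir, "probe", ".tmp");
      Files.delete(probe);
      return true;
    } catch (IOException e) {
      return false; // broken or read-only disk: skip this candidate
    }
  }

  public static void main(String[] args) throws IOException {
    // Candidate recovery dirs, one per local disk (paths are made up for this example).
    RecoveryDirSelector selector = new RecoveryDirSelector(Arrays.asList(
        Paths.get("/data1/nm-recovery/yarn-shuffle"),
        Paths.get("/data2/nm-recovery/yarn-shuffle")));
    Path dir = selector.selectHealthyDir();
    System.out.println("Using recovery dir: " + dir);
    // The shuffle service would then keep its executor-registration recovery DB under `dir`.
  }
}
```

Under this scheme, on restart the service would presumably first look for an existing recovery DB in any of the candidate directories before creating a new one, so registrations written before a disk failure are still found when a different directory is selected.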