[ https://issues.apache.org/jira/browse/SPARK-26712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liupengcheng updated SPARK-26712:
---------------------------------
    Summary: Single disk broken causing YarnShuffleSerivce not available  (was: Disk broken causing YarnShuffleSerivce not available)

> Single disk broken causing YarnShuffleSerivce not available
> -----------------------------------------------------------
>
>                 Key: SPARK-26712
>                 URL: https://issues.apache.org/jira/browse/SPARK-26712
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.1.0, 2.4.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is
> enabled. However, the recovery file lives under a single directory, which may
> become unavailable if its disk breaks. So if an NM restart happens (for example,
> caused by a kill or some other reason), the shuffle service cannot start even
> if there are executors on the node.
> This may eventually cause job failures (if the node or the executors on it are
> not blacklisted), or at the very least waste resources (shuffle fetches from
> this node always fail).
> For long-running Spark applications, this problem may be more serious.
> So I think we should support multiple directories (multiple disks) for this
> recovery, and switch to a good directory when the disk of the current
> directory is broken.
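The multi-directory fallback proposed in the description could look roughly like the sketch below. It is not the Spark or YARN implementation: `RecoveryDirSelector`, `isUsable`, and `selectRecoveryDir` are hypothetical names made up for illustration, and the health check (create and delete a probe file) is an assumed stand-in for whatever disk check the shuffle service would really use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Spark: picks a usable recovery directory
// from several candidates (ideally one per disk).
final class RecoveryDirSelector {
    private RecoveryDirSelector() {}

    // Assumed health check: the directory is usable if it exists (or can be
    // created) and a small probe file can be written to it and deleted again.
    static boolean isUsable(Path dir) {
        try {
            Files.createDirectories(dir);
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    // Keep the current directory while it is healthy; otherwise fall back to
    // the first healthy candidate. Empty means no candidate is usable.
    static Optional<Path> selectRecoveryDir(List<Path> candidates, Path current) {
        if (current != null && isUsable(current)) {
            return Optional.of(current);
        }
        for (Path candidate : candidates) {
            if (!candidate.equals(current) && isUsable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

With candidates on two disks, a call such as `RecoveryDirSelector.selectRecoveryDir(candidates, current)` returns `current` while it is still writable and another candidate once it is not. The sketch only chooses a directory; recovery state written to the broken disk is lost unless it is also replicated or rebuilt, which the sketch does not attempt.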