[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves reassigned SPARK-13669:
-------------------------------------

    Assignee: Saisai Shao

> Job will always fail in the external shuffle service unavailable situation
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13669
>                 URL: https://issues.apache.org/jira/browse/SPARK-13669
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>             Fix For: 2.3.0
>
>
> We are currently running into an issue with YARN work preserving restart
> enabled plus the external shuffle service.
> With work preserving enabled, the failure of a NodeManager (NM) does not
> cause its executors to exit, so those executors can still accept and run
> tasks. The problem is that when the NM is down, the external shuffle
> service on that node is inaccessible, so reduce tasks keep failing with
> "Fetch failure", and the failure of the reduce stage makes the parent
> (map) stage rerun. The tricky part is that the Spark scheduler is not
> aware of the shuffle service's unavailability, so it reschedules the map
> tasks onto executors of the failed NM, the reduce stage fails again with
> "Fetch failure", and after 4 retries the job fails.
> The core problem is that we should avoid assigning tasks to those bad
> executors (where the shuffle service is unavailable). Spark's current
> blacklist mechanism can blacklist executors/nodes based on failed tasks,
> but it does not handle this specific fetch-failure scenario. This issue
> therefore proposes improving the application-level blacklist mechanism
> to handle fetch failures (especially the external-shuffle-service
> unavailable case) by blacklisting the executors/nodes where shuffle
> fetch is unavailable.
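To make the proposal concrete, below is a minimal, hedged sketch of the kind of bookkeeping such a mechanism needs: on a fetch failure, blacklist the executor, and optionally the whole node, since with an external shuffle service it is the NM-hosted service (not any single executor) that serves map output. All names and types here are illustrative and simplified, not Spark's actual internals:

{code:scala}
import scala.collection.mutable

// Hypothetical, simplified tracker; the real fix lives inside Spark's
// scheduler/BlacklistTracker machinery, which is not reproduced here.
class FetchFailureBlacklist(blacklistNodeOnFetchFailure: Boolean) {
  private val blacklistedExecutors = mutable.Set.empty[String]
  private val blacklistedNodes = mutable.Set.empty[String]

  /** Called when a reduce task reports a fetch failure from (host, execId). */
  def onFetchFailure(host: String, execId: String): Unit = {
    if (blacklistNodeOnFetchFailure) {
      // The external shuffle service on `host` is unreachable, so every
      // executor on that host is equally unable to serve shuffle fetches:
      // blacklist the whole node, not just one executor.
      blacklistedNodes += host
    } else {
      blacklistedExecutors += execId
    }
  }

  // The task scheduler would consult these before offering resources.
  def isExecutorBlacklisted(execId: String): Boolean =
    blacklistedExecutors.contains(execId)

  def isNodeBlacklisted(host: String): Boolean =
    blacklistedNodes.contains(host)
}
{code}

The key design point is that blacklisting happens immediately on the first fetch failure, rather than after a threshold of task failures, because a dead shuffle service makes every future fetch from that node fail deterministically.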
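On the configuration side, the fix that shipped in 2.3.0 is believed to expose this behavior behind the application blacklist settings, roughly as shown below. The exact property names are an assumption here and should be verified against the documentation for your Spark version:

{code}
# Hedged example: enabling fetch-failure blacklisting (Spark 2.3+).
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.application.fetchFailure.enabled=true \
  ...
{code}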