[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889562#comment-15889562 ]

Apache Spark commented on SPARK-13669:
--------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17113

> Job will always fail in the external shuffle service unavailable situation
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13669
>                 URL: https://issues.apache.org/jira/browse/SPARK-13669
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>            Reporter: Saisai Shao
>
> We are currently running into an issue with YARN work-preserving restart 
> enabled together with the external shuffle service. 
> With work preserving enabled, the failure of a NodeManager (NM) does not 
> cause its executors to exit, so those executors can still accept and run 
> tasks. The problem is that when the NM is down, the external shuffle 
> service it hosts is inaccessible, so reduce tasks keep failing with 
> “Fetch failure”, and the failure of the reduce stage forces the parent 
> (map) stage to be rerun. The tricky part is that the Spark scheduler is 
> not aware that the external shuffle service is unavailable: it 
> reschedules the map tasks onto executors on the node whose NM failed, 
> the reduce stage fails again with “Fetch failure”, and after 4 retries 
> the job fails.
> So the underlying problem is that Spark’s scheduler does not know the 
> external shuffle service is unavailable and keeps assigning tasks to 
> those nodes. The fix is to avoid assigning tasks to such nodes (a rough 
> sketch of the idea is shown below).
> One related configuration in Spark is 
> “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it helps 
> in this scenario: it only prevents a re-attempted task from running on 
> the same executor again. An approach like MapReduce’s blacklist 
> mechanism may not handle this scenario either, since all the reduce 
> tasks fail, so counting task failures would mark every executor as 
> “bad” equally.
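
For illustration, here is a minimal, self-contained sketch in plain Scala of
the idea described above: track hosts whose external shuffle service has been
observed to be unreachable and skip them when handing out tasks. This is not
Spark's actual scheduler code; every name in it is hypothetical.

  object ShuffleServiceAwareSchedulingSketch {
    // Hosts whose external shuffle service has been observed as unreachable,
    // e.g. because a fetch failure was traced back to a refused connection.
    private val unreachableShuffleHosts = scala.collection.mutable.Set.empty[String]

    def markShuffleServiceDown(host: String): Unit =
      unreachableShuffleHosts += host

    // Only offer tasks to hosts whose shuffle service is believed healthy;
    // otherwise map output written there could never be served to reducers.
    def canScheduleOn(host: String): Boolean =
      !unreachableShuffleHosts.contains(host)

    def main(args: Array[String]): Unit = {
      val offers = Seq("node1", "node2", "node3")
      markShuffleServiceDown("node2")            // the NM on node2 went down
      val usable = offers.filter(canScheduleOn)  // node2 is skipped
      println(usable.mkString(", "))             // prints: node1, node3
    }
  }

In an actual fix, the marking step would presumably be driven by the
fetch-failure handling path rather than a manual call.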



