[ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-13669.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

> Job will always fail in the external shuffle service unavailable situation
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13669
>                 URL: https://issues.apache.org/jira/browse/SPARK-13669
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>             Fix For: 2.3.0
>
>
> Currently we are running into an issue with Yarn work preserving enabled + 
> external shuffle service. 
> In the work preserving enabled scenario, the failure of NM will not lead to 
> the exit of executors, so executors can still accept and run the tasks. The 
> problem here is when NM is failed, external shuffle service is actually 
> inaccessible, so reduce tasks will always complain about the “Fetch failure”, 
> and the failure of reduce stage will make the parent stage (map stage) rerun. 
> The tricky thing here is Spark scheduler is not aware of the unavailability 
> of external shuffle service, and will reschedule the map tasks on the 
> executor where NM is failed, and again reduce stage will be failed with 
> “Fetch failure”, and after 4 retries, the job is failed.
> So here the main problem is that we should avoid assigning tasks to those bad 
> executors (where shuffle service is unavailable). Current Spark's blacklist 
> mechanism could blacklist executors/nodes by failure tasks, but it doesn't 
> handle this specific fetch failure scenario. So here propose to improve the 
> current application blacklist mechanism to handle fetch failure issue 
> (especially with external shuffle service unavailable issue), to blacklist 
> the executors/nodes where shuffle fetch is unavailable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to