[ 
https://issues.apache.org/jira/browse/SPARK-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15865:
------------------------------------

    Assignee: Apache Spark  (was: Imran Rashid)

> Blacklist should not result in job hanging with less than 4 executors
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15865
>                 URL: https://issues.apache.org/jira/browse/SPARK-15865
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.0.0
>            Reporter: Imran Rashid
>            Assignee: Apache Spark
>
> Currently when you turn on blacklisting with 
> {{spark.scheduler.executorTaskBlacklistTime}} but have fewer than 
> {{spark.task.maxFailures}} executors, you can end up with a job "hung" 
> after some task failures.
> If some task fails regularly (say, due to an error in user code), the task 
> will be blacklisted on the given executor.  It will then be tried on another 
> executor, and fail there as well.  However, once it has been tried on every 
> available executor, the scheduler simply stops trying to schedule the task 
> anywhere.  The job doesn't fail, nor does it succeed -- it simply waits.  
> Eventually, when the blacklist entries expire, the task will be scheduled 
> again.  But that can be quite far in the future, and in the meantime the 
> user just observes a stuck job.
> Instead we should abort the stage (and fail any dependent jobs) as soon as we 
> detect tasks that cannot be scheduled.
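
The proposed check can be sketched as follows. This is a minimal, self-contained illustration, not Spark's actual scheduler code: the object and method names here are hypothetical, and it assumes the scheduler tracks, per task, the set of executors on which that task is blacklisted.

```scala
// Hypothetical sketch (not Spark's real scheduler API): a task set should
// be aborted as soon as some task is blacklisted on every live executor,
// rather than waiting silently for the blacklist to expire.
object BlacklistCheck {
  /** Returns true if the task's blacklist covers all live executors,
    * i.e. the task can no longer be scheduled anywhere. */
  def isCompletelyBlacklisted(
      liveExecutors: Set[String],
      taskBlacklist: Set[String]): Boolean =
    liveExecutors.nonEmpty && liveExecutors.subsetOf(taskBlacklist)

  def main(args: Array[String]): Unit = {
    val executors = Set("exec-1", "exec-2", "exec-3")
    // Task has failed (and been blacklisted) on every executor:
    println(isCompletelyBlacklisted(executors, executors))     // true  -> abort the stage
    // One executor can still accept the task:
    println(isCompletelyBlacklisted(executors, Set("exec-1"))) // false -> keep scheduling
  }
}
```

On a `true` result the scheduler would abort the stage (and fail dependent jobs) immediately, surfacing the repeated task failure to the user instead of an indefinitely stuck job.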



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
