Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22288#discussion_r227067905
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -415,9 +420,55 @@ private[spark] class TaskSchedulerImpl(
                 launchedAnyTask |= launchedTaskAtCurrentMaxLocality
               } while (launchedTaskAtCurrentMaxLocality)
             }
    +
             if (!launchedAnyTask) {
    -          taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
     +          taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors) match {
     +            case Some(taskIndex) => // Returns the taskIndex which was unschedulable
     +
     +              // If the taskSet is unschedulable we try to find an existing idle blacklisted
     +              // executor. If we cannot find one, we abort immediately. Else we kill the idle
    --- End diff ---
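
    For reference, a minimal, self-contained sketch of the control flow the new
    comment above describes (kill an idle blacklisted executor if one exists,
    otherwise abort right away). The names here (idleBlacklistedExecutors,
    killExecutor, abortTaskSet) are hypothetical stand-ins, since the actual PR
    code is cut off in the excerpt:

        object UnschedulableTaskSetSketch {
          // Sketch only: decide what to do once a task set turns out to be
          // completely blacklisted on every currently available executor.
          def handleCompletelyBlacklistedTask(
              unschedulableTaskIndex: Option[Int],
              idleBlacklistedExecutors: Seq[String],
              killExecutor: String => Unit,
              abortTaskSet: Int => Unit): Unit = {
            unschedulableTaskIndex match {
              case Some(taskIndex) =>
                idleBlacklistedExecutors.headOption match {
                  case Some(execId) =>
                    // Kill an idle blacklisted executor so the cluster manager
                    // can hopefully replace it with a fresh one.
                    killExecutor(execId)
                  case None =>
                    // Nothing idle to free up, so abort immediately.
                    abortTaskSet(taskIndex)
                }
              case None =>
                () // Nothing completely blacklisted; normal scheduling continues.
            }
          }
        }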
    
    I'm a little worried that the idle condition will be too strict in some
    scenarios: if there is a large backlog of tasks from another taskset, or if,
    whatever the error is, the tasks take a while to fail (e.g., you've really
    got a bad executor, but it's not apparent until after network timeouts or
    something). That could happen if you're doing a big join and, while
    preparing the input on the map side, one side just has one straggler left
    but the other side still has a big backlog of tasks. Or in a jobserver-style
    situation, where there are always other tasksets coming in.
    
    That said, I don't have any better ideas at the moment, and I still think
    this is an improvement.


---
