Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18739#discussion_r129849052
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -665,10 +667,15 @@ private[spark] class TaskSetManager(
                 }
               }
               if (blacklistedEverywhere) {
    -            val partition = tasks(indexInTaskSet).partitionId
    -            abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
    -              s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
    -              s"can be configured via spark.blacklist.*.")
    +            val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
    +            val mayAllocateNewExecutor =
    +              conf.getInt("spark.executor.instances", -1) > currentExecutorNumber
    +            if (!dynamicAllocationEnabled && !mayAllocateNewExecutor) {
    --- End diff ---
    
    the reason we do wait until the task set has finished is that before then, we have no idea whether the failure is the fault of the user code (or bad input data, etc.) or actually a fault with the node / executor. Our only piece of information on that is when a task fails on one executor and then succeeds elsewhere; in that case we assume the failure was the fault of the original executor (though this heuristic also has false positives, from what I've seen so far it seems tolerable).
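
    To make that heuristic concrete, here is a minimal standalone sketch; the names (`BlameHeuristic`, `executorsToBlame`) are made up for illustration and this is not the actual TaskSetBlacklist code:

    ```scala
    // Hypothetical sketch of the "failed here, succeeded elsewhere" heuristic.
    object BlameHeuristic {
      // For each task index, the set of executors it has failed on.
      type FailureLog = Map[Int, Set[String]]

      /** Blame an executor only once the same task has succeeded somewhere
       *  else; until then the failure could just as well be bad user code. */
      def executorsToBlame(
          failures: FailureLog,
          succeededOn: Map[Int, String]): Set[String] = {
        failures.flatMap { case (taskIndex, failedExecs) =>
          succeededOn.get(taskIndex) match {
            case Some(goodExec) => failedExecs - goodExec // succeeded elsewhere
            case None => Set.empty[String] // no signal yet, blame no one
          }
        }.toSet
      }

      def main(args: Array[String]): Unit = {
        val failures = Map(3 -> Set("exec-1"), 7 -> Set("exec-2"))
        val successes = Map(3 -> "exec-4") // task 3 later succeeded on exec-4
        // Prints Set(exec-1): task 7 never succeeded anywhere, so we cannot
        // yet tell executor fault from user-code fault.
        println(executorsToBlame(failures, successes))
      }
    }
    ```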
    
    I have also thought of having this wait some amount of time rather than killing the taskset immediately, to see if another executor comes up. However, there are some complications with that as well. I think this is all captured in the discussion on SPARK-15815, which actually discusses one of the trickiest cases -- just one task remaining with dynamic allocation, and all other executors have been killed because they were idle. Take a look at that jira. If it summarizes things, then we can close SPARK-21539 as a duplicate and continue the discussion on SPARK-15815.
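
    For illustration, a rough sketch of what that delayed abort could look like; the `AbortTimer` class and its timeout parameter are hypothetical, not a concrete proposal for TaskSetManager:

    ```scala
    import java.util.concurrent.{Executors, TimeUnit}

    // Hypothetical helper: instead of aborting as soon as a task is
    // blacklisted everywhere, wait a while in case dynamic allocation
    // brings up a fresh executor.
    class AbortTimer(timeoutSec: Long) {
      private val scheduler = Executors.newSingleThreadScheduledExecutor()

      def scheduleAbort(stillUnschedulable: () => Boolean)(abort: () => Unit): Unit = {
        scheduler.schedule(new Runnable {
          def run(): Unit = {
            // Re-check before aborting: a new executor may have
            // registered while we were waiting.
            if (stillUnschedulable()) abort()
          }
        }, timeoutSec, TimeUnit.SECONDS)
      }
    }
    ```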

