GitHub user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20640#discussion_r179012891
  
    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
    @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
           cpus + totalCoresAcquired <= maxCores &&
           mem <= offerMem &&
           numExecutors < executorLimit &&
    -      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
    +      !scheduler.nodeBlacklist().contains(offerHostname) &&
    --- End diff --
    
    I just want to make really sure everybody understands the big change in behavior here -- `nodeBlacklist()` currently gets updated *only* on failures in *Spark* tasks. If a Mesos task fails to even start -- that is, if a Spark executor fails to launch on a node -- `nodeBlacklist` does not get updated. So after this change, a node that is somehow misconfigured could keep receiving executor launch attempts, with the executor failing to start every time. That happens even with blacklisting turned on.
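
    To make the concern concrete, here is a minimal sketch of an offer predicate that keeps both guards rather than swapping one for the other. The names (`slaves`, `slaveId`, `offerHostname`, `scheduler`, `MAX_SLAVE_FAILURES`) are the ones already visible in this method; combining them like this is only an illustration of the gap, not a proposal for the final design:

    // Sketch only: retain the per-slave launch-failure guard *and* the
    // scheduler-level blacklist check, so a node whose executors never
    // manage to start is still skipped. All names are in scope in this method.
    val underLaunchFailureLimit =
      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES
    val notBlacklisted =
      !scheduler.nodeBlacklist().contains(offerHostname)
    // ... && underLaunchFailureLimit && notBlacklisted && ...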
    
    This is SPARK-16630 for the non-Mesos case. That is being actively worked on now -- however, the work there will probably have to be YARN-specific, so there will still be follow-up work to get the same behavior for Mesos after that is in.

