Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/8760#issuecomment-161100737
  
    @kayousterhout this is pretty important for users running larger clusters, 
eg. a few hundred nodes.  We've seen cases where there is some weird 
misconfiguration on a node, which leads to all tasks failing on that node.  The 
job will eventually succeed if you use the existing blacklist behavior, but 
with even one bad node you get so many task failure messages it's very hard to 
make any sense of what is going on -- apps with thousands of tasks per stage 
and tens of stages generate tons of task failures.  This leads to users 
complaining that things are broken, when things are actually "ok" in the sense 
that the jobs eventually complete successfully, and also to users being unable 
to find the "real" failures hidden in all the bogus messages.
    
    Cluster managers don't help here -- the problem is when there is a gap 
between the manager's notion of a "healthy" node and what the particular spark 
application needs (eg., one particular library fails to load, or a node has 
bad memory, or something like that).
    
    There is also an efficiency advantage to blacklisting the entire executor, 
rather than just letting tasks fail on your bad executors.  In general the 
tasks will fail quickly and just get rescheduled (I've only actually seen one 
case where the tasks did not fail almost immediately), so maybe this is a more 
minor point, but it seems like a nice thing to get right.  But beyond just the 
initial task scheduling, this PR adds the ability for the cluster manager to 
avoid giving out more executors on the bad node.  Plus this opens up the 
ability for spark to actively request a new executor -- eg., if you request 5 
executors and one is bad, rather than running with 80% of your requested 
resources, you should just throw away that executor and get another one (if 
the cluster has capacity).  Though that is not directly included in this patch 
-- in earlier discussion @mwws thought it best to push that off to follow-up 
work.
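
    To make this concrete, here is a minimal, self-contained sketch (not the 
code in this PR -- all of the names here are made up for illustration) of 
skipping resource offers from blacklisted hosts, and of identifying executors 
stuck on a bad host that could be released and replaced:

```scala
// Hypothetical, standalone sketch of node-level blacklisting at offer time.
case class WorkerOffer(executorId: String, host: String, cores: Int)

class NodeBlacklistSketch {
  private val blacklistedHosts = scala.collection.mutable.Set[String]()
  // executorId -> host, for all currently registered executors
  private val liveExecutors = scala.collection.mutable.Map[String, String]()

  def registerExecutor(execId: String, host: String): Unit =
    liveExecutors(execId) = host

  def blacklistHost(host: String): Unit =
    blacklistedHosts += host

  // Only consider offers from hosts that are not blacklisted.
  def usableOffers(offers: Seq[WorkerOffer]): Seq[WorkerOffer] =
    offers.filterNot(o => blacklistedHosts.contains(o.host))

  // Executors sitting on blacklisted hosts should be released and replaced,
  // rather than leaving the app running below its requested capacity.
  def executorsToReplace(): Seq[String] =
    liveExecutors.collect {
      case (id, host) if blacklistedHosts.contains(host) => id
    }.toSeq
}
```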
    
    I also like the idea of minimizing user configuration, but I also recognize 
that we are working from a position of some ignorance about the failure modes 
people encounter and how they want them fixed.  Eg., I imagine that everyone 
would want to use the `ExecutorAndNode` strategy and not change any other 
configuration, but perhaps some users want the old behavior and more knobs.  
That's also why I'm reluctant to change the default behavior (though I'm not 
drawing a hard line either).  After providing some options we could try to 
collect some more feedback on what works.
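
    As a rough illustration of the configuration space being discussed (this 
is not the code or the property names from this PR), the strategy knob could 
look like one enumerated setting, with `ExecutorAndNode` as the aggressive 
option and the old task-level behavior still available for people who want it:

```scala
// Illustrative only: one sealed set of strategies, parsed from a config value.
sealed trait BlacklistStrategy
case object TaskOnly extends BlacklistStrategy         // old behavior: per task / executor pair
case object ExecutorOnly extends BlacklistStrategy     // blacklist the whole executor
case object ExecutorAndNode extends BlacklistStrategy  // blacklist the executor and its node

object BlacklistStrategy {
  // The actual property name and default are exactly what is still open here;
  // this parsing is just a placeholder.
  def fromConf(value: String): BlacklistStrategy = value.toLowerCase match {
    case "taskonly"        => TaskOnly
    case "executoronly"    => ExecutorOnly
    case "executorandnode" => ExecutorAndNode
    case other =>
      throw new IllegalArgumentException(s"Unknown blacklist strategy: $other")
  }
}
```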
    
    You've got a good point about only blacklisting executors across jobs if 
the tasks succeed elsewhere.  I was initially only thinking about a linear 
spark application -- in that case, if some tasks in a job fail everywhere, then 
the job will fail and the spark app is done in any case.  But of course apps 
may try to recover from failed jobs, or even have completely independent jobs, 
in a "job server" style deployment.
    
    Another point you made on another PR is that we have to ensure that we 
don't prevent any progress from being made if all executors get blacklisted 
and there is nowhere for any tasks to run.  We should make sure we have some 
sensible behavior there, eg., fail the job.
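
    One way to express that safety check (again, just an illustrative sketch, 
not this PR's implementation): if every live executor is blacklisted for a 
pending task set, abort it with a clear error instead of hanging forever:

```scala
// Illustrative sketch: abort rather than hang when nothing can be scheduled.
def checkNotCompletelyBlacklisted(
    liveExecutors: Set[String],
    blacklistedExecutors: Set[String],
    abortTaskSet: String => Unit): Unit = {
  if (liveExecutors.nonEmpty && liveExecutors.subsetOf(blacklistedExecutors)) {
    abortTaskSet(
      "All live executors are blacklisted for this task set; failing the job " +
        "rather than waiting forever for a schedulable executor.")
  }
}
```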

