[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950917#comment-15950917 ]

Thomas Graves commented on SPARK-20178:
---------------------------------------

Overall what I would like to accomplish is not throwing away work and making 
the failure case very performant. More and more people are running Spark on 
larger clusters, which means failures are going to occur more often, and we 
need to handle those failures as quickly as possible.  We need to be careful 
here and make sure we handle three cases: the node is totally down, the 
nodemanager is totally down, and the nodemanager or node is just having 
intermittent issues.  Generally I see the last case, where the issue is just 
intermittent, but some people recently have hit more of the 
nodemanager-totally-down case, in which case you want to fail all maps on that 
node quickly.  The decision on what to rerun is hard: it could be very costly 
to rerun more than necessary, but at the same time it could be very costly to 
not rerun everything immediately, because you can fail all 4 stage attempts.  
This really depends on how long the maps and reduces run.  There is a lot of 
discussion on https://github.com/apache/spark/pull/17088 related to that. 

- We should not kill the reduce tasks on a fetch failure.  Leave the reduce 
tasks running, since they could have already done useful work, like fetching X 
number of map outputs.  A reducer can simply fail that one map output, which 
would cause the map to be rerun, and only that specific map output would need 
to be refetched.  This does require checking that there are enough resources 
to rerun the map and, if not, possibly killing a reducer or acquiring more 
resources with dynamic allocation.
- Improve the logic around deciding which node is actually bad when you get a 
fetch failure.  Was it really the node the reduce was on, or the node the map 
was on?  You can do something here like tracking the percentage of reducers 
that failed to fetch from a given map output node.
- We should only rerun the maps that failed (or have better logic around how 
to make this decision); other map outputs could have already been fetched 
(with bullet one), so there is no need to rerun a map if all reducers already 
fetched its output.  Since the reduce tasks keep running, other fetch failures 
can happen in parallel and would just cause other maps to be rerun.  At some 
point, based on bullet 2 above, we can decide the entire node is bad.
- Improve the blacklisting based on the above improvements
- Make sure to think about how this plays into the maximum number of stage 
attempt failures (4, not settable).
- Try not to waste resources.  For example, right now we can have 2 copies of 
the same reduce task running, which uses twice the resources, and there are a 
bunch of different conditions that determine whether this duplicate work is 
actually useful.
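The bullet-2 heuristic could be sketched roughly as follows. This is a hypothetical illustration, not actual Spark scheduler code; the class, threshold, and method names are all made up:

```python
from collections import defaultdict

class FetchFailureTracker:
    """Hypothetical sketch: declare a map-output host bad once a large
    enough fraction of distinct reducers report fetch failures against it."""

    def __init__(self, num_reducers, bad_host_fraction=0.5):
        self.num_reducers = num_reducers
        self.bad_host_fraction = bad_host_fraction
        # host -> set of reducer ids that reported a fetch failure from it
        self.failures = defaultdict(set)

    def record_failure(self, map_host, reducer_id):
        self.failures[map_host].add(reducer_id)

    def is_host_bad(self, map_host):
        # A single failure could be the reducer's fault; many distinct
        # reducers failing against one host points at the host itself.
        frac = len(self.failures[map_host]) / self.num_reducers
        return frac >= self.bad_host_fraction
```

A tracker like this would distinguish the intermittent case (one or two reducers fail, only those maps are rerun) from the nodemanager-down case (most reducers fail, so all maps on the host are failed at once).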

Question:
- Should we consider having a reducer fetch all map outputs from a host at 
once (rather than per executor)?  This could improve fetch times (we would 
have to test) as well as fetch failure handling.  It could also cause more 
maps to be failed at once, which is somewhat contradictory to bullet 3 above; 
we need to think about this more.
- Do we need a pluggable interface, or how do we avoid destabilizing the 
current scheduler?
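The per-host-fetch question above is essentially a grouping step. A minimal sketch (the tuple shape is made up for illustration) of merging per-executor fetches into one request per host:

```python
from collections import defaultdict

def group_fetches_by_host(map_outputs):
    """map_outputs: list of (host, executor_id, map_id) tuples.

    Returns one fetch request per host instead of one per executor,
    so a single connection can pull every map output on that host."""
    by_host = defaultdict(list)
    for host, executor_id, map_id in map_outputs:
        by_host[host].append((executor_id, map_id))
    return dict(by_host)
```

The trade-off noted above shows up directly here: one failed host-level request now implicates every map output in its group, not just the ones served by one executor.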

Bonus or future:
- Make the decision on when and how many maps to rerun a cost-based estimate.  
If maps only take a few seconds to run, we could rerun all maps on the host 
immediately.
- Add an option to pre-start reduce tasks so that they can start fetching 
while the last few maps are finishing (if you have long-tail maps).
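The cost-based rerun decision could look roughly like this. The threshold and function names are invented for illustration; a real implementation would weigh runtimes against the remaining stage-attempt budget:

```python
def maps_to_rerun(failed_map_ids, host_map_ids, avg_map_runtime_s,
                  cheap_map_threshold_s=10.0):
    """Hypothetical cost-based policy: if maps are cheap to rerun,
    speculatively rerun every map on the suspect host at once;
    otherwise rerun only the maps that actually reported failures."""
    if avg_map_runtime_s <= cheap_map_threshold_s:
        return set(host_map_ids)
    return set(failed_map_ids)
```

With cheap maps this fails everything on the host in one shot (avoiding repeated fetch-failure round trips); with expensive maps it stays conservative to avoid throwing away work.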

> Improve Scheduler fetch failures
> --------------------------------
>
>                 Key: SPARK-20178
>                 URL: https://issues.apache.org/jira/browse/SPARK-20178
>             Project: Spark
>          Issue Type: Epic
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 jira currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
