[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019650#comment-16019650 ]

Thomas Graves commented on SPARK-20178:
---------------------------------------

| when the DAGScheduler is notified of a FetchFailure from a node then mark 
that shuffle's output locations on that node as unavailable (rather than all 
shuffles' outputs on that node)

Can you please clarify this?  Do you mean only the map stage output locations 
for that particular shuffle id that got the fetch failure?  
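To make the distinction concrete, here is a minimal sketch of what per-shuffle invalidation could look like. This is illustrative only, not Spark's actual MapOutputTracker API; the class and method names are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical registry of map output locations, keyed by shuffle id.
// On a FetchFailure it drops only the failing shuffle's outputs on that
// host, rather than every shuffle's outputs on the host.
class ShuffleOutputRegistry {
  // shuffleId -> (mapId -> host)
  private val outputs = mutable.Map.empty[Int, mutable.Map[Int, String]]

  def register(shuffleId: Int, mapId: Int, host: String): Unit =
    outputs.getOrElseUpdate(shuffleId, mutable.Map.empty)(mapId) = host

  // Invalidate only this shuffle's outputs on the failing host; other
  // shuffles' outputs on the same host stay registered.
  def onFetchFailure(shuffleId: Int, host: String): Unit =
    outputs.get(shuffleId).foreach { m =>
      val stale = m.collect { case (mapId, h) if h == host => mapId }
      m --= stale
    }

  def locations(shuffleId: Int): Map[Int, String] =
    outputs.get(shuffleId).map(_.toMap).getOrElse(Map.empty)
}
```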

| the likelihood of the FetchFailure being a transient failure is relatively 
small

Unfortunately, this is not what we have seen in the past with MapReduce and Tez. 
Both of those used to be very aggressive about invalidating outputs, and we had 
to tune that back because we do see a lot of transient failures; the aggressive 
invalidation caused a lot of extra work and delays.  But I do agree that this 
one is a lot of work and will take some time to implement, so based on the way 
things work now we should do something shorter term as well.  I was actually 
just looking at this more last Friday because I am seeing more of these fetch 
failures cause job failures, so something needs to be done.  Given the way we 
handle fetch failures now, invalidating everything for that stage and taking 
longer is better than failing the job.

I was thinking we could invalidate all the map outputs on that host for that 
map stage after a certain number of failures, which could be counted across 
attempts.  I think this is basically what you are proposing, except waiting for 
a configurable number of failures rather than doing it immediately.  Thoughts?
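A rough sketch of that counting, assuming a hypothetical counter keyed by (host, shuffleId) that accumulates across stage attempts; the names and the threshold parameter are illustrative, not an existing Spark config:

```scala
import scala.collection.mutable

// Hypothetical counter: only report that a host's map outputs for a shuffle
// should be invalidated once a configurable number of fetch failures has
// accumulated, rather than on the first failure.
class FetchFailureCounter(maxFailures: Int) {
  // (host, shuffleId) -> failure count, accumulated across stage attempts
  private val failures = mutable.Map.empty[(String, Int), Int].withDefaultValue(0)

  // Returns true when the threshold is reached and invalidation should happen.
  def recordFailure(host: String, shuffleId: Int): Boolean = {
    val key = (host, shuffleId)
    failures(key) = failures(key) + 1
    failures(key) >= maxFailures
  }
}
```

A transient failure then costs one retry instead of a full host-wide invalidation, which matches the MapReduce/Tez tuning experience above.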
 
I also think adding blacklisting after a certain number of fetch failures would 
be good for those cases where the YARN NM crashes but there could still be 
executors running on the node.  That one isn't as difficult; you just have to 
track it in the BlacklistTracker.  Work started in SPARK-13669; it would just 
need tracking of multiple failures again.
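In the spirit of the SPARK-13669 work, the tracking could look something like the sketch below. The class and method names are hypothetical stand-ins, not the actual BlacklistTracker API:

```scala
import scala.collection.mutable

// Hypothetical per-node fetch-failure tracking: once a node accumulates
// enough fetch failures (e.g. because its NM is down while executors on it
// keep running), blacklist it so no new tasks are scheduled there.
class FetchFailureBlacklist(maxFetchFailuresPerNode: Int) {
  private val fetchFailures = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val blacklisted = mutable.Set.empty[String]

  def onFetchFailure(host: String): Unit = {
    fetchFailures(host) += 1
    if (fetchFailures(host) >= maxFetchFailuresPerNode) blacklisted += host
  }

  def isBlacklisted(host: String): Boolean = blacklisted.contains(host)
}
```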



> Improve Scheduler fetch failures
> --------------------------------
>
>                 Key: SPARK-20178
>                 URL: https://issues.apache.org/jira/browse/SPARK-20178
>             Project: Spark
>          Issue Type: Epic
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 JIRAs currently related to this:
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753.
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> I will put my initial thoughts in a follow-on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
