[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019650#comment-16019650 ]
Thomas Graves commented on SPARK-20178:
---------------------------------------

| when the DAGScheduler is notified of a FetchFailure from a node then mark that shuffle's output locations on that node as unavailable (rather than all shuffles' outputs on that node)

Can you please clarify this? Do you mean only the map stage output locations for that particular shuffle id that got the fetch failure?

| the likelihood of the FetchFailure being a transient failure is relatively small

Unfortunately, this is not what we have seen in the past with MapReduce and Tez. Both of those used to be very aggressive about invalidating outputs, and we had to tune that back because we do see a lot of transient failures, which caused a lot of extra work and delays.

That said, I agree this one is a lot of work and will take some time to implement, so based on the way things work now we should do something shorter term as well. I was actually looking at this more last Friday, because I am seeing more of these fetch failures cause job failures, so something needs to be done. Given the way we handle fetch failures now, invalidating everything for that stage and having it take longer is better than a job failure.

I was thinking we could invalidate all the map outputs on that host for that map stage after a certain number of failures, which could be accumulated across attempts. I think this is basically what you are proposing, except waiting for a configurable number of failures rather than doing it immediately. Thoughts?

I also think adding blacklisting after a certain number of fetch failures would be good for those cases where the YARN NM crashes but there could still be executors running on the node. That one isn't as difficult; you just have to track it in the BlacklistTracker. Work started in SPARK-13669; it would just need tracking of multiple failures again.
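The threshold idea above (count fetch failures per host, possibly across stage attempts, and act only once a configurable limit is hit) could be sketched roughly as below. This is a hypothetical illustration, not actual Spark code: the class name `FetchFailureTracker`, the `(host, shuffleId)` key, and the `maxFailuresPerHost` parameter are all assumptions made for the example. The same counter could drive either invalidating that host's map outputs for the failing shuffle or feeding the BlacklistTracker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-host fetch-failure counting (not a real Spark API).
class FetchFailureTracker {
    private final int maxFailuresPerHost;               // configurable threshold
    private final Map<String, Integer> counts = new HashMap<>();

    FetchFailureTracker(int maxFailuresPerHost) {
        this.maxFailuresPerHost = maxFailuresPerHost;
    }

    /**
     * Record one FetchFailure for this host and shuffle. Returns true once the
     * threshold is reached, meaning "invalidate this host's outputs for this
     * shuffle" (only that shuffle's outputs, not all shuffles' on the node).
     * Counts accumulate across stage attempts because the key is never reset
     * between attempts.
     */
    boolean recordFailure(String host, int shuffleId) {
        String key = host + "#" + shuffleId;
        int n = counts.merge(key, 1, Integer::sum);     // increment, starting at 1
        return n >= maxFailuresPerHost;
    }
}
```

With a threshold of, say, 3, the first two failures on a host would be treated as transient (only the single failed map output is regenerated), and only the third would trigger wholesale invalidation, matching the "configurable amount of failures rather than doing it immediately" proposal.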
> Improve Scheduler fetch failures
> --------------------------------
>
>                 Key: SPARK-20178
>                 URL: https://issues.apache.org/jira/browse/SPARK-20178
>             Project: Spark
>          Issue Type: Epic
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of
> fetch failures. There are four JIRAs currently related to this.
> We should try to get a list of things we want to improve and come up with one
> cohesive design.
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753
> I will put my initial thoughts in a follow-on comment.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)