Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/17088

> (a) even the existing behavior will make you do unnecessary work for transient failures and (b) this just slightly increases the amount of work that has to be repeated for those transient failures.

Yes, a transient fetch failure always causes some extra work because some map tasks will be re-run, but the question comes down to how much work that is. For your (b), you can't really say it only "slightly increases" the work, because the cost is highly dependent on the timing of when things finish, how long the maps take, and how long the reducers take.

Please correct me if I've missed something in the Spark scheduler related to this:

- The ResultTasks (let's call them reducers), say in stage 1.0, are running. One of them gets a fetch failure. This restarts the ShuffleMapTasks for that executor in stage 0.1. If other reducers fetch-fail while those maps are running, they are taken into account and re-run in the same stage 0.1.
- If a reducer happens to finish successfully while map stage 0.1 is running, it ends up failing with a commit denied (at least with the Hadoop writer; others might not). If instead that reducer is longer lived and finishes after map stage 0.1 completes, it counts as a success. This timing-dependent completion seems very unpredictable (and perhaps a bug in the commit logic, but I would have to look more).
- If your maps take a very long time and your reducers also take a long time, you really don't want to re-run the maps. Of course, once stage 1.1 starts for the reducers that didn't originally fail, if the nodemanager really is down, the new reducer tasks could fail to fetch data the old ones had already received, so they would need to be re-run anyway even if the original reducers were still running and would have succeeded. Sorry if that doesn't make sense; it is much too complicated.

There is also the case where one reducer fails and you are using a static number of executors, so you have one slot available to re-run the maps. If you now fail 3 maps instead of 1, it is potentially going to take 3 times as long to re-run them (see the rough sketch at the end of this comment).

My point here is that I think the outcome is very dependent on the conditions: it could be faster, it could be slower. The question is which case happens more often. Generally we see more intermittent issues with the nodemanager rather than it going down fully. If nodemanagers are going down, is that because they are crashing or because the entire node is going down? If the entire node is going down, are we handling that wrong, or not as well as we could? If it's the nodemanager crashing, I would say you need to look at why, as that shouldn't happen very often.

I'm not sure which approach is better. Given the way Spark schedules, I'm ok with invalidating all the outputs as well, but I think it would be better for us to fix this right, which to me means not throwing away the work of the reducers that are already running in the first stage.
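To make the static-executor point concrete, here is a rough back-of-the-envelope sketch. This is plain Scala, not Spark code; the `replayTime` helper, the slot counts, and the map times are all made-up numbers for illustration. The point is only that the wall-clock cost of a fetch failure scales with how many map outputs get invalidated when few slots are free to replay them.

```scala
// Hypothetical cost model, not Spark code: estimates how long it takes to
// replay invalidated map outputs given a fixed number of free executor slots.
object FetchFailureCost {

  /** Wall-clock time to replay `failedMaps` map tasks with `freeSlots` slots,
    * assuming each map takes roughly `avgMapTimeSec` seconds. */
  def replayTime(failedMaps: Int, freeSlots: Int, avgMapTimeSec: Double): Double =
    math.ceil(failedMaps.toDouble / freeSlots) * avgMapTimeSec

  def main(args: Array[String]): Unit = {
    // Static allocation with one free slot, maps take ~10 minutes each (assumed numbers).
    val oneExecutorInvalidated = replayTime(failedMaps = 1, freeSlots = 1, avgMapTimeSec = 600)
    val wholeHostInvalidated   = replayTime(failedMaps = 3, freeSlots = 1, avgMapTimeSec = 600)

    println(s"Replay 1 map output:  ${oneExecutorInvalidated / 60} min")  // ~10 min
    println(s"Replay 3 map outputs: ${wholeHostInvalidated / 60} min")    // ~30 min
  }
}
```

With plenty of free slots the two cases replay in roughly the same wall-clock time, which is why I think the answer depends so much on cluster conditions.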