Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    > (a) even the existing behavior will make you do unnecessary work for 
transient failures and (b) this just slightly increases the amount of work that 
has to be repeated for those transient failures.
    
    Yes, a transient fetch failure always causes some extra work because you are going to re-run some map tasks, but the question comes down to how much extra work that is. For your (b), you can't really say it only "slightly increases" the work, because it's going to be highly dependent upon the timing of when things finish, how long the maps take, and how long the reducers take.
    
    Please correct me if I've missed something in the Spark scheduler related to this.
    - The ResultTasks (let's call them reducers), say in stage 1.0, are running. One of them gets a fetch failure. This restarts the ShuffleMapTasks for that executor in stage 0.1. If other reducers hit fetch failures while those maps are running, those failures get taken into account and the corresponding maps are re-run in the same stage 0.1. If a reducer happens to finish successfully while map stage 0.1 is running, it ends up failing with a commit denied (at least with the Hadoop writer; others might not). If, however, that reducer is longer-lived and finishes successfully after map stage 0.1 has completed, it counts as a success (the sketch after this list makes the timing concrete).
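
    To make the timing dependence concrete, here is a toy Scala sketch of the scenario above. This is not Spark's scheduler or commit-coordinator code; the object, the retry window, and the times are all made up purely to show that the same reducer ends in "commit denied" or "success" depending only on when it finishes relative to map stage 0.1:

        // Toy model of the timing described above -- NOT Spark's scheduler code.
        // Assumption: a reducer that tries to commit while map stage 0.1 is still
        // re-running gets its commit denied; one that commits after 0.1 finishes
        // is accepted. Times are made-up "minutes since stage 1.0 started".
        object FetchFailureTimingSketch {
          final case class Reducer(id: Int, finishTime: Int)

          val mapRetryStart = 10   // hypothetical: map stage 0.1 begins
          val mapRetryEnd   = 40   // hypothetical: map stage 0.1 finishes

          def outcome(r: Reducer): String =
            if (r.finishTime >= mapRetryStart && r.finishTime < mapRetryEnd)
              s"reducer ${r.id} (finished at ${r.finishTime}): commit denied, re-run in stage 1.1"
            else
              s"reducer ${r.id} (finished at ${r.finishTime}): commit accepted, success"

          def main(args: Array[String]): Unit = {
            // Two reducers doing identical work; only their finish times differ.
            Seq(Reducer(1, 25), Reducer(2, 55)).map(outcome).foreach(println)
          }
        }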
    
    This timing-dependent completion seems very unpredictable (and perhaps a bug in the commit logic, but I would have to look more closely). If your maps take a very long time and your reducers also take a long time, then you don't really want to re-run the maps. Of course, once stage 1.1 starts for the reducers that didn't originally fail, if the NodeManager really is down, the new reducer task could fail to fetch data that the old ones had already received, so it would need to be re-run anyway even if the original reducer was still running and would have succeeded.
    
    Sorry if that doesn't make sense; it's all much too complicated. There is also the case where one reducer fails and you are using a static number of executors, so you have 1 slot to re-run the maps. If you now fail 3 maps instead of 1, it's potentially going to take 3 times as long to re-run those maps (a quick back-of-the-envelope sketch follows below). My point here is that I think it's very dependent upon the conditions: it could be faster, it could be slower. The question is which happens more often. Generally we see more intermittent issues with the NodeManager rather than it going down fully. If they are going down, is it because the NodeManager is crashing or because the entire node is going down? If the entire node is going down, are we handling that wrong, or not as well as we could? If it's the NodeManager crashing, I would say you need to look and see why, as that shouldn't happen very often.
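
    As a quick back-of-the-envelope illustration of that 1-slot case (the per-map runtime is a made-up number and scheduling overhead is ignored):

        // Serial re-run cost with one free executor slot; numbers are assumptions.
        val mapTaskMinutes = 5
        val rerunOneMap    = 1 * mapTaskMinutes   // re-run only the failed map
        val rerunThreeMaps = 3 * mapTaskMinutes   // invalidate all three, re-run serially
        println(s"re-run 1 map: $rerunOneMap min; re-run 3 maps: $rerunThreeMaps min")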
    
    I'm not sure which one would be better. Due to the way Spark schedules, I'm ok with invalidating all as well, but I think it would be better for us to fix this right, which to me means not throwing away the work of the reducers already running in the first stage attempt.


