agrawaldevesh commented on pull request #29422:
URL: https://github.com/apache/spark/pull/29422#issuecomment-674294907


   > Thank you for taking the time to resolve this and make such a clear writeup of the root cause. One in-production (rather than in-test) question: if the executor exits, we also want to eagerly clean up everything and resubmit, right?
   
   Yes, for sure. That will happen on its own; I haven't changed that behavior. I have only changed how fetch failures stemming from a decommissioned host are handled. The way they lead to a rerun is that `org.apache.spark.scheduler.DAGScheduler#resubmitFailedStages` is invoked asynchronously on a fetch failure. The driver then figures out which stages are missing map outputs and reruns them in topological order, as sketched below.
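   To make that resubmission flow concrete, here is a minimal, self-contained sketch. It is not Spark's actual `DAGScheduler` internals: the `Stage` model, the `outputsAvailable` flag, and the body of `resubmitFailedStages` below are simplified stand-ins for illustration only. The idea is just that ancestors missing map outputs are collected and rerun parent-first, then the stage that saw the fetch failure.

```scala
// Hypothetical sketch of fetch-failure-driven resubmission (not Spark code).
object ResubmitSketch {
  // Simplified stage model: each stage knows its parent stage ids and
  // whether its map outputs are still available.
  case class Stage(id: Int, parents: Seq[Int], outputsAvailable: Boolean)

  // Returns the stage ids to rerun, in topological (parent-first) order.
  def resubmitFailedStages(stages: Map[Int, Stage], failedStageId: Int): Seq[Int] = {
    // Walk up from the failed stage collecting every ancestor whose map
    // outputs are missing; those must finish before their children rerun.
    def missingAncestors(id: Int): Seq[Int] = {
      val missingParents = stages(id).parents.filterNot(stages(_).outputsAvailable)
      missingParents.flatMap(missingAncestors) ++ missingParents
    }
    (missingAncestors(failedStageId) :+ failedStageId).distinct
  }

  def main(args: Array[String]): Unit = {
    // Stage 2 depends on stages 0 and 1. Suppose stage 1's outputs were on
    // the decommissioned host: a fetch failure in stage 2 reruns 1, then 2.
    val stages = Map(
      0 -> Stage(0, Nil, outputsAvailable = true),
      1 -> Stage(1, Nil, outputsAvailable = false),
      2 -> Stage(2, Seq(0, 1), outputsAvailable = false)
    )
    println(resubmitFailedStages(stages, 2)) // List(1, 2)
  }
}
```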
   
   When an executor exits, it will normally clean up just its own shuffle data (it does not know that its peer executors on the same host will soon be dying as well). It is the incrementing of `shuffleFileLostEpoch` as part of this cleanup that prevents redundant cleanups when a fetch failure is later observed, as in the sketch below.
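   Here is a minimal sketch of that epoch-guard idea, again with simplified stand-ins: the map and method names below are illustrative and mirror only the role of `shuffleFileLostEpoch`, not Spark's real data structures. Cleanup records the epoch at which an executor's shuffle files were removed, and a later cleanup request carrying an older-or-equal epoch is skipped.

```scala
// Hypothetical sketch of an epoch guard against duplicate shuffle cleanup.
object EpochGuardSketch {
  import scala.collection.mutable

  // Epoch at which each executor's shuffle files were last removed
  // (stand-in for the role played by shuffleFileLostEpoch).
  private val shuffleFileLostEpoch = mutable.Map.empty[String, Long]
  private var currentEpoch = 0L

  def nextEpoch(): Long = { currentEpoch += 1; currentEpoch }

  // Returns true if cleanup actually ran, false if the guard skipped it.
  def removeShuffleFiles(execId: String, epoch: Long): Boolean = {
    if (shuffleFileLostEpoch.get(execId).exists(_ >= epoch)) {
      false // already cleaned at this epoch or later; nothing more to do
    } else {
      shuffleFileLostEpoch(execId) = epoch
      // ... actual shuffle-file removal would happen here ...
      true
    }
  }

  def main(args: Array[String]): Unit = {
    // The executor's exit cleans up its shuffle files at a fresh epoch.
    println(removeShuffleFiles("exec-1", nextEpoch())) // true
    // A fetch failure observed afterwards arrives with an older-or-equal
    // epoch, so the redundant cleanup is suppressed.
    println(removeShuffleFiles("exec-1", currentEpoch)) // false
  }
}
```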

