agrawaldevesh commented on pull request #29422: URL: https://github.com/apache/spark/pull/29422#issuecomment-674294907

> Thank you for taking the time to resolve this and make such a clear writeup of the root cause. From an in-production, not-in-test question: if the executor exits, we also want to eagerly clean up everything and resubmit, right?

Yes, for sure. That will happen on its own; I haven't changed that behavior. I have only changed the way fetch failures (stemming from a decommissioned host) are handled. The way they lead to a rerun is that `org.apache.spark.scheduler.DAGScheduler#resubmitFailedStages` gets invoked asynchronously on a fetch failure. The driver will then figure out which stages are missing map outputs and rerun them in topological order.

When an executor exits, it will normally clean up just its own shuffle data (it does not know that its peer executors on the same host will soon be dying as well). It's the incrementing of `shuffleFileLostEpoch` as part of this cleanup that prevents future cleanups when a fetch failure is observed.
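To make the epoch-guard idea concrete, here is a minimal, self-contained sketch (not Spark's actual `DAGScheduler` code; the object and method names below are invented for illustration) of how incrementing an epoch on the first cleanup can suppress a later, redundant cleanup triggered by a fetch failure for the same host:

```scala
// Hypothetical sketch of an epoch guard: the first shuffle-file cleanup
// for a host records the epoch it happened at; a later event carrying
// that same (or an older) epoch is treated as stale and skipped.
object EpochGuardSketch {
  // host -> epoch at which its shuffle files were last cleaned up
  private var shuffleFileLostEpoch = Map.empty[String, Long]
  private var currentEpoch = 0L

  // Bump the epoch, e.g. when an executor exit is observed.
  def newEpoch(): Long = { currentEpoch += 1; currentEpoch }

  // Returns true only if shuffle files for `host` should actually be
  // removed now; false means a cleanup at this epoch already happened.
  def maybeRemoveShuffleFiles(host: String, eventEpoch: Long): Boolean = {
    val lastCleanup = shuffleFileLostEpoch.getOrElse(host, -1L)
    if (eventEpoch > lastCleanup) {
      shuffleFileLostEpoch += (host -> eventEpoch)
      true  // first cleanup at this epoch: unregister outputs, remove files
    } else {
      false // stale event (same or older epoch): skip redundant cleanup
    }
  }
}
```

Usage, under the same assumptions: an executor exit bumps the epoch and triggers the real cleanup; a subsequent fetch failure that carries the same epoch is then a no-op, which mirrors the behavior described above.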
> Thank you for taking the time to resolve this and make such a clear writeup of the root cause. From an in-production not-in-test question: if the executor exits we also want to eagerly clean up everything and resubmit right? Yes for sure. That will happen on its own. I haven't really changed that behavior. I have only changed the way fetch failures are handled (stemming from a decommissioned host). And the way they lead to a rerun is that `org.apache.spark.scheduler.DAGScheduler#resubmitFailedStages` gets invoked on a fetch failure asynchronously. The driver will then figure out what stages are missing map outputs and rerun them in topological order. When an executor exits, it will normally clean up just its shuffle data (it does not know that its peer executors on the same host will soon be dying as well). Its the incrementing of the `shuffleFileLostEpoch` as a part of this cleanup that prevents future cleanups when a fetch failure is observed. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org