Hi, I opened https://issues.apache.org/jira/browse/SPARK-22339 a few days ago, and I would like to get some feedback on it. The idea is to push epoch updates to the executors after a fetch failure by piggybacking on the executor heartbeat response, in order to fail faster when an executor and its shuffle blocks are lost, instead of having to wait for all fetch retries to fail and a new task to be started on the reader executors. This can speed up job execution, particularly when executors are lost at the end of a stage in a Spark application that runs a single action at a time. There are more details and a draft patch for this in the JIRA.
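To make the idea a bit more concrete, here is a rough toy model in Python. It is not the draft patch and it does not use any Spark classes: `Driver`, `Executor`, `report_fetch_failure`, `heartbeat_response` and `on_heartbeat_response` are made-up names for illustration, and what an executor does when it sees a newer epoch (here, dropping its cached map output locations) is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    """Toy stand-in for the driver side (hypothetical, not a Spark class)."""
    epoch: int = 0

    def report_fetch_failure(self) -> None:
        # A fetch failure means some map outputs are gone, so bump the epoch.
        self.epoch += 1

    def heartbeat_response(self) -> dict:
        # Piggyback the current epoch on every heartbeat reply.
        return {"epoch": self.epoch}


@dataclass
class Executor:
    """Toy stand-in for a reader executor (hypothetical, not a Spark class)."""
    epoch: int = 0
    cached_locations: dict = field(default_factory=dict)

    def on_heartbeat_response(self, response: dict) -> bool:
        # Assumption: on a newer epoch, drop the cached map output locations
        # right away instead of waiting for fetch retries to run out.
        if response["epoch"] > self.epoch:
            self.epoch = response["epoch"]
            self.cached_locations.clear()
            return True
        return False


driver = Driver()
reader = Executor(cached_locations={0: "executor-1"})

# Heartbeat before any failure: nothing changes on the reader.
stale_before = reader.on_heartbeat_response(driver.heartbeat_response())

# Another executor reports a fetch failure; the next heartbeat carries the new epoch.
driver.report_fetch_failure()
refreshed = reader.on_heartbeat_response(driver.heartbeat_response())
```

In this model the reader learns about the lost blocks on its next heartbeat, so the delay is bounded by the heartbeat interval rather than by the fetch retry settings.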
Looking forward to your feedback on this. Greetings, Juan