GitHub user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1609#issuecomment-50305416

@mridulm Thanks for submitting this! I would like to dig a little deeper into the specific issues you found so that I can better understand the solutions you have provided (since the specific solutions seem interleaved with a lot of new asserts and code paths).

You mention that there was an issue if shuffle writes co-occur with shuffle fetches, which is true, but this should not typically occur due to the barrier before the reduce stage of a shuffle. In what situations does this happen (outside of failure conditions)?

Did you observe a prior pattern of close/revert/close on the same block writer?

How did task failures induce inconsistent state on the map side? Was it due to the same close/revert/close pattern?