Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/14932 Context for other reviewers: As of #9610 (Spark 1.5.3+), map tasks write their output to temporary files and then atomically rename those files to put them at the final destination path. The `IndexShuffleBlockResolver.removeDataByMap()` call deletes the final map output files, not the temporary files. Therefore if a map task fails before these files are written then it does not need to call this cleanup method. Output files are renamed to their final locations via the `IndexShuffleBlockResolver.writeIndexFileAndCommit()` method which is called at the end of the shuffle write. If we assume that this method is atomic in case of failures (e.g. it doesn't leave any state around after failing) and also assume that no steps after `writeIndexFileAndCommit()` will fail then it should be fine to omit these `removeDataByMap` calls in the error-handling cases. Looking at `writeIndexFileAndCommit`, I think I spot a rare corner-case where it might behave non-atomically: ``` else { // This is the first successful attempt in writing the map outputs for this task, // so override any existing index and data files with the ones we wrote. if (indexFile.exists()) { indexFile.delete() } if (dataFile.exists()) { dataFile.delete() } if (!indexTmp.renameTo(indexFile)) { throw new IOException("fail to rename file " + indexTmp + " to " + indexFile) } if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) { throw new IOException("fail to rename file " + dataTmp + " to " + dataFile) } } ``` I suppose it's possible that the index rename could succeed while the data file rename could fail. In this case we should probably handle cleanup of the data temp file. Similarly, the existing code won't necessarily clean up `indexTmp` or `dataTmp` in error-conditions. I don't think that either of these are huge issues in practice, but they might be worth fixing as long as we're removing an additional layer of error-handling defense as part of this PR.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org