Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/9610#discussion_r44717955

    --- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/HashShuffleWriter.scala ---
    @@ -106,6 +108,19 @@ private[spark] class HashShuffleWriter[K, V](
             writer.commitAndClose()
             writer.fileSegment().length
           }
    +      // rename all shuffle files to final paths
    +      shuffle.writers.zip(sizes).foreach { case (writer: DiskBlockObjectWriter, size: Long) =>
    +        if (size > 0) {
    +          val output = blockManager.diskBlockManager.getFile(writer.blockId)
    +          if (output.exists()) {
    +            writer.file.delete()
    +          } else {
    +            if (!writer.file.renameTo(output)) {
    +              throw new IOException(s"fail to rename ${writer.file} to $output")
    --- End diff --

Yeah, I suppose it all depends on what the model is for non-deterministic data. The reduce tasks can read data from a mix of attempts, but I guess that is OK (we can't completely prevent it in any case).

There is also the problem of returning the right `MapStatus` here, but it doesn't matter as much in this case: you will at least return some set of non-empty blocks that is consistent with the shuffle data on disk, even if the sizes can be arbitrarily wrong.

Also, I know it's super-rare, but there _is_ a race between `output.exists` and `renameTo(output)`: another attempt can create `output` after the check and before the rename. Might as well protect against that.

I also find it a bit weird that this is neither "first attempt wins" nor "last attempt wins". The first attempt to get to each output file wins, so the final set of files can be a mix of attempts. Again, I'd include a comment explaining the logic.
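
For illustration only, here is a minimal sketch of one way the check-then-rename race could be closed. It is not what the PR does. `ShuffleFileCommit` and `commitIfAbsent` are hypothetical names, and the sketch assumes a local filesystem that supports hard links and reports an existing target as `FileAlreadyExistsException` (the JDK documents that as an optional specific exception). The idea is to let `Files.createLink` do the existence check and the publish in one step instead of two.

```scala
import java.io.File
import java.nio.file.{FileAlreadyExistsException, Files}

object ShuffleFileCommit {
  /**
   * Hypothetical helper, not part of Spark: publish `tmp` under the name
   * `output` only if no other attempt has already created `output`.
   * Returns true if this attempt's file became the output, false if an
   * earlier attempt got there first.
   */
  def commitIfAbsent(tmp: File, output: File): Boolean = {
    val won =
      try {
        // Assumption: hard-link creation fails when `output` already exists,
        // so there is no window between "check" and "publish".
        Files.createLink(output.toPath, tmp.toPath)
        true
      } catch {
        case _: FileAlreadyExistsException => false
      }
    // Reached only on success or "already exists"; any other failure
    // propagates above and leaves `tmp` in place for the caller to handle.
    Files.deleteIfExists(tmp.toPath)
    won
  }
}
```

This keeps the "first attempt to reach each output file wins" behaviour described above, so the output can still be a mix of attempts; it only removes the window between the check and the rename.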