Github user fangshil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20931#discussion_r179517200

    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -186,7 +186,9 @@ class HadoopMapReduceCommitProtocol(
             logDebug(s"Clean up default partition directories for overwriting: $partitionPaths")
             for (part <- partitionPaths) {
               val finalPartPath = new Path(path, part)
    -          fs.delete(finalPartPath, true)
    +          if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
    --- End diff --

    @cloud-fan this is to follow the behavior of the HDFS rename spec: it requires the parent of the destination to be present. If we created finalPartPath directly, it would cause another weird behavior in rename, because the dst path would already exist. From the HDFS spec I shared above: "If the destination exists and is a directory, the final destination of the rename becomes the destination + the filename of the source path". We have confirmed this in our production cluster, which led to the current solution of creating only the parent dir. That follows the HDFS spec exactly.
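
    To make the two rename rules above concrete, here is a minimal sketch of them as a toy in-memory model. It is not the Hadoop FileSystem API: `ToyFs`, its methods, and the example paths are all hypothetical, and returning `false` when the destination's parent is missing is an assumption about how that failure surfaces.

```scala
import scala.collection.mutable

// Toy model of the two directory-rename rules quoted in the comment above.
// NOT the Hadoop FileSystem API: ToyFs and the paths below are hypothetical.
object RenameSketch {
  final class ToyFs {
    // Every entry is a directory, stored as an absolute path such as "/out/a=1".
    private val dirs = mutable.Set("/")

    private def parentOf(p: String): String = {
      val i = p.lastIndexOf('/')
      if (i <= 0) "/" else p.substring(0, i)
    }

    private def nameOf(p: String): String = p.substring(p.lastIndexOf('/') + 1)

    def exists(p: String): Boolean = dirs.contains(p)

    // Create p together with any missing ancestors.
    def mkdirs(p: String): Unit = {
      var cur = p
      while (!dirs.contains(cur)) { dirs += cur; cur = parentOf(cur) }
    }

    // Rule 1: the parent of dst must exist (assumed here to yield false).
    // Rule 2: if dst already exists as a directory, src is moved to dst/<name of src>.
    def rename(src: String, dst: String): Boolean =
      if (!exists(src) || !exists(parentOf(dst))) false
      else {
        val target = if (exists(dst)) dst + "/" + nameOf(src) else dst
        val moved = dirs.filter(d => d == src || d.startsWith(src + "/")).toList
        moved.foreach { d => dirs -= d; dirs += target + d.substring(src.length) }
        true
      }
  }

  val staged = "/stage/b=2"         // hypothetical staging directory
  val finalPart = "/out/a=1/b=2"    // hypothetical finalPartPath
  val nested = finalPart + "/b=2"   // where the data ends up if finalPartPath pre-exists

  // Fresh toy filesystem holding the staged partition and the output root only.
  def fresh(): ToyFs = { val fs = new ToyFs; fs.mkdirs(staged); fs.mkdirs("/out"); fs }

  def main(args: Array[String]): Unit = {
    // Case 1: the destination's parent is missing, so the rename is rejected.
    val a = fresh()
    println(s"no parent:      renamed=${a.rename(staged, finalPart)}")

    // Case 2: finalPartPath itself is pre-created, so the source is nested under it.
    val b = fresh()
    b.mkdirs(finalPart)
    b.rename(staged, finalPart)
    println(s"final created:  nested=${b.exists(nested)}")

    // Case 3: only the parent is pre-created, so the source lands at finalPartPath.
    val c = fresh()
    c.mkdirs("/out/a=1")
    c.rename(staged, finalPart)
    println(s"parent created: nested=${c.exists(nested)}, final=${c.exists(finalPart)}")
  }
}
```

    In this model only case 3 leaves the data exactly at finalPartPath, which is the reason for creating the parent dir rather than finalPartPath itself.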