Github user fangshil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20931#discussion_r179517200
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
    @@ -186,7 +186,9 @@ class HadoopMapReduceCommitProtocol(
             logDebug(s"Clean up default partition directories for overwriting: $partitionPaths")
             for (part <- partitionPaths) {
               val finalPartPath = new Path(path, part)
    -          fs.delete(finalPartPath, true)
    +          if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
    --- End diff ---
    
    @cloud-fan this is to follow the behavior of the HDFS rename spec: it requires the parent of the destination to be present. If we created finalPartPath directly, it would cause another weird behavior in rename when the dst path already exists. From the HDFS spec I shared above: "If the destination exists and is a directory, the final destination of the rename becomes the destination + the filename of the source path". We have confirmed this in our production cluster, and that led to the current solution of creating only the parent dir, which follows the HDFS spec exactly.
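
    To make the intent concrete, here is a minimal Scala sketch of the cleanup logic being discussed (illustrative only, not the exact patch; it assumes Hadoop's FileSystem.delete/exists/mkdirs API, and the helper name prepareForRename is hypothetical):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: when cleaning up a partition directory for dynamic overwrite,
    // leave only its PARENT in place, never finalPartPath itself. HDFS rename
    // requires the parent of the destination to exist, but if the destination
    // itself already exists as a directory, rename nests the source under it
    // (destination + source filename), which is the weird behavior described above.
    def prepareForRename(fs: FileSystem, finalPartPath: Path): Unit = {
      // delete() returns false when finalPartPath did not exist; in that case
      // the parent may also be missing, so create it to satisfy rename's
      // parent-must-exist requirement.
      if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
        fs.mkdirs(finalPartPath.getParent)
      }
    }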


---
