[GitHub] [spark] Ngone51 commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode

GitBox Sat, 01 Aug 2020 07:35:43 -0700


Ngone51 commented on a change in pull request #29000:
URL: https://github.com/apache/spark/pull/29000#discussion_r463965241




##########
File path: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -41,13 +41,17 @@ import org.apache.spark.mapred.SparkHadoopMapRedUtil
  * @param jobId the job's or stage's id
  * @param path the job's output path, or null if committer acts as a noop
  * @param dynamicPartitionOverwrite If true, Spark will overwrite partition 
directories at runtime
- *                                  dynamically, i.e., we first write files 
under a staging
- *                                  directory with partition path, e.g.
- *                                  /path/to/staging/a=1/b=1/xxx.parquet. When 
committing the job,
- *                                  we first clean up the corresponding 
partition directories at
- *                                  destination path, e.g. 
/path/to/destination/a=1/b=1, and move
- *                                  files from staging directory to the 
corresponding partition
- *                                  directories under destination path.
+ *                                  dynamically, i.e., for speculative tasks, 
we first write files

Review comment:
       This is not only for the speculative task?

##########
File path: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -106,13 +110,13 @@ class HadoopMapReduceCommitProtocol(
     val filename = getFilename(taskContext, ext)
 
     val stagingDir: Path = committer match {
-      case _ if dynamicPartitionOverwrite =>
-        assert(dir.isDefined,
-          "The dataset to be written must be partitioned when 
dynamicPartitionOverwrite is true.")
-        partitionPaths += dir.get
-        this.stagingDir
       // For FileOutputCommitter it has its own staging path called "work 
path".
       case f: FileOutputCommitter =>
+        if (dynamicPartitionOverwrite) {

Review comment:
       Could we make sure that we actually only support 
`dynamicPartitionOverwrite` with `FileOutputCommitter`?

##########
File path: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -41,13 +41,17 @@ import org.apache.spark.mapred.SparkHadoopMapRedUtil
  * @param jobId the job's or stage's id
  * @param path the job's output path, or null if committer acts as a noop
  * @param dynamicPartitionOverwrite If true, Spark will overwrite partition 
directories at runtime
- *                                  dynamically, i.e., we first write files 
under a staging
- *                                  directory with partition path, e.g.
- *                                  /path/to/staging/a=1/b=1/xxx.parquet. When 
committing the job,
- *                                  we first clean up the corresponding 
partition directories at
- *                                  destination path, e.g. 
/path/to/destination/a=1/b=1, and move
- *                                  files from staging directory to the 
corresponding partition
- *                                  directories under destination path.
+ *                                  dynamically, i.e., for speculative tasks, 
we first write files
+ *                                  to task attempt paths under a staging 
directory, e.g.
+ *                                  
/path/to/staging/.spark-staging-{jobId}/_temporary/

Review comment:
       Shouldn't it be `/path/to/outputPath/.spark-staging-{jobId}/...` ?

##########
File path: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -41,13 +41,17 @@ import org.apache.spark.mapred.SparkHadoopMapRedUtil
  * @param jobId the job's or stage's id
  * @param path the job's output path, or null if committer acts as a noop
  * @param dynamicPartitionOverwrite If true, Spark will overwrite partition 
directories at runtime
- *                                  dynamically, i.e., we first write files 
under a staging
- *                                  directory with partition path, e.g.
- *                                  /path/to/staging/a=1/b=1/xxx.parquet. When 
committing the job,
- *                                  we first clean up the corresponding 
partition directories at
- *                                  destination path, e.g. 
/path/to/destination/a=1/b=1, and move
- *                                  files from staging directory to the 
corresponding partition
- *                                  directories under destination path.
+ *                                  dynamically, i.e., for speculative tasks, 
we first write files
+ *                                  to task attempt paths under a staging 
directory, e.g.
+ *                                  
/path/to/staging/.spark-staging-{jobId}/_temporary/
+ *                                  
{appAttemptId}/_temporary/{taskAttemptId}/a=1/b=1/xxx.parquet.
+ *                                  When committing the job, we first move 
files from task attempt

Review comment:
       I'm wondering that the first moving actually happens during task 
committing?

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala
##########
@@ -55,7 +55,8 @@ class SQLHadoopMapReduceCommitProtocol(
         // The specified output committer is a FileOutputCommitter.
         // So, we will use the FileOutputCommitter-specified constructor.
         val ctor = clazz.getDeclaredConstructor(classOf[Path], 
classOf[TaskAttemptContext])
-        committer = ctor.newInstance(new Path(path), context)
+        val committerOutputPath = if (dynamicPartitionOverwrite) stagingDir 
else new Path(path)
+        committer = ctor.newInstance(committerOutputPath, context)

Review comment:
       Is it the same if we pass the `committerOutputPath` in 
`InsertIntoHadoopFsRelationCommand` to `SQLHadoopMapReduceCommitProtocol` 
directly via `FileCommitProtocol.instantiate()`? 

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
##########
@@ -163,14 +164,21 @@ case class InsertIntoHadoopFsRelationCommand(
         }
       }
 
+      // For dynamic partition overwrite, FileOutputCommitter's output path is 
staging path, files
+      // will be renamed from staging path to final output path during commit 
job
+      val committerOutputPath = if (dynamicPartitionOverwrite) {
+        FileCommitProtocol.getStagingDir(outputPath.toString, jobId)
+          .makeQualified(fs.getUri, fs.getWorkingDirectory)
+      } else qualifiedOutputPath

Review comment:
       nit:
   ```suggestion
         } else {
           qualifiedOutputPath
         }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] Ngone51 commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode

Reply via email to