[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

zheh12 Mon, 07 May 2018 20:25:37 -0700

Github user zheh12 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21257#discussion_r186608143
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
    @@ -207,9 +207,25 @@ case class InsertIntoHadoopFsRelationCommand(
         }
         // first clear the path determined by the static partition keys (e.g. 
/table/foo=1)
         val staticPrefixPath = 
qualifiedOutputPath.suffix(staticPartitionPrefix)
    -    if (fs.exists(staticPrefixPath) && !committer.deleteWithJob(fs, 
staticPrefixPath, true)) {
    -      throw new IOException(s"Unable to clear output " +
    -        s"directory $staticPrefixPath prior to writing to it")
    +
    +    // check if delete the dir or just sub files
    +    if (fs.exists(staticPrefixPath)) {
    +      // check if is he table root, and record the file to delete
    +      if (staticPartitionPrefix.isEmpty) {
    +        val files = fs.listFiles(staticPrefixPath, false)
    +        while (files.hasNext) {
    +          val file = files.next()
    +          if (!committer.deleteWithJob(fs, file.getPath, true)) {
    --- End diff --
    
    We choose to postpone deletion. Whether or not `output` is the same as 
`input`,
    now the `_temporary` directory is created in the `output` directory before 
deletion,
    so that it is not possible to delete the root directory directly.
    
    The original implementation was able to delete the root directory directly 
because it was deleted before the job was created, and then the root directory 
was rebuilt. Then the `_temporary` directory was created. Failure of any `task` 
in `job` in the original implementation will result in the loss of `output` 
data.
    
     I can't figure out how to separate the two situations. Do you have any 
good ideas?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...

Reply via email to