[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928561#comment-16928561 ]

Xianjin YE commented on SPARK-29037:
------------------------------------

[~hzfeiwang] By "rerun the application", do you mean re-submitting the same 
application again, or YARN retrying with another app attempt?

 

I believe FileOutputCommitter already covers the second case. What you 
described should be the first one: Spark does not clean up the output path 
gracefully and commits a duplicated result.

 

Two possible fixes here:
 # User side: clean up the output path before submitting a new app. This is 
easy for RDD-based applications, but it may not be feasible for SQL/Hive 
tables.
 # Spark/Hadoop side:
 ## The proper fix would be for FileOutputCommitter to trash files under the 
jobAttemptPath ($output/_temporary/$app_attempt_id). This can be done on the 
Hadoop side, which may require a new Hadoop release, or we can do it in 
Spark's FileCommitProtocol (a minimal sketch follows this list).
 ## Another option is to add an output check in setupJob, just like 
SparkHadoopWriter does: 
[https://github.com/apache/spark/blob/b508eab9858b94f14b29e023812448e3d0c97712/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L268]
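
For the FileCommitProtocol option above, a minimal sketch of what the 
Spark-side cleanup could look like, assuming it runs during job setup before 
any new task output is written. The object and method names here are 
hypothetical, not existing Spark or Hadoop APIs; the real change would live in 
Spark's FileCommitProtocol or in FileOutputCommitter itself.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, not an existing API: before a re-submitted application
// starts writing, drop whatever a killed run left under $output/_temporary so
// that stale committedTaskPath dirs cannot be merged into the destination at
// commit time.
object StaleCommitCleanup {
  def dropStaleJobAttempts(outputPath: String, conf: Configuration): Unit = {
    val pendingDir = new Path(outputPath, "_temporary")
    val fs = pendingDir.getFileSystem(conf)
    if (fs.exists(pendingDir)) {
      // Recursively deletes leftovers such as $output/_temporary/$app_attempt_id/...
      fs.delete(pendingDir, true)
    }
  }
}
{code}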

 

> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> This happens when we insert overwrite a partition of a table.
> In a stage whose tasks commit output, a task first writes its output to a 
> staging dir; when the task completes, it moves that output to 
> committedTaskPath; and when all tasks of the stage succeed, all task output 
> under committedTaskPath is moved to the destination dir.
> However, when an application is killed while it is committing tasks' output, 
> parts of the tasks' results are kept under committedTaskPath and are not 
> cleaned up gracefully.
> When we rerun the application, the new application reuses the same 
> committedTaskPath dir.
> When the task commit stage of the new application succeeds, all task output 
> under this committedTaskPath, which includes parts of the old application's 
> output, is moved to the destination dir, and the result is duplicated.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
