[ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928650#comment-16928650
 ] 

Xianjin YE commented on SPARK-29037:
------------------------------------

> About the output check, I think it is not appropriate, because when several
> applications (each running insert overwrite on a partition of the same table)
> run at the same time, they may use the same committedTaskPath.

 

Well, this is what https://issues.apache.org/jira/browse/SPARK-28945 is trying 
to resolve. If you are inserting into a partitioned table with 
dynamicPartitionOverwrite, you may not encounter this problem. Meanwhile, 
Spark doesn't currently support concurrent writes to the same table. Even with 
SPARK-28945, concurrent writes to the same partition of the same table will 
mess up the output.

If we are going to resolve concurrent writes in dynamic partition 
overwrite (Spark's previous style), we can add an output spec check for 
different static partitions, as they will then use different output paths 
($table_output/static_part_key1=value1/static_part_key2=value2).
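
To make the idea concrete, here is a rough sketch of such a check. It is only 
an illustration, not Spark's API: the function names static_partition_path and 
outputs_conflict and the /warehouse/t path are hypothetical. It builds the 
per-static-partition output path in the form shown above and treats two 
writers as conflicting when their paths are equal or one contains the other.

```python
from pathlib import PurePosixPath


def static_partition_path(table_output, static_parts):
    """Build $table_output/key1=value1/key2=value2 from ordered (key, value) pairs.

    Hypothetical helper for illustration; not part of Spark.
    """
    path = PurePosixPath(table_output)
    for key, value in static_parts:
        path = path / f"{key}={value}"
    return path


def outputs_conflict(path_a, path_b):
    """Return True when two output paths are equal or one is nested in the other."""
    a, b = PurePosixPath(path_a), PurePosixPath(path_b)
    return a == b or a in b.parents or b in a.parents


if __name__ == "__main__":
    first = static_partition_path("/warehouse/t", [("dt", "2019-09-12"), ("hr", "01")])
    second = static_partition_path("/warehouse/t", [("dt", "2019-09-12"), ("hr", "02")])
    print(first, second, outputs_conflict(first, second))
```

Writers targeting different static partitions get disjoint paths and pass the 
check; two writers targeting the same partition, or a partition and one of its 
parents, would be rejected.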

> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> When we insert overwrite a partition of a table:
> For a stage whose tasks commit output, a task first saves its output to a 
> staging dir. When the task completes, it saves its output to 
> committedTaskPath. When all tasks of the stage succeed, all task output under 
> committedTaskPath is moved to the destination dir.
> However, when we kill an application that is committing tasks' output, part 
> of the tasks' results is left in committedTaskPath and is not cleaned up.
> Then we rerun the application, and the new application reuses this 
> committedTaskPath dir.
> When the task commit stage of the new application succeeds, all task output 
> under this committedTaskPath, which includes part of the old application's 
> task output, is moved to the destination dir, and the result is duplicated.
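
The failure mode described in the issue can be reproduced in miniature with 
plain directories. This is a toy model, not Spark's or Hadoop's committer: the 
helpers commit_task, commit_job and run_demo, the _temporary/0 layout and the 
part-<task>-<app> file names are assumptions made for illustration. In 
particular it assumes each application writes distinctly named files, so 
leftovers from the killed run are not overwritten by the rerun.

```python
import shutil
import tempfile
from pathlib import Path


def commit_task(committed_dir, task_id, app_id):
    """Task commit: place the task's output under the shared committed path."""
    task_dir = committed_dir / f"task_{task_id}"
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / f"part-{task_id}-{app_id}").write_text(f"rows of task {task_id}\n")


def commit_job(committed_dir, dest_dir):
    """Job commit: move every file under the committed path to the destination."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    for part in sorted(committed_dir.glob("task_*/part-*")):
        shutil.move(str(part), str(dest_dir / part.name))
    shutil.rmtree(committed_dir)


def run_demo():
    root = Path(tempfile.mkdtemp())
    committed, dest = root / "_temporary" / "0", root / "dest"

    # First application: task 0 commits, then the application is killed
    # before job commit, so its output stays under the committed path.
    commit_task(committed, 0, "app1")

    # Rerun: the new application reuses the same committed path.
    commit_task(committed, 0, "app2")
    commit_task(committed, 1, "app2")
    commit_job(committed, dest)

    names = sorted(p.name for p in dest.iterdir())
    shutil.rmtree(root)
    return names


if __name__ == "__main__":
    print(run_demo())
```

The destination ends up with two files for task 0, one from each application, 
which is the duplicated result reported above.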



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
