[ 
https://issues.apache.org/jira/browse/HIVE-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zihao Ye updated HIVE-20912:
----------------------------
    Priority: Critical  (was: Major)

> Output data might be duplicated while speculation is enabled
> ------------------------------------------------------------
>
>                 Key: HIVE-20912
>                 URL: https://issues.apache.org/jira/browse/HIVE-20912
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Operators
>    Affects Versions: 1.2.1
>         Environment: Hive 1.2.1
> Hadoop 2.7.3
> Tez 0.7.0
>            Reporter: Zihao Ye
>            Priority: Critical
>         Attachments: image-2018-11-14-17-48-59-826.png, 
> image-2018-11-14-17-53-13-191.png, image-2018-11-14-17-53-50-171.png, 
> image-2018-11-14-19-28-18-924.png
>
>
> The file merge stage had two tasks, which should have created two files, but 
> three files were created.
> !image-2018-11-14-19-28-18-924.png!
> By tracing the log, we found that two task attempts (one of them 
> speculative) happened to finish within the same second. Although the 
> later one received a kill signal from the AM, its rename operation had 
> already completed by then, which caused the data duplication.
> The rename operation is done in _AbstractFileMergeOperator.closeOp()_, where 
> the final path name is determined by the task attempt id rather than the task 
> id. In this case, the final paths ended with '000000_0' and '000000_1' rather 
> than '000000'. IMHO, if the final path name ended with the task id alone, 
> without the task attempt id, one task could generate at most one file, which 
> could solve this issue. But I don't know the side effects of changing the 
> final path name.
> This issue also affects other operators that rename files, such as 
> JoinOperator and FileSinkOperator.
> !image-2018-11-14-17-53-13-191.png!
> !image-2018-11-14-17-53-50-171.png!
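
The naming problem described above can be illustrated with a minimal sketch. This is not Hive code: the class SpeculativeRenameDemo and the method commitAttempts are hypothetical, and the local filesystem with java.nio.file.Files.move stands in for the rename done in closeOp(). It also assumes that a rename onto an existing final path is refused, which may not match the behaviour of the filesystem Hive actually runs on.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical stand-in for the commit step of a task; not Hive code. */
class SpeculativeRenameDemo {

    /**
     * Simulates several attempts of task "000000" that all reach the rename
     * step before a kill takes effect. Returns how many files end up in the
     * final output directory.
     */
    static long commitAttempts(boolean nameByTaskIdOnly, int attempts) {
        try {
            Path tmpDir = Files.createTempDirectory("tmp-attempts");
            Path outDir = Files.createTempDirectory("final-output");
            String taskId = "000000";

            for (int attempt = 0; attempt < attempts; attempt++) {
                // Each attempt writes its own temporary file.
                Path tmp = tmpDir.resolve(taskId + "_" + attempt);
                Files.write(tmp, "merged rows\n".getBytes(StandardCharsets.UTF_8));

                // The final name either keeps the attempt suffix (behaviour
                // reported in the issue) or uses the task id alone (proposal).
                String finalName = nameByTaskIdOnly ? taskId : taskId + "_" + attempt;

                try {
                    // Without REPLACE_EXISTING, Files.move fails if the
                    // target already exists.
                    Files.move(tmp, outDir.resolve(finalName));
                } catch (FileAlreadyExistsException e) {
                    // Another attempt already committed; discard this output.
                    Files.delete(tmp);
                }
            }

            try (Stream<Path> files = Files.list(outDir)) {
                return files.count();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("attempt-id names: " + commitAttempts(false, 2) + " file(s)");
        System.out.println("task-id names:    " + commitAttempts(true, 2) + " file(s)");
    }
}
```

With two attempts, the attempt-suffixed names leave two files for a single task, while the task-id-only name leaves one, because the second rename collides with the first. Whether that collision is detected in a real deployment depends on the rename semantics of the underlying filesystem, so this only shows the idea behind the proposal, not its side effects.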



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)