Zihao Ye created HIVE-20912:
-------------------------------
Summary: Output data might be duplicated while speculation is
enabled
Key: HIVE-20912
URL: https://issues.apache.org/jira/browse/HIVE-20912
Project: Hive
Issue Type: Bug
Components: Hive, Operators
Affects Versions: 1.2.1
Environment: Hive 1.2.1
Hadoop 2.7.3
Tez 0.7.0
Reporter: Zihao Ye
Attachments: image-2018-11-14-17-48-59-826.png,
image-2018-11-14-17-53-13-191.png, image-2018-11-14-17-53-50-171.png,
image-2018-11-14-19-28-18-924.png
The file merge stage had two tasks, which should have created two files, but
three files were created.
!image-2018-11-14-19-28-18-924.png!
By tracing the log, we found that two task attempts (one of them a speculative
attempt) happened to finish within the same second. Although the later one
received a kill signal from the AM, its rename operation had already completed
by then, which caused the data duplication.
The rename operation is done in _AbstractFileMergeOperator.closeOp()_, and the
final path name is determined by the task attempt id rather than the task id.
In this case, the final paths ended with '000000_0' and '000000_1' rather than
'000000'. IMHO, if the final path name ended with the task id without the task
attempt id, one task could generate at most one file, which could solve this
issue. But I don't know the side effects of changing the final path name.
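To illustrate the naming problem, here is a minimal, self-contained sketch. It is not Hive code: the class FinalPathSketch and the helpers taskIdOf and finalFiles are hypothetical names, and the final directory is modelled as a plain set of file names. It assumes attempt ids have the form '<task id>_<attempt number>', as in the paths above, and it ignores what a real file system rename does when the destination already exists.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FinalPathSketch {
    // Hypothetical helper: drops a trailing "_<attempt number>" suffix,
    // e.g. "000000_1" -> "000000". Names without such a suffix are unchanged.
    static String taskIdOf(String attemptId) {
        return attemptId.replaceFirst("_\\d+$", "");
    }

    // Models the final directory as a set of file names. Each finished
    // attempt "renames" its output to a final name derived either from the
    // attempt id (current behaviour as described in this report) or from
    // the task id only (the proposed behaviour).
    static Set<String> finalFiles(List<String> finishedAttempts, boolean useTaskIdOnly) {
        Set<String> files = new LinkedHashSet<>();
        for (String attempt : finishedAttempts) {
            files.add(useTaskIdOnly ? taskIdOf(attempt) : attempt);
        }
        return files;
    }

    public static void main(String[] args) {
        // Task 000000 has a regular and a speculative attempt that both finish.
        List<String> finished = Arrays.asList("000000_0", "000000_1", "000001_0");
        System.out.println(finalFiles(finished, false)); // [000000_0, 000000_1, 000001_0]
        System.out.println(finalFiles(finished, true));  // [000000, 000001]
    }
}
```

With attempt-based names, two tasks yield three files, which matches what we observed. With task-based names, both attempts of task 000000 map to the same final name, so the outcome would depend on how the second rename is handled, which is one of the side effects that would need checking.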
This issue also affects other operators related to file renaming like
JoinOperator and FileSinkOperator.
!image-2018-11-14-17-53-13-191.png!
!image-2018-11-14-17-53-50-171.png!