[ https://issues.apache.org/jira/browse/HIVE-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zihao Ye updated HIVE-20912:
----------------------------
    Priority: Critical  (was: Major)

> Output data might be duplicated while speculation is enabled
> ------------------------------------------------------------
>
>                 Key: HIVE-20912
>                 URL: https://issues.apache.org/jira/browse/HIVE-20912
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Operators
>    Affects Versions: 1.2.1
>         Environment: Hive 1.2.1
> Hadoop 2.7.3
> Tez 0.7.0
>            Reporter: Zihao Ye
>            Priority: Critical
>         Attachments: image-2018-11-14-17-48-59-826.png,
> image-2018-11-14-17-53-13-191.png, image-2018-11-14-17-53-50-171.png,
> image-2018-11-14-19-28-18-924.png
>
>
> The file merge stage had two tasks, which should have created two files, but
> three files were created.
> !image-2018-11-14-19-28-18-924.png!
> By tracing the log, we found that two attempts of the same task (one of them
> speculative) happened to finish within the same second. Although the later
> attempt received a kill signal from the AM, its rename operation had already
> completed by then, which caused the data duplication.
> The rename operation is performed in _AbstractFileMergeOperator.closeOp()_,
> and the final path name is determined by the task attempt id rather than the
> task id. In this case, the final paths ended with '000000_0' and '000000_1'
> rather than '000000'. IMHO, if the final path name ended with the task id,
> without the task attempt id, one task could generate at most one file, which
> could solve this issue. But I don't know the side effects of changing the
> final path name.
> This issue also affects other operators that rename files, such as
> JoinOperator and FileSinkOperator.
> !image-2018-11-14-17-53-13-191.png!
> !image-2018-11-14-17-53-50-171.png!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
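
The naming problem described in the report can be illustrated with a small standalone sketch. This is not Hive's code: the class `AttemptNamingSketch` and the helpers `commit` and `countFinalFiles` are hypothetical, and it uses `java.nio.file` on the local filesystem rather than Hadoop's FileSystem API, whose rename semantics may differ. It only shows that naming the final file by task attempt id lets two finished attempts of task '000000' both commit, while naming it by task id lets at most one rename succeed.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class AttemptNamingSketch {

    // Hypothetical commit step: write this attempt's output to a temp file,
    // then rename it into finalDir. Returns true if the rename succeeded.
    static boolean commit(Path tmpDir, Path finalDir, String taskId, int attempt,
                          boolean nameByAttempt) throws IOException {
        Path tmp = tmpDir.resolve(taskId + "_" + attempt);
        Files.write(tmp, ("rows from attempt " + attempt).getBytes());

        // Attempt-based naming: 000000_0, 000000_1. Task-based naming: 000000.
        String finalName = nameByAttempt ? taskId + "_" + attempt : taskId;
        try {
            // Without REPLACE_EXISTING, the move fails if the target exists.
            Files.move(tmp, finalDir.resolve(finalName));
            return true;
        } catch (FileAlreadyExistsException e) {
            // Another attempt of the same task already committed; drop this output.
            Files.delete(tmp);
            return false;
        }
    }

    // Runs two attempts of task 000000 to completion and counts the final files.
    static long countFinalFiles(boolean nameByAttempt) throws IOException {
        Path tmpDir = Files.createTempDirectory("tmp");
        Path finalDir = Files.createTempDirectory("final");
        commit(tmpDir, finalDir, "000000", 0, nameByAttempt);
        commit(tmpDir, finalDir, "000000", 1, nameByAttempt); // speculative attempt
        try (Stream<Path> files = Files.list(finalDir)) {
            return files.count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("named by attempt id: " + countFinalFiles(true));  // 2
        System.out.println("named by task id:    " + countFinalFiles(false)); // 1
    }
}
```

The sketch leaves open the question raised in the report: what side effects a task-id-based final name would have elsewhere in Hive, and how a failed rename should be handled on the real filesystem.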