Sankar Hariappan created HIVE-17608:
---------------------------------------
Summary: REPL LOAD should overwrite the data files if exists
instead of duplicating it
Key: HIVE-17608
URL: https://issues.apache.org/jira/browse/HIVE-17608
Project: Hive
Issue Type: Sub-task
Components: HiveServer2, repl
Affects Versions: 3.0.0
Reporter: Sankar Hariappan
Assignee: Sankar Hariappan
Fix For: 3.0.0
The goal is to make replay of insert events idempotent.
Currently, MoveTask creates a new file if the destination folder already
contains a file of the same name. This is wrong when the same file appears in
both the bootstrap dump and the incremental dump (by design, a duplicate file
in the incremental dump should be ignored for idempotency): we eventually end
up with duplicate files. It is also wrong to simply retain the file name used
in the staging folder. Suppose we get the same insert event twice: the first
time we get the file from the source table folder, and the second time we get
it from cm. We still end up with a duplicate copy. The right solution is to
keep the same file name as in the source table folder.
To do that, we can put the original file name in MoveWork. In MoveTask, if the
original file name is set, we don't generate a new name and simply overwrite
the existing file. This needs to be done in both bootstrap and incremental
load.
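
A minimal sketch of the proposed behaviour, written in Python purely for
illustration. It does not use Hive's MoveTask or MoveWork classes; move_file,
replay, the "_copy_N" suffix and the file names are hypothetical stand-ins.
The file is copied rather than moved only to keep the sketch short.

```python
import os
import shutil
import tempfile


def move_file(src, dest_dir, original_name=None):
    """Hypothetical model of the move step; not Hive's actual MoveTask."""
    os.makedirs(dest_dir, exist_ok=True)
    if original_name is not None:
        # Proposed: reuse the source table's file name and overwrite in place.
        target = os.path.join(dest_dir, original_name)
    else:
        # Current: keep the staged name, picking a fresh one if it is taken.
        base = os.path.basename(src)
        target = os.path.join(dest_dir, base)
        n = 1
        while os.path.exists(target):
            target = os.path.join(dest_dir, f"{base}_copy_{n}")
            n += 1
    shutil.copyfile(src, target)  # overwrites target if it already exists
    return target


def replay(staged_names, original_name=None):
    """Load the same data once per staged name; return the resulting files."""
    with tempfile.TemporaryDirectory() as root:
        dest = os.path.join(root, "table")
        for i, name in enumerate(staged_names):
            staging = os.path.join(root, f"staging_{i}")
            os.makedirs(staging)
            staged = os.path.join(staging, name)
            with open(staged, "w") as f:
                f.write("row1\n")
            move_file(staged, dest, original_name)
        return sorted(os.listdir(dest))
```

Without an original file name, replaying the same file twice leaves two
copies, whether the staged name is identical or differs (as it might when the
file comes from cm). With the original file name set, each replay overwrites
the same destination file, so the load is idempotent.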
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)