[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> REPL LOAD should overwrite the data files if exists instead of duplicating it
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-17608
>                 URL: https://issues.apache.org/jira/browse/HIVE-17608
>             Project: Hive
>          Issue Type: Sub-task
>          Components: HiveServer2, repl
>    Affects Versions: 3.0.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, pull-request-available, replication
>             Fix For: 3.0.0
>
>         Attachments: HIVE-17608.01.patch, HIVE-17608.02.patch
>
>
> This is to make the insert event idempotent.
> Currently, MoveTask creates a new file if the destination folder already
> contains a file of the same name. This is wrong if the same file is in both
> the bootstrap dump and the incremental dump (by design, a duplicate file in
> the incremental dump is to be ignored for idempotency): we eventually get
> duplicate files. It is also wrong to just retain the filename used in the
> staging folder. Suppose we get the same insert event twice: the first time we
> get the file from the source table folder, the second time we get the file
> from cm, and we still end up with a duplicate copy. The right solution is to
> keep the same file name as in the source table folder.
> To do that, we can put the original filename in MoveWork, and in MoveTask, if
> the original filename is set, don't generate a new name, simply overwrite. We
> need to do this in both bootstrap and incremental load.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
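The naming problem described in the issue can be illustrated with a small standalone sketch. This is not Hive's MoveTask or MoveWork code: the class `IdempotentMoveSketch`, the `load` helper, the `_copy_N` suffix, and the file names are all hypothetical, and local temp directories stand in for the source table folder, cm, and the destination. It only shows why reusing the original file name and overwriting makes a replayed insert event idempotent, while generating a new name on each load does not.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class IdempotentMoveSketch {

    /**
     * Copies src into destDir.
     * If originalName is set, the file is written under that name and any
     * existing file of that name is overwritten (the behaviour the issue proposes).
     * If originalName is null, the name of src is kept and a fresh "_copy_N"
     * name is generated on collision (an illustrative stand-in for the old behaviour).
     */
    static Path load(Path src, Path destDir, String originalName) throws IOException {
        if (originalName != null) {
            return Files.copy(src, destDir.resolve(originalName),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        String name = src.getFileName().toString();
        Path target = destDir.resolve(name);
        for (int i = 1; Files.exists(target); i++) {
            target = destDir.resolve(name + "_copy_" + i);
        }
        return Files.copy(src, target);
    }

    /** Counts the entries directly under dir. */
    static long countFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "row1\n".getBytes(StandardCharsets.UTF_8);

        // The same data file, once as it sits in the source table folder and
        // once as it might sit in cm under a different (hypothetical) name.
        Path fromSource = Files.write(
                Files.createTempDirectory("source-table").resolve("000000_0"), data);
        Path fromCm = Files.write(
                Files.createTempDirectory("cm").resolve("000000_0_checksum"), data);

        // Replay the same insert event twice, keeping whatever name the file arrives with.
        Path destKeepName = Files.createTempDirectory("dest-keep-name");
        load(fromSource, destKeepName, null);
        load(fromCm, destKeepName, null);
        System.out.println("arrival names:  " + countFiles(destKeepName) + " file(s)");

        // Replay the same event twice, always using the original file name.
        Path destOriginal = Files.createTempDirectory("dest-original");
        load(fromSource, destOriginal, "000000_0");
        load(fromCm, destOriginal, "000000_0");
        System.out.println("original name:  " + countFiles(destOriginal) + " file(s)");
    }
}
```

In this sketch the first destination ends up with two copies of the same data and the second with one, which is the difference the issue is after.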
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Attachment: HIVE-17608.02.patch
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Attachment:     (was: HIVE-17608.02.patch)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Attachment: HIVE-17608.02.patch

Added 02.patch to fix test failures.
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-17608:
----------------------------------
    Labels: DR pull-request-available replication  (was: DR replication)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
     [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-17608:
------------------------------------
    Attachment: HIVE-17608.01.patch

Added 01.patch with the updates below.
- Added a load/copy type in LoadTableDesc instead of the replace flag.
- The type has 3 values: REPLACE_ALL, KEEP_EXISTING, OVERWRITE_EXISTING.
- The IMPORT and REPL LOAD flows use OVERWRITE_EXISTING if the replace flag is not set from the source msg.
- Other normal flows such as LOAD and Acid tables continue to use the KEEP_EXISTING flow, which is the current behaviour.

Request [~anishek] to please review the patch!
cc [~thejas]
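The three-valued load/copy type from the 01.patch notes could be sketched as follows. This is an illustration of the selection rule as described in the comment, not the code in the patch: the names `LoadFileTypeSketch`, `LoadFileType`, `Flow`, and `choose` are hypothetical. Mapping a set replace flag to REPLACE_ALL is also an assumption, since the comment only states what happens when the flag is not set.

```java
public class LoadFileTypeSketch {

    /** The three values named in the patch notes. */
    enum LoadFileType {
        REPLACE_ALL,        // drop everything in the destination, then load
        KEEP_EXISTING,      // keep existing files; a colliding file gets a new name
        OVERWRITE_EXISTING  // keep existing files; a colliding file is overwritten
    }

    /** Hypothetical labels for the flows mentioned in the patch notes. */
    enum Flow { IMPORT, REPL_LOAD, LOAD, ACID }

    /**
     * Picks the load type for a flow.
     * Assumption: a set replace flag maps to REPLACE_ALL for every flow.
     * Otherwise IMPORT and REPL LOAD overwrite colliding files, and the
     * other flows keep the existing KEEP_EXISTING behaviour.
     */
    static LoadFileType choose(Flow flow, boolean replaceFlag) {
        if (replaceFlag) {
            return LoadFileType.REPLACE_ALL;
        }
        switch (flow) {
            case IMPORT:
            case REPL_LOAD:
                return LoadFileType.OVERWRITE_EXISTING;
            default:
                return LoadFileType.KEEP_EXISTING;
        }
    }

    public static void main(String[] args) {
        // Print the chosen type for every flow, with and without the replace flag.
        for (Flow flow : Flow.values()) {
            System.out.println(flow + " replace=false: " + choose(flow, false));
            System.out.println(flow + " replace=true:  " + choose(flow, true));
        }
    }
}
```

A single enum in place of a boolean makes the third case explicit: a boolean can only say "replace everything" or "keep everything", while replication needs "keep everything, but overwrite a file that is already there".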