[jira] [Work logged] (HIVE-26133) Insert overwrite on Iceberg tables can result in duplicate entries after partition evolution

ASF GitHub Bot (Jira) Tue, 12 Apr 2022 06:53:05 -0700


     [ https://issues.apache.org/jira/browse/HIVE-26133?focusedWorklogId=755763&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755763 ]


ASF GitHub Bot logged work on HIVE-26133:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/22 13:52
            Start Date: 12/Apr/22 13:52
    Worklog Time Spent: 10m 
      Work Description: marton-bod commented on code in PR #3202:
URL: https://github.com/apache/hive/pull/3202#discussion_r848463614


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -460,6 +461,13 @@ public void validateSinkDesc(FileSinkDesc sinkDesc) throws SemanticException {
       if (IcebergTableUtil.isBucketed(table)) {
         throw new SemanticException("Cannot perform insert overwrite query on bucket partitioned Iceberg table.");
       }
+      if (table.currentSnapshot() != null) {
+        if (table.currentSnapshot().allManifests().parallelStream().map(ManifestFile::partitionSpecId)
+            .filter(id -> id < table.spec().specId()).findAny().isPresent()) {
+          throw new SemanticException(
+              "Cannot perform insert overwrite query on Iceberg table where partition evolution happened.");

Review Comment:
   I guess you're right. As far as I know, we can only resolve this using 
merge + compaction (neither of which is available in Hive currently). So let's 
leave the message as it is, and we can later extend it with some useful tips 
on how to do the rewrite.
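
For reference, the condition in the hunk above can be exercised without an Iceberg dependency. The following is only a minimal sketch: the class name `SpecEvolutionGuard` and the method `hasOlderSpec` are hypothetical, and the list of integers stands in for the `partitionSpecId()` values of the current snapshot's manifests. It uses `anyMatch`, which is equivalent to the `filter(...).findAny().isPresent()` chain in the PR.

```java
import java.util.List;

public class SpecEvolutionGuard {

  // Hypothetical helper (not part of Hive or Iceberg): returns true if any
  // manifest was written with a partition spec id lower than the current one,
  // i.e. the table still holds data laid out under an older spec.
  public static boolean hasOlderSpec(List<Integer> manifestSpecIds, int currentSpecId) {
    return manifestSpecIds.stream().anyMatch(id -> id < currentSpecId);
  }

  public static void main(String[] args) {
    // All manifests use the current spec: insert overwrite would be allowed.
    System.out.println(hasOlderSpec(List.of(0, 0), 0));
    // One manifest predates the current spec: insert overwrite would be rejected.
    System.out.println(hasOlderSpec(List.of(0, 1), 1));
  }
}
```

In the real handler the spec ids would come from the table's manifests and a `SemanticException` would be thrown when the check is true; the sketch only isolates the comparison.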





Issue Time Tracking
-------------------

    Worklog Id:     (was: 755763)
    Time Spent: 1.5h  (was: 1h 20m)

> Insert overwrite on Iceberg tables can result in duplicate entries after 
> partition evolution
> --------------------------------------------------------------------------------------------
>
>                 Key: HIVE-26133
>                 URL: https://issues.apache.org/jira/browse/HIVE-26133
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Insert overwrite commands in Hive only rewrite partitions affected by the 
> query.
> If we write out a record with specA (e.g. day(ts)), it results in a datafile:
> "/tableRoot/data/ts_day=2020-10-24/ffffgggg.orc"
> If we then change to specB (e.g. day(ts), name), the same record would go to 
> a different partition:
> "/tableRoot/data/ts_day=2020-10-24/name=Mike/ffffgggg.orc"
> If we then overwrite the table with itself, Hive detects that these two 
> records belong to different partitions (as they do), and therefore does 
> not overwrite the original record with the new one, resulting in duplicate 
> entries.
> {code:java}
> create table testice1000 (a int, b string) stored by iceberg stored as orc 
> location 'file:/tmp/testice1000';
> insert into testice1000 values (11, 'ddd'), (22, 'ttt');
> alter table testice1000 set partition spec(truncate(2, b));
> insert into testice1000 values (33, 'rrfdfdf');
> insert overwrite table testice1000 select * from testice1000;
> ------------------------------+
> testice1000.a testice1000.b
> ------------------------------+
> 11 ddd   
> 11 ddd   
> 22 ttt   
> 22 ttt   
> 33 rrfdfdf
> ------------------------------+
> {code}
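
The mechanism described above can be reproduced with a toy model. This is a sketch under stated assumptions, not how Hive or Iceberg store data: the table is modelled as a map from partition key to rows, the key `<unpartitioned>` stands for files written before the spec change, and the class and method names are hypothetical. The overwrite replaces only the partitions that the new rows land in, so rows stored under the old layout survive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OverwriteDuplicatesSketch {

  // Partition key under the new spec, truncate(2, b).
  public static String truncateKey(String b) {
    return "b_trunc=" + b.substring(0, Math.min(2, b.length()));
  }

  // Toy insert overwrite: only partitions that receive new rows are replaced;
  // every other partition in the table is left untouched.
  public static void insertOverwrite(Map<String, List<String>> table, List<String[]> rows) {
    Map<String, List<String>> fresh = new LinkedHashMap<>();
    for (String[] row : rows) {
      fresh.computeIfAbsent(truncateKey(row[1]), k -> new ArrayList<>())
          .add(row[0] + " " + row[1]);
    }
    table.putAll(fresh);
  }

  // Full table scan across all partitions, sorted for stable output.
  public static List<String> scan(Map<String, List<String>> table) {
    List<String> out = new ArrayList<>();
    table.values().forEach(out::addAll);
    Collections.sort(out);
    return out;
  }

  public static List<String> demo() {
    Map<String, List<String>> table = new LinkedHashMap<>();
    // Rows written before "alter table ... set partition spec".
    table.put("<unpartitioned>", new ArrayList<>(List.of("11 ddd", "22 ttt")));
    // Row written after the spec change.
    table.put(truncateKey("rrfdfdf"), new ArrayList<>(List.of("33 rrfdfdf")));

    // insert overwrite table testice1000 select * from testice1000
    List<String[]> selected = new ArrayList<>();
    for (String line : scan(table)) {
      selected.add(line.split(" ", 2));
    }
    insertOverwrite(table, selected);
    return scan(table);
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

In this model the rows 11 and 22 end up both under the old unpartitioned key and under their new truncate(2, b) keys, while row 33 is replaced in place, which matches the five-row result in the issue description.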



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
