Shane-Yu opened a new issue, #5671:
URL: https://github.com/apache/iceberg/issues/5671
### Apache Iceberg version
0.13.1
### Query engine
Hive
### Please describe the bug 🐞
To reproduce in Iceberg upsert mode, create a v2 table like this:
> create table upsert_update_time_test(
> id bigint comment 'pk',
> data bigint comment 'data',
> update_time string comment 'update_time'
> )
> comment 'upsert_update_time_test'
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
> TBLPROPERTIES (
> 'engine.hive.enabled'='true',
> 'write.metadata.delete-after-commit.enabled'='true',
> 'write.target-file-size-bytes'='134217728',
> 'write.metadata.previous-versions-max'='5',
> 'write.metadata.metrics.default'='full',
> 'format-version'='2'
> );
Write data to Iceberg with Flink using code like the following:
> FlinkSink.forRow(rowDataStream, tableSchema)
> .tableLoader(tableLoader)
> .tableSchema(tableSchema)
> .upsert(true)
> .writeParallelism(1)
> .equalityFieldColumns(ImmutableList.of("id"))
> .append();
Then send data like this:
> $ nc -lk 3287
> I,1,101,2022-08-26 15:44:50
> U,1,103,2022-08-26 15:45:23

Finally, Hive and Spark both return the following query results:
> select * from upsert_update_time_test;
> OK
> upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
> 1 103 2022-08-26 15:45:23
> Time taken: 0.107 seconds, Fetched: 1 row(s)
> hive (iceberg_yx)> select * from upsert_update_time_test where update_time <= '2022-08-26 15:45:00';
> OK
> upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
> 1 101 2022-08-26 15:44:50
> Time taken: 0.76 seconds, Fetched: 1 row(s)
> hive (iceberg_yx)> select * from upsert_update_time_test where update_time <= '2022-08-26 15:46:00';
> OK
> upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
> 1 103 2022-08-26 15:45:23
> Time taken: 1.26 seconds, Fetched: 1 row(s)
> hive (iceberg_yx)> select * from upsert_update_time_test where data <= 102;
> OK
> upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
> 1 101 2022-08-26 15:44:50
> Time taken: 0.119 seconds, Fetched: 1 row(s)
> hive (iceberg_yx)> select * from upsert_update_time_test where data <= 103;
> OK
> upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
> 1 103 2022-08-26 15:45:23
> Time taken: 0.114 seconds, Fetched: 1 row(s)
> hive (iceberg_yx)> select * from upsert_update_time_test where id = 1;
> OK
> upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
> 1 103 2022-08-26 15:45:23
> Time taken: 0.134 seconds, Fetched: 1 row(s)


The query results above show that the v2 table **_returns the historical
version of a row when the filter matches only the historical values_**. Is this
a bug, or is there something wrong with my setup? Has anybody else run into this?
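
To make the question concrete, here is a toy model of the reads above. It is only a sketch and not Iceberg code: every class and method name in it is made up for illustration. It assumes that the `I` and `U` events were committed separately, and that the equality delete written for the `U` event is keyed on `id` but stores the full new row. `expectedRead` applies the delete before the query predicate, which is what I would expect from a v2 table. `prunedDeleteRead` is a purely hypothetical faulty path in which the delete is dropped whenever its own stored row fails the predicate; I have not confirmed that this is what actually happens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: none of these names are Iceberg classes.
final class UpsertReadModel {

    // One row of upsert_update_time_test plus the sequence number of the commit that wrote it.
    static final class Row {
        final long id;
        final long data;
        final String updateTime;
        final long seq;

        Row(long id, long data, String updateTime, long seq) {
            this.id = id;
            this.data = data;
            this.updateTime = updateTime;
            this.seq = seq;
        }
    }

    // Assumption: the I and U events were committed separately (sequence numbers 1 and 2).
    static final List<Row> DATA_ROWS = List.of(
        new Row(1, 101, "2022-08-26 15:44:50", 1),
        new Row(1, 103, "2022-08-26 15:45:23", 2));

    // Assumption: the equality delete for the U event is keyed on id but stores the full new row.
    static final Row EQ_DELETE = new Row(1, 103, "2022-08-26 15:45:23", 2);

    // An equality delete removes rows with the same key written by strictly older commits.
    static boolean isDeleted(Row r) {
        return r.id == EQ_DELETE.id && r.seq < EQ_DELETE.seq;
    }

    // Expected read: apply the delete unconditionally, then the query predicate.
    static List<Row> expectedRead(Predicate<Row> filter) {
        return read(filter, true);
    }

    // Hypothetical faulty read: the delete is dropped when its own stored row fails the predicate.
    static List<Row> prunedDeleteRead(Predicate<Row> filter) {
        return read(filter, filter.test(EQ_DELETE));
    }

    private static List<Row> read(Predicate<Row> filter, boolean applyDelete) {
        List<Row> out = new ArrayList<>();
        for (Row r : DATA_ROWS) {
            if (applyDelete && isDeleted(r)) {
                continue; // row was superseded by the upsert
            }
            if (filter.test(r)) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Predicate<Row> dataAtMost102 = r -> r.data <= 102;
        System.out.println("expected rows: " + expectedRead(dataAtMost102).size());
        System.out.println("pruned-delete rows: " + prunedDeleteRead(dataAtMost102).size());
    }
}
```

In this model, `expectedRead` returns no rows for `data <= 102`, while `prunedDeleteRead` returns the stale `101` row, matching the Hive output above. The hypothetical path also matches the other four filtered queries, so it may be a useful starting point for whoever looks into this.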
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.