rbalamohan opened a new issue, #5927:
URL: https://github.com/apache/iceberg/issues/5927
### Apache Iceberg version
0.14.0
### Query engine
Spark
### Please describe the bug 🐞
Reading from a table that was updated with the "merge-on-read" option enabled
returns wrong results. With this option, update queries create positional
delete files, and subsequent select queries return incorrect counts. It doesn't
happen with a single split, but happens consistently with >= 2 splits.
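For anyone trying to reproduce the multi-split case, one way to influence split planning is the Iceberg table property `read.split.target-size`. This is only a sketch: the report does not say how the number of splits was varied, the 16 MB value is an arbitrary assumption, and whether it actually yields two or more non-empty splits depends on the data file and Parquet row group sizes.

```scala
// spark-shell snippet; `spark` is the session that spark-shell provides.
// Assumption: 16 MB (16777216 bytes) is an arbitrary target chosen so that
// the table's data file(s) are planned as several splits. The original
// report does not state which value, if any, was used.
spark.sql("""
  ALTER TABLE store_sales_parq_test_merge_on_read_del
  SET TBLPROPERTIES ('read.split.target-size'='16777216')
""")
```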
Repro steps:
==========
1. Create the store_sales table from TPCDS (3 TB). The single partition
"ss_sold_date_sk = 2452612" is enough.
2. Create an Iceberg table (store_sales_parq_test_merge_on_read) with
merge-on-read table properties, e.g.
TBLPROPERTIES('write.delete.mode'='merge-on-read',
'write.update.mode'='merge-on-read', 'write.merge.mode'='merge-on-read')
3. sql("insert overwrite store_sales_parq_test_merge_on_read select * from
tpcds_sf3000_withdecimal_withdate_withnulls.store_sales where ss_sold_date_sk =
2452612 limit 3000000")
4. sql("select count(*) from
store_sales_parq_test_merge_on_read_del").show(false);
+--------+
|count(1)|
+--------+
|3000000 |
+--------+
5. sql("alter table store_sales_parq_test_merge_on_read_del set
tblproperties('format-version'='2')");
6. sql("update store_sales_parq_test_merge_on_read_del set
ss_ext_discount_amt=0.0 where ss_ext_discount_amt is null and ss_sold_date_sk =
2452612")
_This should have updated all rows in the table where "ss_ext_discount_amt
is null". The following query shows the failure: it still returns matching
records._
7. sql("select count(*) from store_sales_parq_test_merge_on_read_del where
ss_ext_discount_amt is null and ss_sold_date_sk = 2452612").show(false);
+--------+
|count(1)|
+--------+
|3670 |
+--------+
This is the issue: the previous update set "ss_ext_discount_amt" to 0.0 for
every row where it was null, so this count should be 0 instead of 3670.
A temporary workaround is to explicitly disable vectorization by adding
"'read.parquet.vectorization.enabled'='false'" to the table properties.
In that case, the query correctly returns 0.
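The workaround could be applied roughly as follows. This is a minimal sketch that reuses the table name and filter from the steps above and assumes a spark-shell session with the Iceberg catalog already configured. According to the report, the count comes back as 0 once vectorization is disabled.

```scala
// spark-shell snippet; `spark` is the session that spark-shell provides.
// Disable vectorized Parquet reads for this table (the workaround
// described in the report).
spark.sql("""
  ALTER TABLE store_sales_parq_test_merge_on_read_del
  SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='false')
""")

// Re-run the check from step 7 with vectorization turned off.
spark.sql("""
  SELECT count(*) FROM store_sales_parq_test_merge_on_read_del
  WHERE ss_ext_discount_amt IS NULL AND ss_sold_date_sk = 2452612
""").show(false)
```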
Unfortunately I couldn't attach the sample data here, as it is 400+ MB.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]