[GitHub] [iceberg] rbalamohan opened a new issue, #5927: Vectorized reading of parquet in an updated table with 'merge-on-read' returns wrong results

GitBox Thu, 06 Oct 2022 01:43:17 -0700


rbalamohan opened a new issue, #5927:
URL: https://github.com/apache/iceberg/issues/5927


   ### Apache Iceberg version
   
   0.14.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   With the "merge-on-read" option enabled, update queries create positional
   delete files. Subsequent select queries on the updated table then return
   wrong results. This does not happen with a single split, but happens
   consistently with >= 2 splits.
   
   Repro steps:
   ==========
   1. Create the store_sales table from TPCDS (3 TB). The single partition
   "ss_sold_date_sk = 2452612" is enough to reproduce the issue.
   
   2. Create an Iceberg table (store_sales_parq_test_merge_on_read) with merge-on-read table properties, e.g.
   TBLPROPERTIES('write.delete.mode'='merge-on-read', 'write.update.mode'='merge-on-read', 'write.merge.mode'='merge-on-read')
   
   3. sql("insert overwrite store_sales_parq_test_merge_on_read select * from tpcds_sf3000_withdecimal_withdate_withnulls.store_sales where ss_sold_date_sk = 2452612 limit 3000000")
   
   4. sql("select count(*) from store_sales_parq_test_merge_on_read_del").show(false);
   +--------+
   |count(1)|
   +--------+
   |3000000 |
   +--------+
   
   5. sql("alter table store_sales_parq_test_merge_on_read_del set tblproperties('format-version'='2')");
   
   6. sql("update store_sales_parq_test_merge_on_read_del set ss_ext_discount_amt=0.0 where ss_ext_discount_amt is null and ss_sold_date_sk = 2452612")
   
   _This should have updated every row in the table where "ss_ext_discount_amt
   is null". The following query nevertheless still returns such rows._
   
   7. sql("select count(*) from store_sales_parq_test_merge_on_read_del where ss_ext_discount_amt is null and ss_sold_date_sk = 2452612").show(false);
   
   +--------+
   |count(1)|
   +--------+
   |3670    | 
   +--------+
   This is the issue: the previous update rewrote every row where
   "ss_ext_discount_amt is null", so this count should be 0 instead of 3670.
   
   A temporary workaround is to explicitly disable vectorization by adding
   "'read.parquet.vectorization.enabled'='false'" to the table properties.
   In that case, the query correctly returns 0.
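   
   The workaround above could be applied roughly as follows. This is only a
   sketch: it assumes a spark-shell session (where `sql` is in scope), reuses
   the table name from the repro steps, and uses the same "alter table ... set
   tblproperties" form as step 5.
   
   ```scala
   // Assumption: spark-shell session with the Iceberg catalog already configured.
   // Disable vectorized Parquet reads for this table only.
   sql("alter table store_sales_parq_test_merge_on_read_del set tblproperties('read.parquet.vectorization.enabled'='false')")
   
   // Re-run the check from step 7; per the report above, the count is 0 with vectorization disabled.
   sql("select count(*) from store_sales_parq_test_merge_on_read_del where ss_ext_discount_amt is null and ss_sold_date_sk = 2452612").show(false)
   ```
   
   Setting the property back to 'true' re-enables the vectorized read path,
   which is useful for comparing the two results on the same data.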
   
   Unfortunately I couldn't attach the sample data here, as it is 400+ MB. 
   


