[I] `system.add_files` utility does not support updated Partition Spec [iceberg]

via GitHub Wed, 20 Mar 2024 11:53:50 -0700


sfc-gh-asudhakar opened a new issue, #10008:
URL: https://github.com/apache/iceberg/issues/10008


   **Apache Iceberg version**
   v2
   
   **Query engine**
   Spark
   
   **Please describe the bug 🐞**
   
   **The issue**
   
   Files ingested into an Iceberg table using the `system.add_files` utility 
don't reflect the latest partition spec. They reflect the original 
partition spec instead, even if the source Parquet table's partitioning matches 
the latest partition spec.
   
   **Repro**
   
   1) Create an Iceberg table with Partition Spec A:
   `create table testIcebergTable (data string, p1 int, p2 int) USING ICEBERG 
partitioned by (p1);`
   
   2) Modify the partition spec by adding another partition field to make 
Partition Spec B:
   `alter table testIcebergTable add partition field p2;`
   
   3) Create a Parquet table whose partitioning matches the new Partition Spec 
B:
   `create table testParquetTable (data string, p1 int, p2 int) USING PARQUET 
partitioned by (p1, p2)`
   
   4) Insert data into Parquet table
   `insert into testParquetTable values ("hello", 10, 20)`
   
   5) Call the `system.add_files` utility with the source as the Parquet table 
and the destination as the Iceberg table
   `CALL system.add_files(table => 'testIcebergTable', source_table => 
'testParquetTable')`
   
   6) Run a select query on the Iceberg table
   `select * from testIcebergTable`
   
   The select query returns `NULL` for column p2. 
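
For convenience, the repro steps above can be collected into one script. This is only a sketch: it assumes an existing `SparkSession` (passed in as `spark`) whose catalog is already configured for Iceberg with the Iceberg SQL extensions enabled, since `ALTER TABLE ... ADD PARTITION FIELD` and `CALL` rely on them. The names `REPRO_STATEMENTS`, `CHECK_QUERY`, and `run_repro` are made up for illustration, and the string literal is written with single quotes instead of the double quotes used in step 4.

```python
# Statements taken from repro steps 1-5 above, in order.
REPRO_STATEMENTS = [
    # 1) Iceberg table with Partition Spec A (p1 only)
    "CREATE TABLE testIcebergTable (data string, p1 int, p2 int) "
    "USING ICEBERG PARTITIONED BY (p1)",
    # 2) Evolve to Partition Spec B (p1, p2)
    "ALTER TABLE testIcebergTable ADD PARTITION FIELD p2",
    # 3) Parquet table whose partitioning matches Partition Spec B
    "CREATE TABLE testParquetTable (data string, p1 int, p2 int) "
    "USING PARQUET PARTITIONED BY (p1, p2)",
    # 4) One row of source data
    "INSERT INTO testParquetTable VALUES ('hello', 10, 20)",
    # 5) Import the Parquet files into the Iceberg table
    "CALL system.add_files(table => 'testIcebergTable', "
    "source_table => 'testParquetTable')",
]

# 6) The query whose result shows NULL for p2
CHECK_QUERY = "SELECT * FROM testIcebergTable"


def run_repro(spark):
    """Run steps 1-5 on the given session, then return the rows from step 6."""
    for statement in REPRO_STATEMENTS:
        spark.sql(statement)
    return spark.sql(CHECK_QUERY).collect()
```

Calling `run_repro(spark)` in a fresh namespace should return the single inserted row, which makes it easy to check whether `p2` comes back as `NULL`.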
   
   The manifest file created by the `add_files` call contains only the value 
for partition column `p1`; it does not contain the value for `p2`. 
   
   The `partition-spec` entry in the manifest's Avro key-value metadata also 
shows a partition spec with only one partition field (corresponding to Partition 
Spec A from step 1 above). As a result, the value for partition column `p2` is 
lost and cannot be retrieved.
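
The same observation can probably be made without opening the Avro file, by querying Iceberg's `files` and `manifests` metadata tables, which expose `spec_id` and `partition_spec_id` respectively. The sketch below assumes the table lives in a namespace called `default` (the issue does not say which namespace was used) and that `spark` is an Iceberg-enabled `SparkSession`; `metadata_queries` and `show_metadata` are hypothetical helper names.

```python
def metadata_queries(table="default.testIcebergTable"):
    """Build queries showing which partition spec each file and manifest uses.

    The default table name assumes a `default` namespace (an assumption).
    """
    return {
        # One row per data file: its spec id and stored partition tuple
        "files": f"SELECT file_path, spec_id, partition FROM {table}.files",
        # One row per manifest: the spec id it was written with
        "manifests": (
            "SELECT path, partition_spec_id, added_data_files_count "
            f"FROM {table}.manifests"
        ),
    }


def show_metadata(spark, table="default.testIcebergTable"):
    """Print both metadata views for the given table."""
    for name, query in metadata_queries(table).items():
        print(name)
        spark.sql(query).show(truncate=False)
```

If the files added by `add_files` carry the spec id of Partition Spec A, and their `partition` struct holds only `p1`, that would match what the manifest file shows.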
   
   Note: if the Iceberg table is originally created with two partition columns, 
the select query returns both values. The same issue would appear, however, if 
a third partition field were added afterwards. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

