sfc-gh-asudhakar opened a new issue, #10008:
URL: https://github.com/apache/iceberg/issues/10008
**Apache Iceberg version**
v2
**Query engine**
Spark
**Please describe the bug 🐞**
**The issue**
Files ingested into an Iceberg table using the `system.add_files` utility
don't reflect the latest partition spec. They reflect the original
partition spec instead, even if the source Parquet table's partitioning matches
the latest partition spec.
**Repro**
1) Create an Iceberg table with Partition Spec A:
`create table testIcebergTable (data string, p1 int, p2 int) USING ICEBERG partitioned by (p1);`
2) Modify the partition spec by adding another partition field to make
Partition Spec B:
`alter table testIcebergTable add partition field p2;`
3) Create a Parquet table whose partitioning matches the new Partition Spec B:
`create table testParquetTable (data string, p1 int, p2 int) USING PARQUET partitioned by (p1, p2)`
4) Insert data into the Parquet table:
`insert into testParquetTable values ("hello", 10, 20)`
5) Call the `system.add_files` utility with the Parquet table as the source
and the Iceberg table as the destination:
`CALL system.add_files(table => 'testIcebergTable', source_table => 'testParquetTable')`
6) Run a select query on the Iceberg table:
`select * from testIcebergTable`
The select query returns `NULL` for column p2.
The manifest file created by the `add_files` call contains only the value
for partition column `p1`; it does not contain the value for `p2`.
The `partition-spec` entry in that Avro file's key-value metadata also
shows a partition spec with only 1 partition column (corresponding to Partition
Spec A from step 1 above). As a result, the value for partition column `p2` is
lost and cannot be retrieved.
Note: if the Iceberg table is originally created with 2 partition columns,
the select query returns both values. However, it would face a similar issue if
a 3rd partition field were added afterwards.