szehon-ho opened a new issue, #5543: URL: https://github.com/apache/iceberg/issues/5543
### Apache Iceberg version

main (development)

### Query engine

Spark

### Please describe the bug 🐞

I found this problem while working on https://github.com/apache/iceberg/pull/5376#discussion_r934960703, which now attempts to convert metrics to human-readable ones and hit an exception, so I am reporting the problem here.

See the test `TestIcebergSourceTablesBase::testFilesTableWithSnapshotIdInheritance`:
https://github.com/apache/iceberg/blob/5f5c9235c10ed4a711a64de880491b3ae4f348ec/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java#L466

**Setup:** The Parquet table is partitioned, so when we insert data into it, the data file contains only one column (`data`). The `id` column is the partition column, so it does not exist in the file.

**Code flow:** The import code (via `TableMigrationUtil::listPartition` -> `TableMigrationUtil::getParquetMetrics` -> `ParquetUtil::footerMetrics`) performs the following steps:

1. Assign field IDs.
2. Calculate metrics.

In the first step, it sees that the Parquet file schema does not have field IDs (expected) and assigns them using `ParquetSchemaUtil::addFallbackIds`, which starts at 1, so the `data` column now has field ID 1.

The second step calculates metrics for the `data` column and puts them in the map keyed by id=1. However, in the destination Iceberg table schema, `id`=1 and `data`=2. So when we try to read the metrics, they are attributed to the wrong columns.
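To make the ID mismatch concrete, here is a minimal Python sketch (not the Iceberg API; all names are illustrative) of how fallback IDs assigned in file-column order diverge from the table schema IDs when a partition column is absent from the data file, and how a name-based remap would restore the intended mapping:

```python
def add_fallback_ids(file_columns):
    """Mimic ParquetSchemaUtil::addFallbackIds: ids 1..n in file order."""
    return {name: i for i, name in enumerate(file_columns, start=1)}

# The partitioned file contains only the 'data' column; 'id' is the
# partition column and is not stored in the file.
file_columns = ["data"]
fallback_ids = add_fallback_ids(file_columns)        # {'data': 1}

# Metrics (as in footerMetrics) end up keyed by the fallback id.
metrics_by_id = {fallback_ids["data"]: {"value_count": 100}}

# The destination Iceberg table schema numbers the columns differently.
table_schema_ids = {"id": 1, "data": 2}

# Reading metrics against the table schema attributes them wrongly:
wrong = {name: metrics_by_id.get(fid) for name, fid in table_schema_ids.items()}
# 'id' picks up the metrics that belong to 'data'; 'data' gets nothing.

# A name-based remap (one possible fix) re-keys metrics to table schema ids:
remapped = {table_schema_ids[name]: metrics_by_id[fid]
            for name, fid in fallback_ids.items()}
```

This is only a model of the behavior described above; the real fix would need to remap the metric keys (or assign file IDs from the table schema) inside the migration path.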
