odunHub opened a new issue, #5007: URL: https://github.com/apache/paimon/issues/5007
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Paimon version

1.1-SNAPSHOT

### Compute Engine

- Using Flink to write data to Paimon with Iceberg compatibility.
- AWS Athena, Spark, Flink SQL to query the Iceberg table.

### Minimal reproduce step

Create a Paimon table with Iceberg compatibility enabled and partitioned by a `string` field. Querying the Iceberg table with a predicate on the partition field does not match any data. The Paimon table itself can be queried by the partition field, but the Iceberg table cannot.

- Create a Paimon table with Iceberg compatibility enabled and a string partition field, such as `event_date`.
- Write data into the table, ensuring values like `2024-12-30` are stored in the partition.
- Query the Iceberg table using an equality predicate, such as: `SELECT * FROM iceberg_table WHERE event_date = '2024-12-30';`
- Observe that no results are returned due to the null character suffix in the manifest files.
- Rerun the query but apply the trim function or wildcard filtering, e.g.:
  - `SELECT * FROM iceberg_table WHERE TRIM(event_date) = '2024-12-30';`
  - `SELECT * FROM iceberg_table WHERE event_date LIKE '2024-12-30%';`

### What doesn't meet your expectations?

The Iceberg table should accurately reflect the partitions defined in the underlying Paimon table, with no changes to the values during serialization. The null character suffix in the manifest files prevents successful querying by various client applications (Spark, Flink, Athena).

### Anything else?

Concretely, when we inspect the Avro manifest files, we see that the column stats and partition summary values have a `\u0000` suffix, e.g.
Snapshot metadata file:

```
{
  "manifest_path": "s3a://some-bucket/some-prefix/warehouse/some_db/some_table/metadata/dc8e3d96-4144-4853-8ad6-1959ccac318e-m1.avro",
  "manifest_length": 10207,
  "partition_spec_id": 0,
  "content": 0,
  "sequence_number": 1,
  "min_sequence_number": 1,
  "added_snapshot_id": 1,
  "added_data_files_count": 26,
  "existing_data_files_count": 0,
  "deleted_data_files_count": 0,
  "added_rows_count": 3011776,
  "existing_rows_count": 0,
  "deleted_rows_count": 0,
  "partitions": "[{\"contains_null\": false, \"contains_nan\": false, \"lower_bound\": \"2024-12-16\\u0000\", \"upper_bound\": \"2025-01-20\\u0000\"}]"
}
```

Manifest file column stats:

```
{
  "key": 13,
  "value": "2025-01-04\u0000"
}
```

As a result, when a client (e.g. Spark/Athena) performs a scan of the Iceberg table, it'll skip all the data files after failing to find any manifests that match the predicate given in the query.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
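To make the symptom concrete, here is a minimal Python sketch of how a trailing `\u0000` affects string comparisons. It is only an illustration of the comparison semantics, not the actual pruning code of Iceberg, Spark, or Athena; the helper `may_contain` and all variable names are hypothetical, and the sketch assumes bounds are compared as plain strings with an inclusive range check.

```python
def may_contain(lower: str, upper: str, literal: str) -> bool:
    """Hypothetical inclusive min/max check, similar in spirit to bound-based pruning."""
    return lower <= literal <= upper


# Partition value as it appears in the manifest (with the null suffix),
# and the literal a user would write in the WHERE clause.
stored = "2024-12-30\u0000"
literal = "2024-12-30"

# Plain equality fails because the stored value has one extra character.
equal_match = stored == literal

# When lower and upper bounds are both the padded value, the literal sorts
# strictly below the lower bound, so the range check excludes it.
range_match = may_contain(stored, stored, literal)

# Stripping the null suffix, or matching by prefix, restores the match.
# These mirror the TRIM and LIKE workarounds only loosely.
stripped_match = stored.rstrip("\u0000") == literal
prefix_match = stored.startswith(literal)

print(equal_match, range_match, stripped_match, prefix_match)
```

Under these assumptions, the equality and range checks both reject the literal, while the stripped and prefix comparisons accept it, which is consistent with the workarounds listed in the reproduce steps.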
