odunHub opened a new issue, #5007:
URL: https://github.com/apache/paimon/issues/5007

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Paimon version
   
   1.1-SNAPSHOT
   
   ### Compute Engine
   
   - Using Flink to write data to Paimon with Iceberg compatibility.
   - AWS Athena, Spark, Flink SQL to query the Iceberg table.
   
   ### Minimal reproduce step
   
   Create a Paimon table with Iceberg compatibility enabled, partitioned by 
a `string` field. Querying the Iceberg table with a predicate on the 
partition field will not match any data. The Paimon table itself can be queried 
by the partition field, but the Iceberg table cannot.
   
   - Create a Paimon table with Iceberg compatibility enabled, partitioned by 
a string field such as `event_date`.
   - Write data into the table, ensuring values like `2024-12-30` are stored in 
the partition.
   - Query the Iceberg table using equality predicates, such as: `SELECT * FROM 
iceberg_table WHERE event_date = '2024-12-30';`
   - Observe that no results are returned due to the null character suffix in 
the manifest files. 
   - Rerun the query but apply the `TRIM` function or wildcard filtering, e.g.:
      -  `SELECT * FROM iceberg_table WHERE TRIM(event_date) = '2024-12-30';`
      -  `SELECT * FROM iceberg_table WHERE event_date LIKE '2024-12-30%';`
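   The workarounds above succeed because the stored partition value is the 
expected string plus a trailing NUL character. A minimal Python sketch (the 
`stored` value below is hypothetical, modeled on what the manifest files show) 
illustrates why plain equality fails while `TRIM` and `LIKE` succeed:

   ```python
   # Hypothetical stored partition value, as observed in the manifest
   # files: the expected string plus a trailing NUL character.
   stored = "2024-12-30\u0000"
   queried = "2024-12-30"

   # Plain equality fails because of the extra byte.
   print(stored == queried)                   # False

   # TRIM-style workaround: strip the trailing NUL before comparing.
   print(stored.rstrip("\u0000") == queried)  # True

   # LIKE '2024-12-30%' workaround: prefix match succeeds.
   print(stored.startswith(queried))          # True
   ```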
   
   ### What doesn't meet your expectations?
   
   The expectation is that the Iceberg table should accurately reflect the 
partition values of the underlying Paimon table, without altering them during 
serialization. The null character suffix in the manifest files prevents 
successful querying by various client applications (Spark, Flink, Athena).
   
   
   
   ### Anything else?
   
   Concretely, when we inspect the Avro manifest files, we see that the column 
stats and partition summary values have a `\u0000` suffix, e.g.
   
   Snapshot metadata file:
   ```
   {
       "manifest_path": 
"s3a://some-bucket/some-prefix/warehouse/some_db/some_table/metadata/dc8e3d96-4144-4853-8ad6-1959ccac318e-m1.avro",
       "manifest_length": 10207,
       "partition_spec_id": 0,
       "content": 0,
       "sequence_number": 1,
       "min_sequence_number": 1,
       "added_snapshot_id": 1,
       "added_data_files_count": 26,
       "existing_data_files_count": 0,
       "deleted_data_files_count": 0,
       "added_rows_count": 3011776,
       "existing_rows_count": 0,
       "deleted_rows_count": 0,
       "partitions": "[{\"contains_null\": false, \"contains_nan\": false, 
\"lower_bound\": \"2024-12-16\\u0000\", \"upper_bound\": 
\"2025-01-20\\u0000\"}]"
     }
   ```
   
   Manifest file column stats:
   ```
   {
     "key": 13,
     "value": "2025-01-04\u0000"
   }
   ```
   
   As a result, when a client (e.g. Spark or Athena) scans the Iceberg table, 
it skips all the data files after failing to find any manifests that match the 
predicate given in the query.
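   A rough sketch of why the scan skips everything (this models generic 
Iceberg-style min/max pruning, not any engine's actual code): with the 
trailing `\u0000`, the recorded lower bound sorts strictly greater than the 
queried literal, so an equality predicate falls outside the `[lower, upper]` 
range and the file is wrongly pruned.

   ```python
   def may_contain(lower: str, upper: str, value: str) -> bool:
       """Generic min/max pruning: keep a file only if the queried value
       could fall within its recorded [lower, upper] bounds."""
       return lower <= value <= upper

   # Bounds as written into the manifest, with the spurious trailing NUL.
   lower = upper = "2024-12-30\u0000"

   # The predicate's literal has no NUL, so it sorts *below* the lower
   # bound and the file is skipped.
   print(may_contain(lower, upper, "2024-12-30"))        # False

   # Only a literal carrying the same NUL suffix would match.
   print(may_contain(lower, upper, "2024-12-30\u0000"))  # True
   ```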
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

