paul-bormans-pcgw opened a new issue, #11687:
URL: https://github.com/apache/iceberg/issues/11687
### Query engine
1. PyIceberg
2. Trino
### Question
I'm running a test (on docker-compose) where new data is appended
(FastAppend) roughly once per second while, concurrently, Trino runs a query
to DELETE data older than 2 hours.
The DELETE fails with an exception like this:
```
trino:ts> DELETE FROM pack WHERE epoch_timestamp_tz <= timestamp '2024-12-02 12:30 UTC' AND timestampns < 1.7331426374286223e+18;
Query 20241202_145700_00052_2bt64, FAILED, 1 node
Splits: 557 total, 556 done (99.82%)
29.45 [21.6M rows, 206MiB] [734K rows/s, 7.01MiB/s]
Query 20241202_145700_00052_2bt64 failed: Failed to commit the transaction during write: Found conflicting files that can contain records matching true: [s3://demobucket/ts.db/pack/data/source_id=s00000/epoch_hours=2024-12-02-14/00000-0-076ac96f-51b3-48d3-9a68-c7f971278ada.parquet, s3://demobucket/ts.db/pack/data/source_id=s00000/epoch_hours=2024-12-02-14/00000-0-92c59e44-c8db-4f99-a9bb-f0a2d4fbc164.parquet]
Caused by: org.apache.iceberg.exceptions.ValidationException: Found conflicting files that can contain records matching true: [s3://demobucket/ts.db/pack/data/source_id=s00000/epoch_hours=2024-12-02-14/00000-0-62caa502-27ad-4f0c-aabf-41d1bb3198fa.parquet]
    at org.apache.iceberg.MergingSnapshotProducer.validateAddedDataFiles(MergingSnapshotProducer.java:347)
    at org.apache.iceberg.BaseRowDelta.validate(BaseRowDelta.java:130)
```
As the file paths show, I'm using a PartitionSpec:
```
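from pyiceberg.transforms import HourTransform

# `table` is an already-loaded PyIceberg Table
with table.update_spec() as update:
    # Partition by the hour of the event timestamp
    update.add_field(
        source_column_name="epoch_timestamp_tz",
        transform=HourTransform(),
        partition_field_name="epoch_hours",
    )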
```
Since the ingestion only appends new data, no new data files are added to the
partition (epoch_hours=2024-12-02-12) where the DELETE is running. Why do I
still get this exception?
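To make the partition reasoning concrete, here is a minimal plain-Python sketch that maps the two DELETE cutoffs from the query to their hour partitions and compares them with the partition named in the conflicting file paths. It does not call PyIceberg or Trino. The helper names (`hours_since_epoch`, `partition_label`, `label_to_hours`) are hypothetical, and the label format is an assumption read off the paths in the error message. The only Iceberg fact relied on is that the hour transform counts whole hours since 1970-01-01 00:00 UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumed label format, taken from the epoch_hours=... segment of the file paths.
LABEL_FORMAT = "%Y-%m-%d-%H"


def hours_since_epoch(ts: datetime) -> int:
    """Whole hours from 1970-01-01 00:00 UTC to ts, rounded down."""
    return (ts - EPOCH) // timedelta(hours=1)


def partition_label(ts: datetime) -> str:
    """Human-readable hour partition for ts, in UTC."""
    return ts.astimezone(timezone.utc).strftime(LABEL_FORMAT)


def label_to_hours(label: str) -> int:
    """Convert a label such as '2024-12-02-14' back to hours since the epoch."""
    parsed = datetime.strptime(label, LABEL_FORMAT).replace(tzinfo=timezone.utc)
    return hours_since_epoch(parsed)


# The two predicates of the DELETE statement.
delete_cutoff = datetime(2024, 12, 2, 12, 30, tzinfo=timezone.utc)
ns_cutoff = datetime.fromtimestamp(1.7331426374286223e18 / 1e9, tz=timezone.utc)

# Partition of the files reported as conflicting.
conflict_label = "2024-12-02-14"

newest_deleted_hour = max(hours_since_epoch(delete_cutoff), hours_since_epoch(ns_cutoff))
conflict_hour = label_to_hours(conflict_label)

print("DELETE cutoff partitions:", partition_label(delete_cutoff), partition_label(ns_cutoff))
print("conflicting partition:   ", conflict_label)
print("hour partitions apart:   ", conflict_hour - newest_deleted_hour)
```

Under these assumptions both cutoffs fall in the 12:00 hour partition, while the files named in the exception sit in a later one, which is the mismatch this question is about.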
Do I need to set any additional configuration for "data conflict filters"?
Any guidance and/or best practices would be appreciated.
Paul