paul-bormans-pcgw opened a new issue, #11695:
URL: https://github.com/apache/iceberg/issues/11695
### Apache Iceberg version
1.6.1
### Query engine
Trino
### Please describe the bug 🐞
I'm running Iceberg on a compose setup and have 2 concurrent writers:
1) one doing appends using PyIceberg
2) one running a DELETE query using Trino, followed by an expire_snapshots
call, also using Trino.
I'm using the following properties when creating the table:
```
table = self.catalog.create_table(
    identifier=...,
    schema=...,
    properties={
        "gc.enabled": True,
        "commit.retry.num-retries": 4,
        "write.delete.isolation-level": "snapshot",
        "write.update.isolation-level": "snapshot",
        "write.merge.isolation-level": "snapshot",
    },
)
```
I'm also using a plain JDBC catalog; for instance, this is the Trino connector config:
```
connector.name=iceberg
iceberg.catalog.type=jdbc
iceberg.jdbc-catalog.catalog-name=sql
iceberg.jdbc-catalog.driver-class=org.postgresql.Driver
iceberg.jdbc-catalog.connection-url=jdbc:postgresql://postgres:5432/catalog
iceberg.jdbc-catalog.connection-user=postgres
iceberg.jdbc-catalog.connection-password=postgres
iceberg.jdbc-catalog.default-warehouse-dir=s3://demobucket
fs.native-s3.enabled=true
s3.endpoint=http\://minio\:9000/
s3.path-style-access=true
s3.region=us-east-1
s3.aws-access-key=minioadmin
s3.aws-secret-key=minioadmin
iceberg.expire-snapshots.min-retention=2h
iceberg.remove-orphan-files.min-retention=1h
```
PyIceberg is committing new data (FastAppend) every few seconds; for instance:
```
{
"snapshot-id": 2401014885715513300,
"parent-snapshot-id": 5514772428877076000,
"sequence-number": 3783,
"timestamp-ms": 1733304538706,
"manifest-list":
"s3://demobucket/ts.db/pack/metadata/snap-2401014885715513233-0-2609e33f-b31d-4425-bcc6-bd074de3012f.avro",
"summary": {
"operation": "append",
"added-files-size": "18618432",
"added-data-files": "1",
"added-records": "107768",
"changed-partition-count": "1",
"total-data-files": "3761",
"total-delete-files": "3399",
"total-records": "438076882",
"total-files-size": "72715662134",
"total-position-deletes": "393434960",
"total-equality-deletes": "0"
},
"schema-id": 0
},
{
"snapshot-id": 5605442414867702000,
"parent-snapshot-id": 2401014885715513300,
"sequence-number": 3784,
"timestamp-ms": 1733304550871,
"manifest-list":
"s3://demobucket/ts.db/pack/metadata/snap-5605442414867701673-0-c0eef6ae-41b7-45ee-a777-299279b632ea.avro",
"summary": {
"operation": "append",
```
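The `snapshot-id` values in the listing above end in zeros while the manifest-list file names carry different trailing digits, which looks like the JSON went through a tool that treats integers as 64-bit floats. When matching these entries against the Trino log it may help to take the ID from the file name instead. A minimal sketch, assuming the `snap-<snapshot-id>-<attempt>-<uuid>.avro` naming visible in this issue; `snapshot_id_from_manifest_list` is a hypothetical helper, not part of PyIceberg:

```python
import re

# Pattern for the manifest-list file names seen in this issue:
# snap-<snapshot-id>-<attempt>-<uuid>.avro (an assumption based on those names)
_SNAP_FILE = re.compile(r"snap-(\d+)-(\d+)-([0-9a-f-]+)\.avro$")


def snapshot_id_from_manifest_list(path: str) -> int:
    """Return the exact snapshot ID encoded in a manifest-list path."""
    match = _SNAP_FILE.search(path)
    if match is None:
        raise ValueError(f"unexpected manifest-list name: {path}")
    return int(match.group(1))


manifest_list = (
    "s3://demobucket/ts.db/pack/metadata/"
    "snap-2401014885715513233-0-2609e33f-b31d-4425-bcc6-bd074de3012f.avro"
)
exact_id = snapshot_id_from_manifest_list(manifest_list)
print(exact_id)

# A 64-bit float cannot hold this value exactly, so a float round trip changes it.
print(int(float(exact_id)) == exact_id)
```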
To clean up older data we run the following query:
```
DELETE FROM pack WHERE epoch_timestamp_tz <= timestamp '2024-12-?? ??:??'
AND timestampns < n
```
This correctly creates a delete commit; for instance:
```
{
"snapshot-id": 3134131428617513500,
"parent-snapshot-id": 2181664209442251000,
"sequence-number": 3631,
"timestamp-ms": 1733302382646,
"manifest-list":
"s3://demobucket/ts.db/pack/metadata/snap-3134131428617513346-2-43520ef4-077e-4b06-a8db-85c8f2c12e43.avro",
"summary": {
"operation": "delete",
"trino_query_id": "20241204_084910_00399_5fm9y",
"added-position-delete-files": "177",
"added-delete-files": "177",
"added-files-size": "27066930",
"added-position-deletes": "20433177",
"changed-partition-count": "1",
"total-records": "420753387",
"total-files-size": "69857546227",
"total-data-files": "3611",
"total-delete-files": "3219",
"total-position-deletes": "372374521",
"total-equality-deletes": "0",
"iceberg-version": "Apache Iceberg 1.6.1 (commit 8e9d59d299be42b0bca9461457cd1e95dbaad086)"
},
"schema-id": 0
},
```
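One thing that can be read off a summary like this is whether the commit removed any data files or only added delete files. A minimal sketch, assuming the summary has been copied into a Python dict; `delete_footprint` is a hypothetical helper, and `deleted-data-files` is the key I would expect for removed data files based on the snapshot summary fields in the Iceberg spec (it does not appear in the summary above, so it is treated as 0):

```python
def delete_footprint(summary: dict) -> dict:
    """Report what a snapshot removed or added, based only on its summary."""

    def count(key: str) -> int:
        # Summary values are strings; a missing key is treated as zero.
        return int(summary.get(key, "0"))

    return {
        # Assumed key name for data files dropped by the commit.
        "removed_data_files": count("deleted-data-files"),
        "added_position_delete_files": count("added-position-delete-files"),
        "added_position_deletes": count("added-position-deletes"),
    }


# Values copied from the delete snapshot shown above.
summary = {
    "operation": "delete",
    "added-position-delete-files": "177",
    "added-delete-files": "177",
    "added-position-deletes": "20433177",
    "total-data-files": "3611",
    "total-delete-files": "3219",
}
footprint = delete_footprint(summary)
print(footprint)
```

For this snapshot the sketch reports 177 added position delete files and no removed data files, which would fit a row-level delete that leaves the original data files referenced by the table.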
After the delete query we run expire_snapshots to clean up old snapshots AND
old data files that were removed by earlier delete operations; for instance:
```
ALTER TABLE pack EXECUTE expire_snapshots(retention_threshold => '3h')
```
From the Trino logs I can see that snapshots get expired, including the
delete-operation snapshots, BUT none of the actual data files are removed.
What are we missing here?
```
org.apache.iceberg.RemoveSnapshots Expiring snapshots older than:
2024-12-04T03:59:01.303+00:00 (1733284741303)
org.apache.iceberg.RemoveSnapshots Committed snapshot changes
org.apache.iceberg.RemoveSnapshots Cleaning up expired files (local,
incremental)
org.apache.iceberg.IncrementalFileCleanup Expired snapshot:
BaseSnapshot{id=4579799571894291545, timestamp_ms=1733284090666,
operation=append, summary={added-files-size=23432710, added-data-files=1,
added-records=131666, changed-partition-count=1, total-data-files=2134,
total-delete-files=1608, total-records=248302696, total-files-size=41244442373,
total-position-deletes=185726747, total-equality-deletes=0},
manifest-list=s3://demobucket/ts.db/pack/metadata/snap-4579799571894291545-0-87990d26-3c22-4df0-a590-76c2805d95f1.avro,
schema-id=0}
org.apache.iceberg.IncrementalFileCleanup Expired snapshot:
BaseSnapshot{id=4640762571463645975, timestamp_ms=1733284113116,
operation=append, summary={added-files-size=17668611, added-data-files=1,
added-records=116928, changed-partition-count=1, total-data-files=2135,
total-delete-files=1608, total-records=248419624, total-files-size=41262110984,
total-position-deletes=185726747, total-equality-deletes=0},
manifest-list=s3://demobucket/ts.db/pack/metadata/snap-4640762571463645975-0-4b118704-b931-42e7-b7b3-3453766c746e.avro,
schema-id=0}
<...>
org.apache.iceberg.IncrementalFileCleanup Expired snapshot:
BaseSnapshot{id=9027409204264082391, timestamp_ms=1733285284267,
operation=delete, summary={trino_query_id=20241204_040721_00185_5fm9y,
added-position-delete-files=183, added-delete-files=183,
added-files-size=27888755, added-position-deletes=21059718,
changed-partition-count=2, total-records=260746407,
total-files-size=43279826772, total-data-files=2242, total-delete-files=1791,
total-position-deletes=206786465, total-equality-deletes=0,
iceberg-version=Apache Iceberg 1.6.1 (commit
8e9d59d299be42b0bca9461457cd1e95dbaad086)},
manifest-list=s3://demobucket/ts.db/pack/metadata/snap-9027409204264082391-2-ad111312-2980-4071-9f7d-b819dfc1ed21.avro,
schema-id=0}
```
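To read the log above, it can help to convert the cutoff and compare it with the `timestamp_ms` of the snapshots reported as expired. A minimal sketch using only values copied from the first two log entries; `ms_to_utc` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime without float rounding."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Cutoff from the "Expiring snapshots older than" log line.
cutoff_ms = 1733284741303

# snapshot id -> timestamp_ms, from the first two "Expired snapshot" lines.
logged = {
    4579799571894291545: 1733284090666,
    4640762571463645975: 1733284113116,
}

print(ms_to_utc(cutoff_ms).isoformat(timespec="milliseconds"))
for snapshot_id, ts in logged.items():
    age_s = (cutoff_ms - ts) / 1000
    print(snapshot_id, "older than cutoff:", ts < cutoff_ms, f"by {age_s:.1f}s")
```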
I can only assume the old data files are still referenced by manifests? But
how can we investigate this? What are we missing?
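One rough way to investigate from the metadata alone is to check whether `total-data-files` ever goes down between snapshots. A minimal sketch, assuming the snapshot entries are available as dicts shaped like the metadata JSON; `data_file_drops` is a hypothetical helper, and the sample only contains the five snapshots quoted in this issue, so it is not a complete history:

```python
def data_file_drops(snapshots: list) -> list:
    """Return (timestamp-ms, before, after) for each drop in total-data-files."""
    ordered = sorted(snapshots, key=lambda s: s["timestamp-ms"])
    drops = []
    for prev, cur in zip(ordered, ordered[1:]):
        before = int(prev["summary"]["total-data-files"])
        after = int(cur["summary"]["total-data-files"])
        if after < before:
            drops.append((cur["timestamp-ms"], before, after))
    return drops


# timestamp-ms, operation and total-data-files copied from this issue.
snapshots = [
    {"timestamp-ms": 1733284090666, "summary": {"operation": "append", "total-data-files": "2134"}},
    {"timestamp-ms": 1733284113116, "summary": {"operation": "append", "total-data-files": "2135"}},
    {"timestamp-ms": 1733285284267, "summary": {"operation": "delete", "total-data-files": "2242"}},
    {"timestamp-ms": 1733302382646, "summary": {"operation": "delete", "total-data-files": "3611"}},
    {"timestamp-ms": 1733304538706, "summary": {"operation": "append", "total-data-files": "3761"}},
]
print(data_file_drops(snapshots))
```

An empty result for the quoted snapshots would mean none of them shows a net removal of data files. A commit that both adds and removes files could still hide a removal, so this is only a first check.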
The relevant source code seems to be:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java#L261C17-L261C30
Paul
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [X] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time