szehon-ho commented on PR #6581:
URL: https://github.com/apache/iceberg/pull/6581#issuecomment-1387630786
After chatting with @aokolnychyi and @RussellSpitzer, here is a guide to when
this can be used.
There will be two types of operations that can remove delete files:
| Operation | Cost | File Type | Description |
| --- | --- | --- | --- |
| RemoveDanglingDeletes | Metadata-only, cost will be like querying the files/partition table | Both | Removes position deletes with sequence number less than the min sequence number of all data files in each partition |
| RewritePositionDeletes | Data operation, needs to read/write all concerned delete files | Position only (equality deletes will need to be converted to position deletes) | Reads all position delete files satisfying the given filter and writes them back out, filtering out position delete entries that refer to data files that no longer exist |
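
As a rough illustration of the RemoveDanglingDeletes rule in the table, here is a minimal Python sketch. It is not the action's implementation: `FileEntry` and `find_dangling_deletes` are hypothetical names, the entries stand in for rows of the files metadata table, and only the strict "less than" rule from the table is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one row of the files metadata table."""
    path: str
    partition: str
    sequence_number: int


def find_dangling_deletes(data_files, delete_files):
    """Return delete files older than every data file in their partition."""
    # Lowest data file sequence number per partition.
    min_seq = {}
    for f in data_files:
        current = min_seq.get(f.partition, float("inf"))
        min_seq[f.partition] = min(current, f.sequence_number)

    # A delete file is dangling when its sequence number is below that minimum.
    # A partition with no data files at all has nothing left to delete from.
    return [
        d for d in delete_files
        if d.sequence_number < min_seq.get(d.partition, float("inf"))
    ]


data_files = [
    FileEntry("data-1.parquet", "p=1", 5),
    FileEntry("data-2.parquet", "p=1", 7),
    FileEntry("data-3.parquet", "p=2", 2),
]
delete_files = [
    FileEntry("del-1.parquet", "p=1", 3),  # older than all data in p=1
    FileEntry("del-2.parquet", "p=1", 6),  # may still apply to data-1
    FileEntry("del-3.parquet", "p=2", 4),  # may still apply to data-3
]

dangling_paths = [d.path for d in find_dangling_deletes(data_files, delete_files)]
print(dangling_paths)  # -> ['del-1.parquet']
```

Only file-level metadata is consulted, which is why the cost stays comparable to a metadata table query.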
Use case: RemoveDanglingDeleteFiles is cheaper, and is the only one that works
across both types of delete files. However, for it to remove every obsolete
delete file, RewriteDataFiles needs to have been run with the following
conditions:
* A filter that includes entire partition(s).
* All data files in the partition with delete files get rewritten, i.e. any
  of these:
  * rewrite-all=true
  * delete-file-threshold=1
  * All data files happen to meet the rewrite criteria without these flags.
* 'use-starting-sequence-number' needs to be false. This is needed to properly
  identify old delete files as invalid using the sequence number rule. It is
  only needed for position deletes, as equality deletes are not applied to
  data files with an equal sequence number.
Note that RemoveDanglingDeleteFiles can still remove some delete files if these
conditions are not met, but it may not remove all of them, because an old data
file (one with a low sequence number) that was not rewritten will prevent
delete files from being removed.
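
To make the conditions above concrete, here is a toy single-partition model in Python. It is only a sketch under stated assumptions: `compact` and `dangling` are hypothetical helpers, not Iceberg APIs, and the only behavior modeled is which sequence number the compacted data file receives.

```python
from collections import namedtuple

# Hypothetical file record: a path plus the sequence number it was committed at.
File = namedtuple("File", ["path", "seq"])


def compact(data_files, rewrite, starting_seq, commit_seq,
            use_starting_sequence_number):
    """Toy compaction: replace the selected data files with one new file."""
    kept = [f for f in data_files if not rewrite(f)]
    if len(kept) == len(data_files):
        return kept  # nothing selected, nothing rewritten
    # The flag decides which sequence number the rewritten file is given.
    seq = starting_seq if use_starting_sequence_number else commit_seq
    return kept + [File("compacted.parquet", seq)]


def dangling(data_files, delete_files):
    """Delete files with a sequence number below every data file's."""
    min_seq = min((f.seq for f in data_files), default=float("inf"))
    return [d.path for d in delete_files if d.seq < min_seq]


data = [File("a.parquet", 1), File("b.parquet", 2)]
deletes = [File("pos-del.parquet", 3)]  # position deletes against a and b

# Compaction starts when the table is at sequence 3 and commits at sequence 4.
full_new_seq = compact(data, lambda f: True, 3, 4, False)
full_start_seq = compact(data, lambda f: True, 3, 4, True)
partial = compact(data, lambda f: f.path == "b.parquet", 3, 4, False)

print(dangling(full_new_seq, deletes))    # -> ['pos-del.parquet']
print(dangling(full_start_seq, deletes))  # -> []
print(dangling(partial, deletes))         # -> []
```

In this model the delete file is only identified as dangling when every data file is rewritten and the new file gets the commit sequence number. Keeping the starting sequence number, or leaving the old `a.parquet` in place, blocks the removal.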
So I'm open to whether there is a good use case for this. One idea is to
bundle this with RewriteDataFiles, and either trigger it when these conditions
are met, or trigger it in any case in the hope that it removes delete files,
as it is relatively cheap.
Otherwise, the complete solution (all to be developed) would be:
* For position deletes, run RewritePositionDeletes across all partitions.
* For equality deletes, run ConvertToPosDeletes, then RewritePositionDeletes
  across all partitions.
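
The filtering step that RewritePositionDeletes is described as performing could look roughly like this Python sketch. It is an assumption-laden illustration, not the planned implementation: `rewrite_position_deletes` is a hypothetical name, and each position delete entry is modeled as a `(file_path, pos)` pair.

```python
def rewrite_position_deletes(position_deletes, live_data_files):
    """Keep only the delete entries whose target data file still exists."""
    live = set(live_data_files)
    return [(path, pos) for path, pos in position_deletes if path in live]


# Entries read from the existing position delete files: (file_path, pos).
entries = [
    ("data-1.parquet", 0),
    ("data-1.parquet", 7),
    ("data-2.parquet", 3),  # data-2 was removed by an earlier rewrite
]
live_files = ["data-1.parquet", "data-3.parquet"]

rewritten_entries = rewrite_position_deletes(entries, live_files)
print(rewritten_entries)  # -> [('data-1.parquet', 0), ('data-1.parquet', 7)]
```

Unlike the sequence number check, this has to read every entry in the delete files, which is what makes it a data operation.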
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]