aokolnychyi commented on PR #14264: URL: https://github.com/apache/iceberg/pull/14264#issuecomment-3821104935
What I like about this PR is the attempt to keep per-snapshot state, like `DeleteFileIndex`, and the incremental approach to building that state. It does seem like there are some correctness issues in the implementation, however. I found some old notes on this topic and spent some time thinking about it. Sharing them below.

## CDC tasks

- `AddedRowsScanTask` (data file + deletes assigned in the same snapshot)
- `DeletedDataFileScanTask` (data file + existing deletes)
- `DeletedRowsScanTask` (data file + added deletes + existing deletes)

## Potential Algorithm

Iterate over changelog snapshots (ignoring logical rewrites).

- Find new data and delete files added across the changelog range (look for new manifests and ADDED manifest entry statuses).
- Find data and delete files that have been removed across the changelog range (look for new manifests and DELETED manifest entry statuses).
- If there are removed data files or new deletes, build the set of affected partitions and the set of affected data file locations for pruning the base data and delete manifests. This combined predicate may be used for pruning alongside the user-provided filter; it essentially tells us which partitions have been affected.

### AddedRowsScanTask generation

The only delete lookup that needs to happen for new data files is for DVs assigned in the same snapshot. Like in this PR, we can simply build a `DeleteFileIndex` using the new delete files from that snapshot. This information is sufficient to emit this type of task. We don't need to analyze existing deletes here.

### DeletedDataFileScanTask generation

For each removed data file, we need to find the deletes that were associated with it. We have to scan the effective delete file list for this changelog snapshot (base + incremental changes up to this snapshot) while doing the lookup.
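To make that lookup concrete, here is a minimal, language-agnostic sketch (Python for brevity; this is not Iceberg's Java API, and names like `plan_deleted_data_file_tasks`, `referenced_data_file`, and the two delete lists are illustrative assumptions):

```python
# Hypothetical sketch of DeletedDataFileScanTask generation: for each removed
# data file, look up its deletes in the effective delete set for this changelog
# snapshot (base deletes + deletes added incrementally up to this snapshot).
from dataclasses import dataclass

@dataclass(frozen=True)
class DeleteFile:
    path: str
    referenced_data_file: str  # data file a DV applies to

@dataclass(frozen=True)
class DeletedDataFileScanTask:
    data_file: str
    existing_deletes: tuple

def plan_deleted_data_file_tasks(removed_data_files, base_deletes, incremental_deletes):
    # Effective delete set for this snapshot: base + incremental changes so far.
    effective = list(base_deletes) + list(incremental_deletes)
    tasks = []
    for data_file in removed_data_files:
        # A real implementation would first prune by the affected-partition set
        # and the file-name filter; here we simply match DVs by referenced file.
        deletes = tuple(d for d in effective if d.referenced_data_file == data_file)
        tasks.append(DeletedDataFileScanTask(data_file, deletes))
    return tasks
```

The point of the sketch is the lookup shape: the removed data file is matched against the *effective* delete set for the snapshot, not just the deletes of any single snapshot.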
We don't have manifest affinity in V3, but we can prune the base delete set using the user-provided predicate and by collecting the partitions and file names of all affected data files in the changelog range.

If the table only contains DVs, the set of removed delete files for that snapshot will contain the old DV: while dropping data files, engines must drop the DV too (double check this statement). This means we don't have to scan through base deletes if we know that we only deal with DVs.

It is likely OK to ignore delete compactions that happened during the changelog range and use the old deletes, as their content should be the same.

### DeletedRowsScanTask generation

For each new delete file, we have to find its data file and the list of previous deletes.

If the added delete is a DV, the old DV (if one existed) must be in the set of removed DVs in that snapshot, so we won't have to scan historic deletes.

We can use the same partition set and file name filter to find data files. We also have to scan the effective set of data files. The applicable data file will have to be either in the base file set or in ANY snapshot after that, including REWRITEs that we ignore while selecting changelog snapshots. Therefore, we will have to check compaction outputs, newly added files in the changelog range, and potentially the base set of data files (filtered by affected files and partitions).
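A minimal sketch of that DV case (Python for brevity; not Iceberg's Java API — `plan_deleted_rows_tasks`, `candidate_data_files`, and the removed-DV map are hypothetical names, and the assumption is the DV-only table described above):

```python
# Hypothetical sketch of DeletedRowsScanTask generation for a DVs-only table:
# each DV added in a changelog snapshot points at one data file, and the
# previous DV (if any) must be in the same snapshot's removed delete set,
# so no scan over historic deletes is needed.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeleteFile:
    path: str
    referenced_data_file: str  # data file a DV applies to

@dataclass(frozen=True)
class DeletedRowsScanTask:
    data_file: str
    added_delete: DeleteFile
    existing_deletes: tuple

def plan_deleted_rows_tasks(added_dvs, removed_dvs, candidate_data_files):
    # candidate_data_files: union of the base file set (pruned by the affected
    # partition/file-name filters), compaction outputs, and files newly added
    # in the changelog range.
    removed_by_file = {dv.referenced_data_file: dv for dv in removed_dvs}
    tasks = []
    for dv in added_dvs:
        if dv.referenced_data_file not in candidate_data_files:
            raise ValueError(f"no candidate data file found for {dv.path}")
        # Previous DV for this data file, if one was replaced in this snapshot.
        prior = removed_by_file.get(dv.referenced_data_file)
        existing = (prior,) if prior is not None else ()
        tasks.append(DeletedRowsScanTask(dv.referenced_data_file, dv, existing))
    return tasks
```

The lookup of the data file itself is the expensive part in practice; the sketch just assumes the caller has already assembled the candidate set from the three sources listed above.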
