aokolnychyi commented on PR #14264:
URL: https://github.com/apache/iceberg/pull/14264#issuecomment-3821104935

   What I like about this PR is the attempt to keep per-snapshot state like 
`DeleteFileIndex` and the incremental approach to building that state. That 
said, there seem to be some correctness issues in the implementation.
   
   I found some old notes on this topic and spent some time thinking about it. 
Sharing them below.
   
   ## CDC tasks
   
   - `AddedRowsScanTask` (data file + deletes assigned in the same snapshot)
   - `DeletedDataFileScanTask` (data file + existing deletes)
   - `DeletedRowsScanTask` (data file + added deletes + existing deletes)
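
   A minimal self-contained sketch of what each task type carries, using hypothetical local records as stand-ins for the real `org.apache.iceberg` interfaces:

```java
import java.util.List;

// Hypothetical mirror of the changelog task hierarchy; the real interfaces
// live in org.apache.iceberg and expose ContentFile metadata, not strings.
interface CdcTask {}

record AddedRows(String dataFile, List<String> sameSnapshotDeletes) implements CdcTask {}

record DeletedDataFile(String dataFile, List<String> existingDeletes) implements CdcTask {}

record DeletedRows(String dataFile, List<String> addedDeletes,
                   List<String> existingDeletes) implements CdcTask {}

class CdcTasks {

  // Summarize which rows each task type emits.
  static String emittedRows(CdcTask task) {
    if (task instanceof AddedRows t) {
      return "rows of " + t.dataFile() + " not masked by same-snapshot deletes";
    } else if (task instanceof DeletedDataFile t) {
      return "rows of " + t.dataFile() + " that survived existing deletes";
    } else if (task instanceof DeletedRows t) {
      return "rows of " + t.dataFile() + " hit by added but not existing deletes";
    }
    throw new IllegalArgumentException("Unknown task: " + task);
  }
}
```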
   
   ## Potential Algorithm
   
   Iterate over changelog snapshots (ignoring logical rewrites).
   
   - Find new data and delete files added across the changelog range (look for 
new manifests and ADDED manifest entry statuses).
   - Find data and delete files removed across the changelog range (look for 
new manifests and DELETED manifest entry statuses).
   - If there are removed data files or new deletes, build the set of affected 
partitions and the set of affected data file locations for pruning base data 
and delete manifests. This combined predicate may be used for pruning 
alongside the user-provided filter; it essentially tells us which partitions 
have been affected.
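
   The pruning-set construction in the last step could be sketched like this (a hypothetical `FileRef` record stands in for Iceberg's `ContentFile` metadata):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for manifest entry metadata; the real code would
// read partition tuples and locations from DataFile/DeleteFile entries.
record FileRef(String partition, String location) {}

class ChangelogPruningSets {

  // Partitions touched by removed data files and new delete files across
  // the changelog range; combined with the user-provided filter, this set
  // prunes base data and delete manifests.
  static Set<String> affectedPartitions(List<FileRef> removedDataFiles,
                                        List<FileRef> newDeleteFiles) {
    Set<String> partitions = new HashSet<>();
    removedDataFiles.forEach(f -> partitions.add(f.partition()));
    newDeleteFiles.forEach(f -> partitions.add(f.partition()));
    return partitions;
  }

  // Locations of affected data files, for file-level pruning of base
  // manifest entries.
  static Set<String> affectedDataFileLocations(List<FileRef> removedDataFiles) {
    Set<String> locations = new HashSet<>();
    removedDataFiles.forEach(f -> locations.add(f.location()));
    return locations;
  }
}
```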
   
   ### AddedRowsScanTask generation
   
   The only delete lookup needed for new data files is for DVs assigned in 
the same snapshot. As in this PR, we can simply build a `DeleteFileIndex` 
from the delete files added in that snapshot. This information is sufficient 
to emit this type of task; we don’t need to analyze existing deletes here.
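
   A minimal sketch of that per-snapshot lookup, with a plain map playing the role `DeleteFileIndex` plays in the PR (the `Dv` record and names are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical DV model: a deletion vector references exactly one data file.
record Dv(String referencedDataFile, String location) {}

class AddedRowsTaskPlanner {

  // Index only the DVs committed in this snapshot, keyed by the data file
  // they reference.
  static Map<String, Dv> indexSnapshotDvs(List<Dv> newDvsInSnapshot) {
    Map<String, Dv> index = new HashMap<>();
    newDvsInSnapshot.forEach(dv -> index.put(dv.referencedDataFile(), dv));
    return index;
  }

  // For a data file added in the same snapshot, this same-snapshot index is
  // the only delete lookup needed; historic deletes are irrelevant.
  static Optional<Dv> dvFor(String newDataFile, Map<String, Dv> snapshotDvIndex) {
    return Optional.ofNullable(snapshotDvIndex.get(newDataFile));
  }
}
```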
   
   ### DeletedDataFileScanTask generation
   
   For each removed data file, we need to find the deletes that were 
associated with it. We have to scan the effective delete file list for this 
changelog snapshot (base + incremental changes up to this snapshot) while 
doing the lookup. We don’t have manifest affinity in V3, but we can prune the 
base delete set using the user-provided predicate and by collecting the 
partitions and file names of all affected data files in the changelog range.
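
   Maintaining that effective delete file list incrementally per changelog snapshot could look roughly like this (a sketch; delete files are represented by their locations):

```java
import java.util.HashSet;
import java.util.Set;

class EffectiveDeletes {

  // Advance the effective delete set from one changelog snapshot to the
  // next: base + incremental changes up to and including this snapshot.
  // Delete files committed in the snapshot are added; removed ones dropped.
  static Set<String> advance(Set<String> effectiveSoFar,
                             Set<String> addedInSnapshot,
                             Set<String> removedInSnapshot) {
    Set<String> next = new HashSet<>(effectiveSoFar);
    next.addAll(addedInSnapshot);
    next.removeAll(removedInSnapshot);
    return next;
  }
}
```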
   
   If the table only contains DVs, the set of delete files removed in that 
snapshot will contain the old DV. When dropping data files, engines must drop 
the DV too (double-check this statement). This means we don’t have to scan 
through base deletes if we know we are only dealing with DVs.
   
   It is likely OK to ignore delete compactions that happened during the 
changelog range and use old deletes as their content should be the same.
   
   ### DeletedRowsScanTask generation
   
   For each new delete file, we have to find its data file and the list of 
previous deletes.
   
   If the added delete is a DV, the old DV (if one existed) must be in the set 
of DVs removed in that snapshot, so we won’t have to scan historic deletes.
   
   We can use the same partition set and file name filter to find data files, 
but we also have to scan the effective set of data files. The applicable data 
file must be either in the base file set or added in ANY snapshot after that, 
including REWRITEs that we ignore while selecting changelog snapshots. 
Therefore, we have to check compaction outputs, files newly added in the 
changelog range, and potentially the base set of data files (filtered by 
affected files and partitions).
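
   The lookup order described above could be sketched as follows (hypothetical names; files are represented by their locations):

```java
import java.util.Optional;
import java.util.Set;

class DeletedRowsTaskPlanner {

  // Resolve the data file a new delete file references: it must be a
  // compaction (REWRITE) output, a file newly added in the changelog range,
  // or a member of the pruned base data file set.
  static Optional<String> resolveDataFile(String referencedPath,
                                          Set<String> rewriteOutputs,
                                          Set<String> addedInRange,
                                          Set<String> prunedBaseFiles) {
    if (rewriteOutputs.contains(referencedPath)
        || addedInRange.contains(referencedPath)
        || prunedBaseFiles.contains(referencedPath)) {
      return Optional.of(referencedPath);
    }
    return Optional.empty();
  }
}
```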


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
