wirybeaver commented on PR #2203:
URL: https://github.com/apache/iceberg-rust/pull/2203#issuecomment-4356747375

   Thank you @mbutrovich — this is the most thorough review the PR has received 
and it caught real bugs. Here is a point-by-point response, followed by what 
was changed.
   
   ---
   
   ## Blocking Issues
   
   ### 1 & 2: Manifest rewriting + sequence numbers
   
   Both are now fixed together, as you predicted.
   
   `existing_manifest()` now rewrites any manifest that contains a deleted 
file, matching Java's `ManifestFilterManager.filterManifestWithDeletedFiles`:
   - Removed files → `DELETED` entry (snapshot_id updated to current commit, 
sequence numbers copied verbatim from the loaded entry)
   - Surviving files → `EXISTING` entry (all original fields, including 
sequence numbers, preserved)
   - Manifests with no deleted files → carried forward unchanged
   
   Since the sequence numbers are read directly off the loaded `ManifestEntry` 
(which has them set via inheritance during `load_manifest`), the TODO for 
sequence numbers is gone — fixing the manifest rewrite gave us correct sequence 
numbers for free, exactly as you described.
   
   To enable this, `SnapshotProduceOperation::existing_manifest` was changed to 
take `&mut SnapshotProducer<'_>` so implementations can call 
`snapshot_produce.new_manifest_writer()` to produce rewritten manifests.
   
   ### 3: Snapshot summary missing deleted-file metrics
   
   Fixed. Added `removed_data_files()` as a default method on 
`SnapshotProduceOperation` (returns `&[]` by default, overridden in 
`RowDeltaOperation`). `SnapshotProducer::summary()` now calls 
`summary_collector.remove_file()` for each removed file, populating 
`deleted-data-files`, `deleted-records`, and `removed-files-size`.
   
   ---
   
   ## Significant Issues
   
   ### 4: Tests don't exercise critical code paths
   
   Added `test_row_delta_cow_manifest_rewrite`: uses `FastAppendAction` to 
write file-A + file-B into a real snapshot S1 (backed by in-memory FileIO), 
then `RowDeltaAction` to remove file-A and add file-C. The test loads S2's 
manifest list and asserts:
   - file-A → `ManifestStatus::Deleted` with non-null sequence numbers
   - file-B → `ManifestStatus::Existing` with non-null sequence numbers  
   - file-C → `ManifestStatus::Added`
   - `deleted-data-files = 1` in the snapshot summary
   
   ### 5: `test_row_delta_remove_only` removes a phantom file
   
   You are correct that the concurrency concern (files removed by a concurrent 
commit) is separate from partition validation. That validation 
(`validateDataFilesExist`) is out of scope for this PR and belongs alongside 
the broader `validate_from_snapshot` / ancestry-check work. I've updated the 
reply to jdockerty's original comment to clarify this distinction.
   
   ---
   
   ## Design Concerns
   
   ### 6: `add_delete_files` silently drops files
   
   Fixed. `commit()` now returns `ErrorKind::FeatureUnsupported` immediately if 
`added_delete_files` is non-empty. The API is preserved so MoR support can wire 
it up without a breaking change.
   
   ### 7: `write_delete_manifest` naming confusion
   
   Renamed to `write_manifest_with_deleted_entries` with a comment 
distinguishing it from Iceberg delete manifests (content=Deletes).
   
   ---
   
   ## Nits
   
   ### 8: Verbose doc comments
   
   Trimmed. `RowDeltaAction` now has a 2-line struct comment; method comments 
match the single-sentence style used in the rest of the transaction module.
   
   ### 9 & 10: Noted as follow-ups
   
   - The "fast append" string in the shared validation error message would 
require a change in `snapshot.rs` that affects all callers — deferred.
   - The O(manifests × entries) scan in `existing_manifest` has a TODO comment; 
optimizing with manifest-level statistics (partition bounds, file counts) is 
tracked separately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to