krisnaru commented on code in PR #15975: URL: https://github.com/apache/iceberg/pull/15975#discussion_r3151704244
##########
core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java:
##########
@@ -431,8 +434,8 @@ private static RewriteResult<DataFile> writeDataFileEntry(
appendEntryWithFile(entry, writer, newDataFile);
// keep the following entries in metadata but exclude them from copyPlan
// 1) deleted data files
- // 2) entries not changed by snapshotIds
- if (entry.isLive() && snapshotIds.contains(entry.snapshotId())) {
+ // 2) entries not changed by snapshotIds (incremental copy only)
+ if (entry.isLive() && (snapshotIds == null ||
snapshotIds.contains(entry.snapshotId()))) {
Review Comment:
The fix for full copies (passing `null` snapshotIds) looks correct, but I
believe the same class of bug persists for **incremental copies** when
`RewriteManifests` runs between replications.
**Scenario:**
1. Incremental replication at T1 captures `startVersion`. Source has `{S1,
S2}`.
2. Between T1 and T2: append → `S3`(file_C), `RewriteManifests` → `S4`,
expire `S3`.
3. Incremental at T2:
- `deltaSnapshotIds = {S4}` (S3 expired, not in
`endMetadata.snapshots()`)
- S4's rewritten manifests are selected for processing ✓
- But `entry.snapshotId() = S3` for `file_C`, and `{S4}.contains(S3) ==
false`
- `file_C` is excluded from `copyPlan` — **live data silently lost**
The root cause: `deltaSnapshotIds` is derived from
`endMetadata.snapshots()` in `RewriteTablePathSparkAction`), which excludes
expired snapshots. When
`RewriteManifests` reorganizes entries from expired snapshots into new
manifests, the
manifest *is*
processed but the entry-level filter drops files whose adding
snapshot was expired.
Should the incremental case also account for expired snapshot IDs — perhaps
by
collecting all snapshot IDs that existed between start and end versions
from the metadata
log, or by using a
different strategy (e.g., comparing against already-replicated file
sets)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
