krisnaru commented on code in PR #15975:
URL: https://github.com/apache/iceberg/pull/15975#discussion_r3151704244


##########
core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java:
##########
@@ -431,8 +434,8 @@ private static RewriteResult<DataFile> writeDataFileEntry(
     appendEntryWithFile(entry, writer, newDataFile);
     // keep the following entries in metadata but exclude them from copyPlan
     // 1) deleted data files
-    // 2) entries not changed by snapshotIds
-    if (entry.isLive() && snapshotIds.contains(entry.snapshotId())) {
+    // 2) entries not changed by snapshotIds (incremental copy only)
+    if (entry.isLive() && (snapshotIds == null || 
snapshotIds.contains(entry.snapshotId()))) {

Review Comment:
   The fix for full copies (passing `null` snapshotIds) looks correct, but I 
believe the same class of bug persists for **incremental copies** when 
`RewriteManifests` runs  between replications.                                  
                                                                                
                                                                                
                                                                                
                       
                                                                                
                                                                                
                                                    
     **Scenario:**                                                              
                                                                                
                                                    
     1. Incremental replication at T1 captures `startVersion`. Source has `{S1, 
S2}`.                                                                           
                                                  
     2. Between T1 and T2: append → `S3`(file_C), `RewriteManifests` → `S4`, 
expire `S3`.                                                                    
                                                       
     3. Incremental at T2:                                                      
                                                                                
                                                    
        - `deltaSnapshotIds = {S4}` (S3 expired, not in 
`endMetadata.snapshots()`)                                                      
                                                                            
        - S4's rewritten manifests are selected for processing ✓                
                                                                                
                                                    
        - But `entry.snapshotId() = S3` for `file_C`, and `{S4}.contains(S3) == 
false`                                                                          
                                                    
        - `file_C` is excluded from `copyPlan` — **live data silently lost**    
                                                                                
                                                    
                                                                                
                                                                                
                                                    
     The root cause: `deltaSnapshotIds` is derived from 
`endMetadata.snapshots()` in `RewriteTablePathSparkAction`), which excludes 
expired snapshots. When                                                         
                                                                     
     `RewriteManifests` reorganizes entries from expired snapshots into new 
manifests, the                                                                  
                                                         manifest *is* 
processed but the entry-level filter drops files whose adding                   
                                                                                
                    snapshot was expired.                                       
                                                                                
                                                                   
                                                                                
                                                                                
                                                    
   Should the incremental case also account for expired snapshot IDs — perhaps 
by                                                                              
                                                   
     collecting all snapshot IDs that existed between start and end versions 
from the metadata                                                               
                                                          log, or by using a 
different strategy (e.g., comparing against already-replicated file             
                                                                                
                            
     sets)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to