openinx commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r679657531



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
         .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
         .apply();
   }
+
+  @Test
+  public void testRewriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 
1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), 
Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+    // rewrite FILE_A as FILE_A2
+    table.newRewrite()
+        .validateFromSnapshot(table.currentSnapshot().snapshotId())
+        .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+        .commit();
+
+    AssertHelpers.assertThrows("Should fail because a referenced file was 
rewritten",
+        ValidationException.class, "Cannot commit, missing data files",
+        () -> table.newRewrite()
+            .validateFromSnapshot(snapshotBeforeRewriteFileA)
+            .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+            .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), 
Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+            .apply());
+  }
+
+  @Test
+  public void testOverwriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 
1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), 
Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+    // overwrite FILE_A with FILE_A2
+    table.newOverwrite()
+        .deleteFile(FILE_A)
+        .addFile(FILE_A2)
+        .commit();
+
+    // the rewrite succeeds because the overwrite is required to read FILE_A 
correctly
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), 
Sets.newSet(), Sets.newSet(FILE_A_DELETES))

Review comment:
       Reconsidered the 
[comment](https://github.com/apache/iceberg/pull/2865#discussion_r677076862),  
I agree that conflicts between REPLACE and OVERWRITE/INSERT (say conflict-1) is 
another different story compared to conflicts between REPLACE and REPLACE (say 
conflict-2).
   
   For this case `The REPLACE operation remove the data files that was relied 
by other committed APPEND/OVERWRITE/DELTE operations`,  both conflict-1 and 
conflict-2 should be avoided because both of them will lead to incorrect data 
set.
   
   For the other case `The APPEND/OVERWRITE/DELETE operations removed the data 
files that was relied by other committing REPLACE operation`,   conflict-1 
won't lead to incorrect data set although there will be some remaining dangling 
positional deletes (as you said in this 
[comment](https://github.com/apache/iceberg/pull/2865#discussion_r679563193)).  
but it's possible to lead to incorrect data set when considering conflict-2: 
   
   Step.1  :  Table has FILE_A and EQ_DELETE_FILE_A;
   Step.2  :  RewriteAction_1  rewrite the FILE_A to another FILE_B - not 
commit; 
   Step.3  :  RewriteAction_2 rewrite the EQ_DELETE_FILE_A to POS_DELETE_FILE_C 
which reference to FILE_A  - not commit. 
   Step.4. :   Committed RewriteAction_1 ; 
   Step.5  :   Committed RewriteAction_2.
   
   In the end, the POS_DELETE_FILE_C won't be able to apply to the newly 
rewritten FILE_B, which create the incorrect data set.  Using older sequence 
number for RewriteAction also cannot fix this bug.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to