nastra commented on code in PR #14859:
URL: https://github.com/apache/iceberg/pull/14859#discussion_r2787471362


##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java:
##########
@@ -565,6 +565,89 @@ public void testPositionDeletesAcrossFiles() throws 
Exception {
         .isEmpty();
   }
 
+  /**
+   * Test for https://github.com/apache/iceberg/issues/14814
+   *
+   * <p>This test verifies that rewrite_table_path correctly deduplicates 
delete files when the same
+   * delete file appears in multiple manifests. Without the DeleteFileSet fix, 
this test would fail
+   * with AlreadyExistsException because DeleteFile objects don't override 
equals() and the same
+   * file would be processed multiple times.
+   *
+   * <p>The test creates a scenario where the same delete file is added to 
multiple snapshots,
+   * causing it to appear in multiple manifest entries. When these manifests 
are processed, the same
+   * delete file is returned as different object instances which need proper 
deduplication.
+   */
+  @TestTemplate
+  public void testPositionDeletesDeduplication() throws Exception {
+    // Format versions 3 and 4 use Deletion Vectors stored in Puffin files, 
which have different
+    // validation rules that prevent adding the same delete file multiple times
+    assumeThat(formatVersion)
+        .as("Format versions 3+ use DVs with different validation rules")
+        .isEqualTo(2);
+
+    Table tableWithPosDeletes =
+        createTableWithSnapshots(
+            
tableDir.toFile().toURI().toString().concat("tableWithDuplicateDeletes"),
+            2,
+            Map.of(TableProperties.DELETE_DEFAULT_FILE_FORMAT, "parquet"));
+
+    // Get a data file to create position deletes for
+    DataFile dataFile =
+        tableWithPosDeletes
+            .currentSnapshot()
+            .addedDataFiles(tableWithPosDeletes.io())
+            .iterator()
+            .next();
+
+    // Create a position delete file
+    List<Pair<CharSequence, Long>> deletes = 
Lists.newArrayList(Pair.of(dataFile.location(), 0L));
+    File deleteFile =
+        new File(
+            removePrefix(tableWithPosDeletes.location() + 
"/data/deeply/nested/deletes.parquet"));
+    DeleteFile positionDeletes =
+        FileHelpers.writeDeleteFile(
+                tableWithPosDeletes,
+                
tableWithPosDeletes.io().newOutputFile(deleteFile.toURI().toString()),
+                deletes,
+                formatVersion)
+            .first();
+
+    // Add the SAME delete file in the first snapshot
+    tableWithPosDeletes.newRowDelta().addDeletes(positionDeletes).commit();
+    long snapshot1 = tableWithPosDeletes.currentSnapshot().snapshotId();
+
+    // Add the SAME delete file AGAIN in a second snapshot - this creates a 
duplicate entry
+    // in a new manifest, which will cause duplicate DeleteFile objects when 
processing
+    tableWithPosDeletes.newRowDelta().addDeletes(positionDeletes).commit();
+    long snapshot2 = tableWithPosDeletes.currentSnapshot().snapshotId();
+
+    // Create tags to ensure both snapshots are processed
+    tableWithPosDeletes.manageSnapshots().createTag("tag1", 
snapshot1).commit();
+    tableWithPosDeletes.manageSnapshots().createTag("tag2", 
snapshot2).commit();
+
+    // This should NOT throw AlreadyExistsException - the fix uses 
DeleteFileSet to deduplicate
+    // Without the fix (using Collectors.toSet()), this would fail because:
+    // 1. Both manifests contain entries for the same delete file
+    // 2. Processing returns two different DeleteFile objects for the same file
+    // 3. HashSet doesn't deduplicate them (DeleteFile doesn't override 
equals())
+    // 4. rewritePositionDeletes tries to write the same file twice -> 
AlreadyExistsException
+    RewriteTablePath.Result result =
+        actions()
+            .rewriteTablePath(tableWithPosDeletes)
+            .stagingLocation(stagingLocation())
+            .rewriteLocationPrefix(tableWithPosDeletes.location(), 
targetTableLocation())
+            .execute();
+
+    // Verify the rewrite completed successfully - should have rewritten 
exactly 1 delete file
+    // (the duplicate should be deduplicated by DeleteFileSet)
+    assertThat(result.rewrittenDeleteFilePathsCount())
+        .as("Should have rewritten exactly 1 delete file after deduplication")
+        .isEqualTo(1);
+
+    // Copy the metadata files and data files
+    copyTableFiles(result);

Review Comment:
   ```suggestion
       tableWithPosDeletes.newRowDelta().addDeletes(positionDeletes).commit();
   
       // Add the SAME delete file AGAIN in a second snapshot - this creates a 
duplicate entry
       // in a new manifest, which will cause duplicate DeleteFile objects when 
processing
       tableWithPosDeletes.newRowDelta().addDeletes(positionDeletes).commit();
   
       // This should NOT throw AlreadyExistsException - the fix uses 
DeleteFileSet to deduplicate
       // Without the fix (using Collectors.toSet()), this would fail because:
       // 1. Both manifests contain entries for the same delete file
       // 2. Processing returns two different DeleteFile objects for the same 
file
       // 3. HashSet doesn't deduplicate them (DeleteFile doesn't override 
equals())
       // 4. rewritePositionDeletes tries to write the same file twice -> 
AlreadyExistsException
       RewriteTablePath.Result result =
           actions()
               .rewriteTablePath(tableWithPosDeletes)
               .stagingLocation(stagingLocation())
               .rewriteLocationPrefix(tableWithPosDeletes.location(), 
targetTableLocation())
               .execute();
   
       // Verify the rewrite completed successfully - should have rewritten 
exactly 1 delete file
       // (the duplicate should be deduplicated by DeleteFileSet)
       assertThat(result.rewrittenDeleteFilePathsCount())
           .as("Should have rewritten exactly 1 delete file after 
deduplication")
           .isEqualTo(1);
   ```
   
   this removes any unnecessary changes that aren't really required to 
reproduce the original issue



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to