amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2847744008


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1065,9 +1062,34 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
     }
 
     if (cachedNewDeleteManifests.isEmpty()) {
+      for (Map.Entry<String, List<DeleteFile>> entry : 
dvsByReferencedFile.entrySet()) {
+        if (entry.getValue().size() > 1) {
+          LOG.warn(
+              "Attempted to commit {} duplicate DVs for data file {} in table 
{}. "
+                  + "Merging duplicates, and original DVs will be orphaned.",
+              entry.getValue().size(),
+              entry.getKey(),
+              tableName);
+        }
+      }
+
+      List<DeleteFile> finalDVs =
+          DVUtil.mergeAndWriteDvsIfRequired(
+              dvsByReferencedFile,
+              ThreadPools.getDeleteWorkerPool(),
+              ops().locationProvider(),
+              ops().encryption(),
+              ops().io(),
+              ops().current().specsById());
+      // Prevent commiting duplicate V2 deletes by deduping them
+      Map<Integer, List<DeleteFile>> newDeleteFilesBySpec =
+          Streams.stream(Iterables.concat(finalDVs, 
DeleteFileSet.of(positionAndEqualityDeletes)))

Review Comment:
   It would be a behavior change if we don't use a DeleteFileSet because 
something like
   
`table.newRowDelta().addDelete(POS_DELETE_A).addDelete(POS_DELETE_A).commit()` 
would end up adding POS_DELETE_A twice to metadata.
   
    I think the original rationale for the DeleteFileSet was that it's similar 
in principle to preventing duplicate data file entries in the same commit; the 
implementation can be robust to handling that. I do think this should be 
revisited in a 2.0 release? The churn around semantics of delete file equality 
does make this difficult to maintain over time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to