Re: [PR] Core: Track duplicate DVs for data file and merge them before committing [iceberg]

via GitHub Thu, 15 Jan 2026 11:03:47 -0800


amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2695596281



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1073,6 +1088,139 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
     return cachedNewDeleteManifests;
   }
 
+  // Merge duplicates, internally takes care of updating newDeleteFilesBySpec to remove
+  // duplicates and add the newly merged DV
+  private void mergeDVsAndWrite() {
+    Map<Integer, List<MergedDVContent>> mergedIndicesBySpec = Maps.newConcurrentMap();
+
+    Tasks.foreach(dvsByReferencedFile.entrySet())
+        .executeWith(ThreadPools.getDeleteWorkerPool())
+        .stopOnFailure()
+        .throwFailureWhenFinished()
+        .run(
+            entry -> {
+              String referencedLocation = entry.getKey();
+              DeleteFileSet dvsToMerge = entry.getValue();
+              // Nothing to merge
+              if (dvsToMerge.size() < 2) {
+                return;
+              }
+
+              MergedDVContent merged = mergePositions(referencedLocation, dvsToMerge);
+
+              mergedIndicesBySpec
+                  .computeIfAbsent(
+                      merged.specId, spec -> Collections.synchronizedList(Lists.newArrayList()))
+                  .add(merged);
+            });
+
+    // Update newDeleteFilesBySpec to remove all the duplicates
+    mergedIndicesBySpec.forEach(
+        (specId, mergedDVContent) -> {
+          mergedDVContent.stream()
+              .map(content -> content.mergedDVs)
+              .forEach(duplicateDVs -> newDeleteFilesBySpec.get(specId).removeAll(duplicateDVs));
+        });
+
+    writeMergedDVs(mergedIndicesBySpec);
+  }
+
+  // Produces a Puffin per partition spec containing the merged DVs for that spec
+  private void writeMergedDVs(Map<Integer, List<MergedDVContent>> mergedDVContentBySpec) {
+    Map<Integer, DeleteFileSet> mergedDVsBySpec = Maps.newHashMap();
+
+    mergedDVContentBySpec.forEach(
+        (specId, mergedDVsForSpec) -> {
+          try (DVFileWriter dvFileWriter =
+              new BaseDVFileWriter(
+                  OutputFileFactory.builderFor(ops(), spec(specId), FileFormat.PUFFIN, 1, 1)
+                      .build(),
+                  path -> null)) {
+
+            for (MergedDVContent mergedDV : mergedDVsForSpec) {
+              LOG.warn(
+                  "Merged {} duplicate deletion vectors for data file {} in table {}. The merged DVs are orphaned, and writers should merge DVs per file before committing",

Review Comment:
   Actually, more than the I/O, I'm a little unsure that we can even delete the
duplicates from storage without doing a mini reachability analysis of the
Puffin files.
   
   E.g., the writer may have produced one Puffin file, puffinFile1, with DV1,
DV2, DV3, and another, puffinFile2, with DV4, DV2-Duplicate, DV5. Even after we
merge, we can't just delete puffinFile1 and puffinFile2 in their entirety. We'd
have to rewrite the whole of puffinFile1 with DV2 removed, and the whole of
puffinFile2 with DV2-Duplicate removed.
   
   Not to mention, while this doesn't happen in practice, there could be other 
blob types in the Puffin file that need to be preserved.
   
   Again, it's something we could do, but it's a lot more complexity on the
commit path, and it feels better to defer the work here.
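   
   To make the reachability concern concrete, here is a rough sketch of the
kind of check it would take. It assumes a simplified model where each Puffin
file is just a list of blob identifiers. `PuffinReachability`, `Action`, and
`classify` are hypothetical names for illustration, not part of the Iceberg
API or this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Iceberg: decides what would have to happen
// to each Puffin file once some of its DV blobs are superseded by a merged DV.
public class PuffinReachability {
  public enum Action {
    KEEP, // no blob in the file was merged away
    REWRITE, // some blobs are still live, so the file cannot simply be deleted
    DELETE // every blob in the file was merged away
  }

  // blobsByPuffin: Puffin file location -> identifiers of the blobs it holds
  //   (assumed model; non-DV blob types are just identifiers that are never merged away).
  // mergedAway: identifiers of DV blobs replaced by a merged DV.
  public static Map<String, Action> classify(
      Map<String, List<String>> blobsByPuffin, Set<String> mergedAway) {
    Map<String, Action> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : blobsByPuffin.entrySet()) {
      List<String> blobs = entry.getValue();
      long live = blobs.stream().filter(blob -> !mergedAway.contains(blob)).count();
      if (live == blobs.size()) {
        result.put(entry.getKey(), Action.KEEP);
      } else if (live == 0) {
        result.put(entry.getKey(), Action.DELETE);
      } else {
        result.put(entry.getKey(), Action.REWRITE);
      }
    }
    return result;
  }
}
```

   With the two files from the example above and DV2 and DV2-Duplicate merged
away, both files would come back as REWRITE, which is the extra commit-path
work I'd rather defer.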



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
