amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2687370868
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();
Review Comment:
   It certainly is simpler that way, but I think it's more than just the data
file name: if we make newDVRefs a map tracking all DVs for a data file, then
we'd need to track a DeleteFileSet per data file (which we expect to hold one
entry in most cases), and we already track the deletes as part of the regular
tracking. So the average-case memory for delete files goes from
O(delete files) to O(2 * delete files). Unless we further split the delete
file tracking, i.e. separate the tracking for DVs from the tracking for all
other delete files? But that complicates things in another way.
   Since delete files are already pretty small, I think you're right that it's
not so bad, but I'm a bit hesitant to incur that cost for the average case,
where we don't really expect duplicates to begin with.
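   To make the trade-off concrete, here's a minimal sketch of the two tracking
shapes being compared. It uses plain JDK collections instead of Iceberg's
DeleteFileSet, and the names (trackOnlyDuplicates, trackAllDVs,
firstDVForDataFile, allDVsForDataFile) are hypothetical, not the actual fields
in MergingSnapshotProducer:

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;

   class DVTrackingSketch {
     // Shape A: remember one DV location per data file (O(delete files), same as the
     // regular delete tracking), and only allocate a per-file set once a duplicate
     // DV actually shows up for that data file.
     private final Map<String, String> firstDVForDataFile = new HashMap<>();
     private final Map<String, Set<String>> duplicateDVsForDataFile = new HashMap<>();

     void trackOnlyDuplicates(String dataFileLocation, String dvLocation) {
       String existing = firstDVForDataFile.putIfAbsent(dataFileLocation, dvLocation);
       if (existing != null) {
         // a DV already exists for this data file; only now pay the per-file set cost
         duplicateDVsForDataFile
             .computeIfAbsent(dataFileLocation, k -> new HashSet<>())
             .add(dvLocation);
       }
     }

     // Shape B: keep every DV per data file in a set, even in the common single-DV
     // case. Since the deletes are also kept by the regular tracking, this pushes the
     // average case toward O(2 * delete files).
     private final Map<String, Set<String>> allDVsForDataFile = new HashMap<>();

     void trackAllDVs(String dataFileLocation, String dvLocation) {
       allDVsForDataFile
           .computeIfAbsent(dataFileLocation, k -> new HashSet<>())
           .add(dvLocation);
     }
   }
   ```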