amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2687370868


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();

Review Comment:
   It certainly is simpler that way, but I think it's more than just the data file name: if we make newDVRefs a map that tracks all DVs for a data file, we'd need a DeleteFileSet per data file, and we already track those deletes as part of the regular tracking. The average-case memory for delete files then goes from O(delete files) to O(2 * delete files). Unless we further split the delete file tracking? But that complicates things in another way.
   
   Since delete files are already pretty small, I think you're right that it's not so bad, but I'm a bit hesitant to incur that cost for the average case, where we don't really expect duplicates to begin with.


