amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2687370868
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();
Review Comment:
It certainly is simpler that way, but I think it's more than just the data
file name: if we have newDVRefs be a map tracking all DVs for a data file,
then we'd need to track a DeleteFileSet per data file, and we already track
those deletes as part of the regular tracking. So the average-case memory for
delete files goes from O(delete files) to O(2 * delete files). Unless we
further split the delete file tracking? But that complicates things in
another way.
Since delete files are already pretty small, I think you're right that it's
not so bad, but I'm a bit hesitant to incur that cost in the average case,
where we don't really expect duplicates to begin with.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]