amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2687370868
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();
Review Comment:
   It certainly is simpler that way, but I think it's more than just the data
file name: if we make newDVRefs a map tracking all DVs for a data file, then
we'd need to track a DeleteFileSet per data file (which we expect to hold one
entry in most cases), and we already track the deletes as part of the regular
tracking. So the average-case memory for delete files goes from
O(delete files) to O(2 * delete files). Unless we further split the delete
file tracking, i.e. separate the tracking for DVs from the tracking for all
other delete files? But that complicates things in another way.
   Since delete files are already pretty small, I think you're right that it's
not so bad, but I'm a bit hesitant to incur that cost for the average case,
where we don't really expect duplicates to begin with.
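   To make the trade-off concrete, here's a minimal sketch of the two tracking
shapes being compared. It uses plain JDK collections instead of Iceberg's
DeleteFileSet, and the names (trackOnlyDuplicates, trackAllDVs,
firstDVForDataFile, allDVsForDataFile) are hypothetical, not the actual fields
in MergingSnapshotProducer:

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;

   class DVTrackingSketch {
     // Shape A: remember one DV location per data file (O(delete files), same as the
     // regular delete tracking), and only allocate a per-file set once a duplicate
     // DV actually shows up for that data file.
     private final Map<String, String> firstDVForDataFile = new HashMap<>();
     private final Map<String, Set<String>> duplicateDVsForDataFile = new HashMap<>();

     void trackOnlyDuplicates(String dataFileLocation, String dvLocation) {
       String existing = firstDVForDataFile.putIfAbsent(dataFileLocation, dvLocation);
       if (existing != null) {
         // a DV already exists for this data file; only now pay the per-file set cost
         duplicateDVsForDataFile
             .computeIfAbsent(dataFileLocation, k -> new HashSet<>())
             .add(dvLocation);
       }
     }

     // Shape B: keep every DV per data file in a set, even in the common single-DV
     // case. Since the deletes are also kept by the regular tracking, this pushes the
     // average case toward O(2 * delete files).
     private final Map<String, Set<String>> allDVsForDataFile = new HashMap<>();

     void trackAllDVs(String dataFileLocation, String dvLocation) {
       allDVsForDataFile
           .computeIfAbsent(dataFileLocation, k -> new HashSet<>())
           .add(dvLocation);
     }
   }
   ```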