rdblue commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2688017581


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends 
SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = 
Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = 
Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = 
Maps.newHashMap();

Review Comment:
   I agree with Russell here. I think it would be better to keep track of the 
DVs in a different way.
   
   Now that there may be duplicate DVs that get rewritten, we can no longer 
trust `newDeleteFilesBySpec` because it could contain DVs that will be replaced 
with compacted ones. I think what I would do is a bit simpler: I would keep a 
multimap of `DeleteFile` by referenced data file (location). Then in 
`newDeleteFilesAsManifests`, you would loop through each referenced data file 
and process the `DeleteFile` entries. If there is more than one then you'd 
create a new DV and write its metadata to a new file and then write the new DV 
metadata to a manifest writer (by spec ID). If it's important to dedup before 
compacting, then you could keep a single `DeleteFileSet` and clear it before 
processing each referenced file's deletes.
   
   I think that's slightly more reliable in a couple ways. For instance, it 
would catch DVs with different (but possibly equivalent) specs and it would 
also catch cases where you have a v2 delete file and a DV, which should also be 
compacted. It's also simpler to keep state in just one place.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to