rdblue commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2719071791
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1060,9 +1072,12 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
}
if (cachedNewDeleteManifests.isEmpty()) {
+ mergeDVsAndWrite();
+
newDeleteFilesBySpec.forEach(
(specId, deleteFiles) -> {
PartitionSpec spec = ops().current().spec(specId);
+ deleteFiles.forEach(file -> addedFilesSummary.addedFile(spec,
file));
Review Comment:
I don't like that the `DeleteFile` instances are still kept in two places,
and that both places will deduplicate using sets but different logic once you
account for the map keys. I know that because we are not deduplicating v2
deletes, we need a place for them to be tracked, but that doesn't mean we need
to store the `DeleteFile` instances for DVs twice.
The reason why I don't like the double storage is that it doesn't handle
some strange cases. For instance, what if a `DeleteFile` is added for the same
DV but with two different (possibly equivalent, unpartitioned) specs? Then the
same file would be handled twice here, but would be deduplicated by the
`DeleteFileSet` for its referenced data file causing a metrics bug. While the
merge code checks that duplicates have the same spec ID, the `DeleteFileSet`
does not.
I think it would be cleaner to keep a list of v2 deletes and the multimap of
DVs and maintain them separately. This method should produce a new list of
merged DVs, then both lists (v2 deletes and merged DVs) should be written to
delete manifests by spec. It's easy enough to produce a filtered iterator, so I
don't think we are buying much by grouping by spec ID as files are added.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]