amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2695596281
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1073,6 +1088,139 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
return cachedNewDeleteManifests;
}
+ // Merge duplicates, internally takes care of updating newDeleteFilesBySpec to remove
+ // duplicates and add the newly merged DV
+ private void mergeDVsAndWrite() {
+ Map<Integer, List<MergedDVContent>> mergedIndicesBySpec = Maps.newConcurrentMap();
+
+ Tasks.foreach(dvsByReferencedFile.entrySet())
+ .executeWith(ThreadPools.getDeleteWorkerPool())
+ .stopOnFailure()
+ .throwFailureWhenFinished()
+ .run(
+ entry -> {
+ String referencedLocation = entry.getKey();
+ DeleteFileSet dvsToMerge = entry.getValue();
+ // Nothing to merge
+ if (dvsToMerge.size() < 2) {
+ return;
+ }
+
+ MergedDVContent merged = mergePositions(referencedLocation, dvsToMerge);
+
+ mergedIndicesBySpec
+ .computeIfAbsent(
+ merged.specId, spec -> Collections.synchronizedList(Lists.newArrayList()))
+ .add(merged);
+ });
+
+ // Update newDeleteFilesBySpec to remove all the duplicates
+ mergedIndicesBySpec.forEach(
+ (specId, mergedDVContent) -> {
+ mergedDVContent.stream()
+ .map(content -> content.mergedDVs)
+ .forEach(duplicateDVs -> newDeleteFilesBySpec.get(specId).removeAll(duplicateDVs));
+ });
+
+ writeMergedDVs(mergedIndicesBySpec);
+ }
+
+ // Produces a Puffin per partition spec containing the merged DVs for that spec
+ private void writeMergedDVs(Map<Integer, List<MergedDVContent>> mergedDVContentBySpec) {
+ Map<Integer, DeleteFileSet> mergedDVsBySpec = Maps.newHashMap();
+
+ mergedDVContentBySpec.forEach(
+ (specId, mergedDVsForSpec) -> {
+ try (DVFileWriter dvFileWriter =
+ new BaseDVFileWriter(
+ OutputFileFactory.builderFor(ops(), spec(specId), FileFormat.PUFFIN, 1, 1)
+ .build(),
+ path -> null)) {
+
+ for (MergedDVContent mergedDV : mergedDVsForSpec) {
+ LOG.warn(
+ "Merged {} duplicate deletion vectors for data file {} in table {}. The merged DVs are orphaned, and writers should merge DVs per file before committing",
Review Comment:
Actually, more than the I/O, I'm a little unsure that we can even delete the
duplicates from storage without doing a mini reachability analysis of the
Puffin files.
For example, the writer may have produced one Puffin file, puffinFile1, with
DV1, DV2, DV3, and another, puffinFile2, with DV4, DV2-Duplicate, DV5. Even
after we merge, we can't just delete puffinFile1 and puffinFile2 in their
entirety. We'd have to rewrite all of puffinFile1 with DV2 removed, and
rewrite all of puffinFile2 with DV2-Duplicate removed.
Not to mention, while this doesn't happen in practice, there could be other
blob types in the Puffin file that need to be preserved.
Again, it's something we could do, but it adds a lot more complexity on the
commit path, and it feels better to defer the work here.
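To make the reachability concern concrete, here is a rough sketch of the check we'd need before touching storage. It is not Iceberg code: `PuffinReachabilitySketch`, `Action`, and `classify` are hypothetical names, and Puffin files and blobs are modeled as plain strings. A file is only safe to delete outright when every blob in it is a merged duplicate; otherwise it has to be rewritten.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: none of these names exist in Iceberg.
public class PuffinReachabilitySketch {

  enum Action { KEEP, REWRITE, DELETE }

  // For each Puffin file, decide what removing the merged duplicates would require.
  static Map<String, Action> classify(
      Map<String, List<String>> blobsByPuffin, Set<String> mergedBlobs) {
    Map<String, Action> result = new LinkedHashMap<>();
    blobsByPuffin.forEach(
        (puffin, blobs) -> {
          long stale = blobs.stream().filter(mergedBlobs::contains).count();
          if (stale == 0) {
            // No merged duplicates in this file, leave it alone
            result.put(puffin, Action.KEEP);
          } else if (stale == blobs.size()) {
            // Every blob was merged away, so the whole file is unreachable
            result.put(puffin, Action.DELETE);
          } else {
            // Live blobs remain, so the file would have to be rewritten without the stale ones
            result.put(puffin, Action.REWRITE);
          }
        });
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> blobsByPuffin = new LinkedHashMap<>();
    blobsByPuffin.put("puffinFile1", List.of("DV1", "DV2", "DV3"));
    blobsByPuffin.put("puffinFile2", List.of("DV4", "DV2-Duplicate", "DV5"));

    Set<String> merged = Set.of("DV2", "DV2-Duplicate");
    // Both files still hold live DVs, so neither can simply be deleted
    System.out.println(classify(blobsByPuffin, merged));
  }
}
```

With the example above, both files come back as `REWRITE`, which is the extra commit-path work I'd rather defer.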
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]