amogh-jahagirdar commented on code in PR #5981:
URL: https://github.com/apache/iceberg/pull/5981#discussion_r996389820
##########
core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java:
##########
@@ -84,19 +76,65 @@ public void cleanFiles(TableMetadata beforeExpiration,
TableMetadata afterExpira
deleteFiles(manifestListsToDelete, "manifest list");
}
- private Set<ManifestFile> readManifests(Set<Snapshot> snapshots) {
- Set<ManifestFile> manifestFiles = Sets.newHashSet();
- for (Snapshot snapshot : snapshots) {
- try (CloseableIterable<ManifestFile> manifestFilesForSnapshot =
readManifestFiles(snapshot)) {
- for (ManifestFile manifestFile : manifestFilesForSnapshot) {
- manifestFiles.add(manifestFile.copy());
- }
- } catch (IOException e) {
- throw new RuntimeIOException(
- e, "Failed to close manifest list: %s",
snapshot.manifestListLocation());
- }
+ private Set<ManifestFile> currentManifests(
+ Set<Snapshot> snapshots, Set<ManifestFile> manifestsToDelete) {
+ Set<ManifestFile> currentManifests = ConcurrentHashMap.newKeySet();
+ if (manifestsToDelete.isEmpty()) {
+ return currentManifests;
Review Comment:
I think this is ultimately a naming problem, I struggled to find a concise
name since it's really doing multiple things:
1.) The function is reading the current manifests and pruning out the
candidate manifests to delete at the same time. The reason it is done this way
is ultimately for this procedure we are optimizing for memory usage, and the
current manifests are expected to be a larger set on average compared to the
manifests to delete.
2.) Following 1. what this means is the procedure should aim to minimize the
amount of the current table state to read. The procedure can stop reading
manifests in the current table state if all the candidate manifests have been
pruned.
>Generally speaking, that's not true to say that if there are no manifests
to delete, there are no current manifests.
3.) You're definitely correct here but from a "deletion correctness"
standpoint this state is still good because if there are no manifests to delete
this means that there are no data files we can actually delete and subsequent
findFilesToDelete operations are guaranteed to be no-ops.
Let me know if there's still any concerns on correctness, right now the
TestRemoveSnapshots goes through both strategies but I'll think if there's any
cases missing . Also definitely open to better names here, and I'll also look
into structuring/commenting the code so it's more clear. I think inline
comments can probably go a long way here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]