kbendick commented on code in PR #5981:
URL: https://github.com/apache/iceberg/pull/5981#discussion_r996382670
##########
core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java:
##########
@@ -84,19 +76,65 @@ public void cleanFiles(TableMetadata beforeExpiration,
TableMetadata afterExpira
deleteFiles(manifestListsToDelete, "manifest list");
}
- private Set<ManifestFile> readManifests(Set<Snapshot> snapshots) {
- Set<ManifestFile> manifestFiles = Sets.newHashSet();
- for (Snapshot snapshot : snapshots) {
- try (CloseableIterable<ManifestFile> manifestFilesForSnapshot =
readManifestFiles(snapshot)) {
- for (ManifestFile manifestFile : manifestFilesForSnapshot) {
- manifestFiles.add(manifestFile.copy());
- }
- } catch (IOException e) {
- throw new RuntimeIOException(
- e, "Failed to close manifest list: %s",
snapshot.manifestListLocation());
- }
+ private Set<ManifestFile> currentManifests(
+ Set<Snapshot> snapshots, Set<ManifestFile> manifestsToDelete) {
+ Set<ManifestFile> currentManifests = ConcurrentHashMap.newKeySet();
+ if (manifestsToDelete.isEmpty()) {
+ return currentManifests;
}
+ Tasks.foreach(snapshots)
+ .retry(3)
+ .stopOnFailure()
+ .throwFailureWhenFinished()
+ .executeWith(planExecutorService)
+ .onFailure(
+ (snapshot, exc) ->
+ LOG.warn(
+ "Failed to determine manifests for snapshot {}",
snapshot.snapshotId(), exc))
+ .run(
+ snapshot -> {
+ try (CloseableIterable<ManifestFile> manifestFiles =
readManifests(snapshot)) {
+ for (ManifestFile manifestFile : manifestFiles) {
+ manifestsToDelete.remove(manifestFile);
+ if (manifestsToDelete.isEmpty()) {
+ return;
+ }
+
+ currentManifests.add(manifestFile.copy());
Review Comment:
This seems potentially incorrect to me. It seems like we could get different
results in `currentManifests` based on whether or not there are more manifests
left to delete.
Specifically, if some manifest file M is the last remaining manifest to
delete as the others were considered in previous snapshots, then it _will not_
be included in `currentManifests`.
However, if the snapshots are evaluated in an ordering such that M is
considered while there are still manigests left to delete, then M _will be_
included in `currentManifests`.
Do I have that correct or am I misunderstanding what this code is doing?
Any behavior that depends on ordering is somewhat suspicious to me, though I
was admittedly not involved in the previous PR review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]