wypoon commented on code in PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#discussion_r1727409290
##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -63,33 +60,43 @@ protected CloseableIterable<ChangelogScanTask> doPlanFiles(
      return CloseableIterable.empty();
    }

-    Set<Long> changelogSnapshotIds = toSnapshotIds(changelogSnapshots);
+    Map<Long, Integer> snapshotOrdinals = computeSnapshotOrdinals(changelogSnapshots);

-    Set<ManifestFile> newDataManifests =
-        FluentIterable.from(changelogSnapshots)
-            .transformAndConcat(snapshot -> snapshot.dataManifests(table().io()))
-            .filter(manifest -> changelogSnapshotIds.contains(manifest.snapshotId()))
-            .toSet();
-
-    ManifestGroup manifestGroup =
-        new ManifestGroup(table().io(), newDataManifests, ImmutableList.of())
-            .specsById(table().specs())
-            .caseSensitive(isCaseSensitive())
-            .select(scanColumns())
-            .filterData(filter())
-            .filterManifestEntries(entry -> changelogSnapshotIds.contains(entry.snapshotId()))
-            .ignoreExisting()
-            .columnsToKeepStats(columnsToKeepStats());
-
-    if (shouldIgnoreResiduals()) {
-      manifestGroup = manifestGroup.ignoreResiduals();
-    }
-
-    if (newDataManifests.size() > 1 && shouldPlanWithExecutor()) {
-      manifestGroup = manifestGroup.planWith(planExecutor());
-    }
+    // map of delete file to the snapshot where the delete file is added
+    // the delete file is keyed by its path, and the snapshot is represented by the snapshot ordinal
+    Map<String, Integer> deleteFileToSnapshotOrdinal =
+        computeDeleteFileToSnapshotOrdinal(changelogSnapshots, snapshotOrdinals);

-    return manifestGroup.plan(new CreateDataFileChangeTasks(changelogSnapshots));
+    Iterable<CloseableIterable<ChangelogScanTask>> plans =
+        FluentIterable.from(changelogSnapshots)
+            .transform(
+                snapshot -> {
+                  List<ManifestFile> dataManifests = snapshot.dataManifests(table().io());
+                  List<ManifestFile> deleteManifests = snapshot.deleteManifests(table().io());
+
+                  ManifestGroup manifestGroup =
+                      new ManifestGroup(table().io(), dataManifests, deleteManifests)
+                          .specsById(table().specs())
+                          .caseSensitive(isCaseSensitive())
+                          .select(scanColumns())
+                          .filterData(filter())
+                          .columnsToKeepStats(columnsToKeepStats());
+
+                  if (shouldIgnoreResiduals()) {
+                    manifestGroup = manifestGroup.ignoreResiduals();
+                  }
+
+                  if (dataManifests.size() > 1 && shouldPlanWithExecutor()) {
+                    manifestGroup = manifestGroup.planWith(planExecutor());
+                  }
+
+                  long snapshotId = snapshot.snapshotId();
+                  return manifestGroup.plan(
+                      new CreateDataFileChangeTasks(
+                          snapshotId, snapshotOrdinals, deleteFileToSnapshotOrdinal));
Review Comment:
Ah, in this case, (b) is the correct behavior. The changelog scan is an incremental scan over a range of snapshots and should emit the changes for each snapshot; this is the current behavior for the supported case, copy-on-write. What you are seeking are the net changes, which Spark also supports, built on top of the changelog scan using `ChangelogIterator.removeNetCarryovers`. That functionality is exposed in the Spark procedure `create_changelog_view` (and can, of course, also be used programmatically).
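
For illustration, here is a minimal sketch (not part of this PR) of getting net changes through that procedure. The catalog, table, view name, and snapshot IDs are hypothetical placeholders, and the parameter names (`table`, `options`, `net_changes`) are my reading of the Iceberg Spark docs, so treat them as assumptions:

```java
import org.apache.spark.sql.SparkSession;

public class NetChangesExample {
  public static void main(String[] args) {
    // Assumes a Spark session configured with the Iceberg runtime and
    // SQL extensions, and an Iceberg catalog named spark_catalog.
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Create a changelog view over a snapshot range; net_changes => true
    // collapses carryovers and intermediate changes into net changes
    // (ChangelogIterator.removeNetCarryovers under the hood).
    spark.sql(
        "CALL spark_catalog.system.create_changelog_view("
            + "table => 'db.tbl', "
            + "options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'), "
            + "net_changes => true)");

    // If I recall the default correctly, the view is named <table>_changes.
    spark.sql("SELECT * FROM tbl_changes ORDER BY _change_ordinal").show();
  }
}
```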