laskoviymishka commented on code in PR #873:
URL: https://github.com/apache/iceberg-go/pull/873#discussion_r3074357530
##########
table/metadata.go:
##########
@@ -505,6 +509,15 @@ func (b *MetadataBuilder) RemoveSnapshots(snapshotIds
[]int64, postCommit bool)
}
b.refs = newRefs
+ // Prune statistics entries whose snapshot was removed so that table
+ // metadata does not retain stale statistics references.
+ b.statisticsList = slices.DeleteFunc(b.statisticsList, func(e
StatisticsFile) bool {
+ return slices.Contains(snapshotIds, e.SnapshotID)
+ })
+ b.partitionStatsList = slices.DeleteFunc(b.partitionStatsList, func(e
PartitionStatisticsFile) bool {
+ return slices.Contains(snapshotIds, e.SnapshotID)
+ })
Review Comment:
Consistent with the existing snapshotList/snapshotLog pruning pattern above.
Same O(n*m) complexity. Fine for typical use.
##########
table/metadata.go:
##########
@@ -871,6 +884,8 @@ func (b *MetadataBuilder) buildCommonMetadata()
(*commonMetadata, error) {
SortOrderList: b.sortOrderList,
DefaultSortOrderID: b.defaultSortOrderID,
SnapshotRefs: b.refs,
+ StatisticsList: b.statisticsList,
+ PartitionStatsList: b.partitionStatsList,
Review Comment:
This is actually the most important line in the PR.
Before this change, every Build() call silently dropped all statistics from
the metadata, not just during RemoveSnapshots, but during ANY builder operation
(add snapshot, update schema, etc.). The PR description frames this as a
RemoveSnapshots bug, but it's a general metadata builder bug. Worth calling out
in the description.
##########
table/metadata.go:
##########
@@ -261,6 +263,8 @@ func MetadataBuilderFromBase(metadata Metadata,
currentFileLocation string) (*Me
b.refs = maps.Collect(metadata.Refs())
b.snapshotLog = slices.Collect(metadata.SnapshotLogs())
b.metadataLog = slices.Collect(metadata.PreviousFiles())
+ b.statisticsList = slices.Collect(metadata.Statistics())
+ b.partitionStatsList = slices.Collect(metadata.PartitionStatistics())
Review Comment:
Correct — Statistics() and PartitionStatistics() return iterators over the
base metadata's lists. slices.Collect materializes them into owned slices for
the builder. Consistent with how snapshotLog and metadataLog are loaded right
above.
##########
table/metadata_builder_internal_test.go:
##########
@@ -622,6 +624,76 @@ func TestRemoveSnapshotRemovesBranch(t *testing.T) {
}
}
+// TestRemoveSnapshotsPrunesStatistics verifies that RemoveSnapshots also
+// prunes StatisticsList and PartitionStatsList entries whose SnapshotID
+// matches a removed snapshot. See iceberg-go#836.
+func TestRemoveSnapshotsPrunesStatistics(t *testing.T) {
Review Comment:
The test is good — covers load, prune, and build round-trip.
One thing missing: a test that only does MetadataBuilderFromBase → Build()
(no removals) and verifies statistics survive.
That pins the buildCommonMetadata fix independently of RemoveSnapshots.
Right now, if someone reverts just the buildCommonMetadata lines but keeps
the RemoveSnapshots pruning, this test would still pass (the builder's internal
state is correct, but Build() would drop stats) — actually no, the test does
call Build() and checks the result.
So it does cover both. But a separate "no-op round-trip preserves stats"
test is clearer about what it'sasserting.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]