This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-go.git
The following commit(s) were added to refs/heads/main by this push:
new 350ae727 fix(table): restore Summary/ColumnSizes test expectations to
arrow-go v18.6.0 (#1114)
350ae727 is described below
commit 350ae7270d7c44578638ef55e7e0392227745e8a
Author: Matt Topol <[email protected]>
AuthorDate: Thu May 21 16:21:23 2026 -0400
fix(table): restore Summary/ColumnSizes test expectations to arrow-go
v18.6.0 (#1114)
Restores main CI to green by reverting only the test-expectation byte
counts that #1102 ("fix(manifest): stop writing deprecated
distinct_counts") inadvertently changed.
## What broke
After #1102 merged to main, both the `Go` and `Audit and Verify`
workflows started failing on every push, across all 6 matrix entries
(ubuntu/windows/macos × Go 1.25.5/1.26.1):
```
--- FAIL: TestTableWriting (10.16s)
--- FAIL: TestTableWriting/TestAddFilesPartitionedTable
--- FAIL: TestTableWriting/TestAddFilesUnpartitioned
--- FAIL: TestTableWriting/TestReplaceDataFiles
--- FAIL: TestTableWriting/TestAddFilesPartitionedTable#01
--- FAIL: TestTableWriting/TestAddFilesUnpartitioned#01
--- FAIL: TestTableWriting/TestReplaceDataFiles#01
--- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch (0.00s)
--- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/success
--- FAIL:
TestPositionDeletePartitionedFanoutWriterProcessBatch/batch_with_records_having_different_file_paths
```
with diffs like:
```
- "added-files-size": "3590" (actual on CI / arrow-go v18.6.0)
+ "added-files-size": "3070" (test source, post-#1102)
- ColumnSizes: 2147483545:88 (actual on CI / arrow-go v18.6.0)
+ ColumnSizes: 2147483545:86 (test source, post-#1102)
```
## Root cause
These two test files assert on **exact byte counts** of the on-disk
parquet files written by `arrow-go`. Those byte counts depend on the
parquet writer's metadata encoding, which differs across arrow-go
versions.
#1102 lowered the expected counts (3590→3070, 1066→963, 1816→2132+→1816,
4264→3687, 88→86, 174→172, 96→94, 187→185). Those new values only
reproduce against an unreleased arrow-go build (presumably wired in via
a local `go.work`). Against the pinned `github.com/apache/arrow-go/v18
v18.6.0` from `go.sum` — which every CI runner resolves — the writer
still emits the original larger sizes, so the assertions fail every
time.
The PR's own CI runs were `CANCELLED` before completion, so this didn't
surface at merge time.
The functional change in #1102 (dropping `distinct_counts` field-id 111
from the manifest entry Avro schema) only alters **manifest**
serialization, not the **data** parquet files whose sizes these
`Summary`/`ColumnSizes` fields tally — so the test-value updates were
always unrelated to the production fix and can be safely reverted on
their own.
## Fix
Revert only the eight byte-count lines back to the pre-#1102 values
verified via `git show 51b3140^`:
| File | Test | Field | Before | #1102 | This PR |
|---|---|---|---|---|---|
| `table/table_test.go:549,554` | `TestAddFilesUnpartitioned` |
`added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:770,776` | `TestAddFilesPartitionedTable` |
`added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:1136,1140,1144` | `TestReplaceDataFiles` |
`added-/removed-/total-files-size` | 1066/2132/4264 | 963/1816/3687 |
**1066/2132/4264** |
| `table/pos_delete_partitioned_fanout_writer_test.go:77` | `success` |
`ColumnSizes` | 88,174 | 86,172 | **88,174** |
| `table/pos_delete_partitioned_fanout_writer_test.go:87` |
`batch_with_records_having_different_file_paths` | `ColumnSizes` |
96,187 | 94,185 | **96,187** |
No production code changes — #1102's manifest fix is preserved intact.
## Verification
Reproduced CI failure locally with `GOWORK=off go test ./...` (forces
use of pinned v18.6.0):
- **Before this PR**: same 8 sub-test failures with identical
expected/actual diffs as CI.
- **After this PR**: `ok` across every package — `iceberg-go`,
`catalog/{glue,hadoop,hive,rest,sql}`, `cmd/iceberg`, `codec`, `config`,
`internal`, `io`, `io/gocloud`, `puffin`, `table`,
`table/{compaction,dv,internal,substrait}`, `view`, `view/internal`.
- `go vet ./...` clean, `go build ./...` clean, LSP diagnostics clean on
both touched files.
## Followup (out of scope)
These tests are inherently brittle — they'll keep breaking on every
arrow-go bump that nudges parquet metadata encoding. A future cleanup
could replace exact byte assertions with bounds (`> 0`, monotonic
relationships between added/removed/total) or assert on row counts only.
Not addressing here to keep the diff minimal and unblock main.
---
table/pos_delete_partitioned_fanout_writer_test.go | 4 ++--
table/table_test.go | 14 +++++++-------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/table/pos_delete_partitioned_fanout_writer_test.go
b/table/pos_delete_partitioned_fanout_writer_test.go
index 7b6e5284..83b6fc42 100644
--- a/table/pos_delete_partitioned_fanout_writer_test.go
+++ b/table/pos_delete_partitioned_fanout_writer_test.go
@@ -74,7 +74,7 @@ func TestPositionDeletePartitionedFanoutWriterProcessBatch(t
*testing.T) {
name: "success",
pathToPartitionContext:
map[string]partitionContext{"file://namespace/age_bucket=1/test.parquet":
{partitionData: map[int]any{iceberg.PartitionDataIDStart: 1}, specID: 0}},
input:
mustLoadRecordBatchFromJSON(PositionalDeleteArrowSchema, `[{"file_path":
"file://namespace/age_bucket=1/test.parquet", "pos": 100}]`),
- expectedDataFile: &mockDataFile{columnSizes:
map[int]int64{2147483545: 86, 2147483546: 172}, format: iceberg.ParquetFile,
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 1, specid: 0,
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
+ expectedDataFile: &mockDataFile{columnSizes:
map[int]int64{2147483545: 88, 2147483546: 174}, format: iceberg.ParquetFile,
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 1, specid: 0,
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
},
// This test case illustrates how the
positionDeletePartitionedFanoutWriter does not validate that all records
// in a batch have the same file path. Doing so would be
prohibitive in the current implementation and
@@ -84,7 +84,7 @@ func TestPositionDeletePartitionedFanoutWriterProcessBatch(t
*testing.T) {
name: "batch with records having
different file paths",
pathToPartitionContext:
map[string]partitionContext{"file://namespace/age_bucket=1/test.parquet":
{partitionData: map[int]any{iceberg.PartitionDataIDStart: 1}, specID: 0}},
input:
mustLoadRecordBatchFromJSON(PositionalDeleteArrowSchema, `[{"file_path":
"file://namespace/age_bucket=1/test.parquet", "pos": 100}, {"file_path":
"file://namespace/age_bucket=0/test.parquet", "pos": 10}]`),
- expectedDataFile: &mockDataFile{columnSizes:
map[int]int64{2147483545: 94, 2147483546: 185}, format: iceberg.ParquetFile,
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 2, specid: 0,
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
+ expectedDataFile: &mockDataFile{columnSizes:
map[int]int64{2147483545: 96, 2147483546: 187}, format: iceberg.ParquetFile,
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 2, specid: 0,
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
},
}
diff --git a/table/table_test.go b/table/table_test.go
index adf8c052..17c5bd16 100644
--- a/table/table_test.go
+++ b/table/table_test.go
@@ -546,12 +546,12 @@ func (t *TableWritingTestSuite)
TestAddFilesUnpartitioned() {
Operation: table.OpAppend,
Properties: iceberg.Properties{
"added-data-files": "5",
- "added-files-size": "3070",
+ "added-files-size": "3590",
"added-records": "5",
"total-data-files": "5",
"total-delete-files": "0",
"total-equality-deletes": "0",
- "total-files-size": "3070",
+ "total-files-size": "3590",
"total-position-deletes": "0",
"total-records": "5",
},
@@ -767,13 +767,13 @@ func (t *TableWritingTestSuite)
TestAddFilesPartitionedTable() {
Operation: table.OpAppend,
Properties: iceberg.Properties{
"added-data-files": "5",
- "added-files-size": "3070",
+ "added-files-size": "3590",
"added-records": "5",
"changed-partition-count": "1",
"total-data-files": "5",
"total-delete-files": "0",
"total-equality-deletes": "0",
- "total-files-size": "3070",
+ "total-files-size": "3590",
"total-position-deletes": "0",
"total-records": "5",
},
@@ -1133,15 +1133,15 @@ func (t *TableWritingTestSuite) TestReplaceDataFiles() {
Operation: table.OpOverwrite,
Properties: iceberg.Properties{
"added-data-files": "1",
- "added-files-size": "963",
+ "added-files-size": "1066",
"added-records": "4",
"deleted-data-files": "2",
"deleted-records": "4",
- "removed-files-size": "1816",
+ "removed-files-size": "2132",
"total-data-files": "4",
"total-delete-files": "0",
"total-equality-deletes": "0",
- "total-files-size": "3687",
+ "total-files-size": "4264",
"total-position-deletes": "0",
"total-records": "10",
},