This is an automated email from the ASF dual-hosted git repository.

zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-go.git


The following commit(s) were added to refs/heads/main by this push:
     new 350ae727 fix(table): restore Summary/ColumnSizes test expectations to 
arrow-go v18.6.0 (#1114)
350ae727 is described below

commit 350ae7270d7c44578638ef55e7e0392227745e8a
Author: Matt Topol <[email protected]>
AuthorDate: Thu May 21 16:21:23 2026 -0400

    fix(table): restore Summary/ColumnSizes test expectations to arrow-go 
v18.6.0 (#1114)
    
    Restores main CI to green by reverting only the test-expectation byte
    counts that #1102 ("fix(manifest): stop writing deprecated
    distinct_counts") inadvertently changed.
    
    ## What broke
    
    After #1102 merged to main, both the `Go` and `Audit and Verify`
    workflows started failing on every push, across all 6 matrix entries
    (ubuntu/windows/macos × Go 1.25.5/1.26.1):
    
    ```
    --- FAIL: TestTableWriting (10.16s)
        --- FAIL: TestTableWriting/TestAddFilesPartitionedTable
        --- FAIL: TestTableWriting/TestAddFilesUnpartitioned
        --- FAIL: TestTableWriting/TestReplaceDataFiles
        --- FAIL: TestTableWriting/TestAddFilesPartitionedTable#01
        --- FAIL: TestTableWriting/TestAddFilesUnpartitioned#01
        --- FAIL: TestTableWriting/TestReplaceDataFiles#01
    --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch (0.00s)
        --- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/success
        --- FAIL: 
TestPositionDeletePartitionedFanoutWriterProcessBatch/batch_with_records_having_different_file_paths
    ```
    
    with diffs like:
    
    ```
    - "added-files-size": "3590"   (actual on CI / arrow-go v18.6.0)
    + "added-files-size": "3070"   (test source, post-#1102)
    
    - ColumnSizes: 2147483545:88   (actual on CI / arrow-go v18.6.0)
    + ColumnSizes: 2147483545:86   (test source, post-#1102)
    ```
    
    ## Root cause
    
    These two test files assert on **exact byte counts** of the on-disk
    parquet files written by `arrow-go`. Those byte counts depend on the
    parquet writer's metadata encoding, which differs across arrow-go
    versions.
    
    #1102 lowered the expected counts (3590→3070, 1066→963, 1816→2132+→1816,
    4264→3687, 88→86, 174→172, 96→94, 187→185). Those new values only
    reproduce against an unreleased arrow-go build (presumably wired in via
    a local `go.work`). Against the pinned `github.com/apache/arrow-go/v18
    v18.6.0` from `go.sum` — which every CI runner resolves — the writer
    still emits the original larger sizes, so the assertions fail every
    time.
    
    The PR's own CI runs were `CANCELLED` before completion, so this didn't
    surface at merge time.
    
    The functional change in #1102 (dropping `distinct_counts` field-id 111
    from the manifest entry Avro schema) only alters **manifest**
    serialization, not the **data** parquet files whose sizes these
    `Summary`/`ColumnSizes` fields tally — so the test-value updates were
    always unrelated to the production fix and can be safely reverted on
    their own.
    
    ## Fix
    
    Revert only the eight byte-count lines back to the pre-#1102 values
    verified via `git show 51b3140^`:
    
    | File | Test | Field | Before | #1102 | This PR |
    |---|---|---|---|---|---|
    | `table/table_test.go:549,554` | `TestAddFilesUnpartitioned` |
    `added-/total-files-size` | 3590 | 3070 | **3590** |
    | `table/table_test.go:770,776` | `TestAddFilesPartitionedTable` |
    `added-/total-files-size` | 3590 | 3070 | **3590** |
    | `table/table_test.go:1136,1140,1144` | `TestReplaceDataFiles` |
    `added-/removed-/total-files-size` | 1066/2132/4264 | 963/1816/3687 |
    **1066/2132/4264** |
    | `table/pos_delete_partitioned_fanout_writer_test.go:77` | `success` |
    `ColumnSizes` | 88,174 | 86,172 | **88,174** |
    | `table/pos_delete_partitioned_fanout_writer_test.go:87` |
    `batch_with_records_having_different_file_paths` | `ColumnSizes` |
    96,187 | 94,185 | **96,187** |
    
    No production code changes — #1102's manifest fix is preserved intact.
    
    ## Verification
    
    Reproduced CI failure locally with `GOWORK=off go test ./...` (forces
    use of pinned v18.6.0):
    
    - **Before this PR**: same 8 sub-test failures with identical
    expected/actual diffs as CI.
    - **After this PR**: `ok` across every package — `iceberg-go`,
    `catalog/{glue,hadoop,hive,rest,sql}`, `cmd/iceberg`, `codec`, `config`,
    `internal`, `io`, `io/gocloud`, `puffin`, `table`,
    `table/{compaction,dv,internal,substrait}`, `view`, `view/internal`.
    - `go vet ./...` clean, `go build ./...` clean, LSP diagnostics clean on
    both touched files.
    
    ## Followup (out of scope)
    
    These tests are inherently brittle — they'll keep breaking on every
    arrow-go bump that nudges parquet metadata encoding. A future cleanup
    could replace exact byte assertions with bounds (`> 0`, monotonic
    relationships between added/removed/total) or assert on row counts only.
    Not addressing here to keep the diff minimal and unblock main.
---
 table/pos_delete_partitioned_fanout_writer_test.go |  4 ++--
 table/table_test.go                                | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/table/pos_delete_partitioned_fanout_writer_test.go 
b/table/pos_delete_partitioned_fanout_writer_test.go
index 7b6e5284..83b6fc42 100644
--- a/table/pos_delete_partitioned_fanout_writer_test.go
+++ b/table/pos_delete_partitioned_fanout_writer_test.go
@@ -74,7 +74,7 @@ func TestPositionDeletePartitionedFanoutWriterProcessBatch(t 
*testing.T) {
                        name:                   "success",
                        pathToPartitionContext: 
map[string]partitionContext{"file://namespace/age_bucket=1/test.parquet": 
{partitionData: map[int]any{iceberg.PartitionDataIDStart: 1}, specID: 0}},
                        input:                  
mustLoadRecordBatchFromJSON(PositionalDeleteArrowSchema, `[{"file_path": 
"file://namespace/age_bucket=1/test.parquet", "pos": 100}]`),
-                       expectedDataFile:       &mockDataFile{columnSizes: 
map[int]int64{2147483545: 86, 2147483546: 172}, format: iceberg.ParquetFile, 
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 1, specid: 0, 
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
+                       expectedDataFile:       &mockDataFile{columnSizes: 
map[int]int64{2147483545: 88, 2147483546: 174}, format: iceberg.ParquetFile, 
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 1, specid: 0, 
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
                },
                // This test case illustrates how the 
positionDeletePartitionedFanoutWriter does not validate that all records
                // in a batch have the same file path. Doing so would be 
prohibitive in the current implementation and
@@ -84,7 +84,7 @@ func TestPositionDeletePartitionedFanoutWriterProcessBatch(t 
*testing.T) {
                        name:                   "batch with records having 
different file paths",
                        pathToPartitionContext: 
map[string]partitionContext{"file://namespace/age_bucket=1/test.parquet": 
{partitionData: map[int]any{iceberg.PartitionDataIDStart: 1}, specID: 0}},
                        input:                  
mustLoadRecordBatchFromJSON(PositionalDeleteArrowSchema, `[{"file_path": 
"file://namespace/age_bucket=1/test.parquet", "pos": 100}, {"file_path": 
"file://namespace/age_bucket=0/test.parquet", "pos": 10}]`),
-                       expectedDataFile:       &mockDataFile{columnSizes: 
map[int]int64{2147483545: 94, 2147483546: 185}, format: iceberg.ParquetFile, 
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 2, specid: 0, 
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
+                       expectedDataFile:       &mockDataFile{columnSizes: 
map[int]int64{2147483545: 96, 2147483546: 187}, format: iceberg.ParquetFile, 
partition: map[int]any{iceberg.PartitionDataIDStart: 1}, count: 2, specid: 0, 
contentType: iceberg.EntryContentPosDeletes, sortOrderID: ptr(1)},
                },
        }
 
diff --git a/table/table_test.go b/table/table_test.go
index adf8c052..17c5bd16 100644
--- a/table/table_test.go
+++ b/table/table_test.go
@@ -546,12 +546,12 @@ func (t *TableWritingTestSuite) 
TestAddFilesUnpartitioned() {
                        Operation: table.OpAppend,
                        Properties: iceberg.Properties{
                                "added-data-files":       "5",
-                               "added-files-size":       "3070",
+                               "added-files-size":       "3590",
                                "added-records":          "5",
                                "total-data-files":       "5",
                                "total-delete-files":     "0",
                                "total-equality-deletes": "0",
-                               "total-files-size":       "3070",
+                               "total-files-size":       "3590",
                                "total-position-deletes": "0",
                                "total-records":          "5",
                        },
@@ -767,13 +767,13 @@ func (t *TableWritingTestSuite) 
TestAddFilesPartitionedTable() {
                        Operation: table.OpAppend,
                        Properties: iceberg.Properties{
                                "added-data-files":        "5",
-                               "added-files-size":        "3070",
+                               "added-files-size":        "3590",
                                "added-records":           "5",
                                "changed-partition-count": "1",
                                "total-data-files":        "5",
                                "total-delete-files":      "0",
                                "total-equality-deletes":  "0",
-                               "total-files-size":        "3070",
+                               "total-files-size":        "3590",
                                "total-position-deletes":  "0",
                                "total-records":           "5",
                        },
@@ -1133,15 +1133,15 @@ func (t *TableWritingTestSuite) TestReplaceDataFiles() {
                Operation: table.OpOverwrite,
                Properties: iceberg.Properties{
                        "added-data-files":       "1",
-                       "added-files-size":       "963",
+                       "added-files-size":       "1066",
                        "added-records":          "4",
                        "deleted-data-files":     "2",
                        "deleted-records":        "4",
-                       "removed-files-size":     "1816",
+                       "removed-files-size":     "2132",
                        "total-data-files":       "4",
                        "total-delete-files":     "0",
                        "total-equality-deletes": "0",
-                       "total-files-size":       "3687",
+                       "total-files-size":       "4264",
                        "total-position-deletes": "0",
                        "total-records":          "10",
                },

Reply via email to