This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-go.git
The following commit(s) were added to refs/heads/main by this push:
new 261edba7 test(puffin): golden round-trip a deletion-vector-v1 blob (#1041)
261edba7 is described below
commit 261edba7b4afe680473995cb6db12e85a69048c7
Author: Andrei Tserakhau <[email protected]>
AuthorDate: Tue May 12 18:35:09 2026 +0200
test(puffin): golden round-trip a deletion-vector-v1 blob (#1041)
Pins the puffin envelope shape around a deletion-vector-v1 blob without
depending on the in-flight roaring decoder (#866). The fixture is two
files:
- puffin/testdata/deletion-vector-v1-payload.bin — a Java-produced
64-bit roaring DV payload lifted directly from apache/iceberg test
resources (small-alternating-values-position-index.bin, 50 bytes; bitmap
encodes positions 1, 3, 5, 7, 9).
- puffin/testdata/deletion-vector-v1.puffin — the same payload wrapped
by puffin.Writer with the spec-canonical metadata (snapshot-id=-1,
sequence-number=-1, deletion-vector-v1 blob type, referenced-data-file
and cardinality properties).
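
The payload framing described for the first fixture (4-byte BE length,
4-byte 0xD1D33964 magic, serialized roaring bitmap, 4-byte BE CRC32) can be
checked without decoding the bitmap. The following is a minimal sketch, not
code from this commit: frameDV and unframeDV are hypothetical helpers, and it
assumes the length field counts magic plus bitmap and that the CRC32 (IEEE)
covers those same bytes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// dvMagic is the 4-byte magic quoted in the commit message (0xD1D33964).
var dvMagic = []byte{0xD1, 0xD3, 0x39, 0x64}

// frameDV wraps an opaque serialized bitmap in the DV framing.
// Assumption: the length field counts magic + bitmap, and the CRC32 (IEEE)
// is computed over the same bytes.
func frameDV(bitmap []byte) []byte {
	body := append(append([]byte{}, dvMagic...), bitmap...)
	out := make([]byte, 0, 4+len(body)+4)
	out = binary.BigEndian.AppendUint32(out, uint32(len(body)))
	out = append(out, body...)
	out = binary.BigEndian.AppendUint32(out, crc32.ChecksumIEEE(body))
	return out
}

// unframeDV validates the framing and returns the bitmap bytes untouched.
func unframeDV(payload []byte) ([]byte, error) {
	if len(payload) < 12 {
		return nil, errors.New("payload too short")
	}
	n := int(binary.BigEndian.Uint32(payload[:4]))
	if n < 4 || n != len(payload)-8 {
		return nil, fmt.Errorf("length field %d does not match payload size %d", n, len(payload))
	}
	body := payload[4 : 4+n]
	if string(body[:4]) != string(dvMagic) {
		return nil, errors.New("bad magic")
	}
	want := binary.BigEndian.Uint32(payload[4+n:])
	if got := crc32.ChecksumIEEE(body); got != want {
		return nil, fmt.Errorf("crc mismatch: got %08x, want %08x", got, want)
	}
	return body[4:], nil
}

func main() {
	// 38 placeholder bytes stand in for the bitmap; this is not a real roaring bitmap.
	framed := frameDV(make([]byte, 38))
	bitmap, err := unframeDV(framed)
	fmt.Println(len(framed), len(bitmap), err) // prints "50 38 <nil>"
}
```

With a 38-byte bitmap the framed size is 4 + 4 + 38 + 4 = 50 bytes, which
matches the fixture size quoted above.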
The test asserts both layers. Reader side: blob count, type,
spec-mandated invariants, properties, and that ReadBlob round-trips the
inner payload byte-for-byte equal to the standalone Java fixture. Writer
side: regenerates the envelope in-memory and asserts byte equality
against the on-disk fixture. Without the writer-side check the test would
be a reader-only round-trip, and writer drift would silently calcify into
the next regeneration.
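
When a byte-equality pin like this fails, the offset of the first difference
is usually the most useful diagnostic. The following is a minimal sketch of
such a helper; firstDiff is a hypothetical name, not something this commit
adds. It also reports the case where one buffer is a strict prefix of the
other.

```go
package main

import "fmt"

// firstDiff returns the index of the first byte at which a and b differ.
// If one slice is a strict prefix of the other, it returns the length of
// the shorter slice. It returns -1 when the slices are identical.
func firstDiff(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	fmt.Println(firstDiff([]byte("abcd"), []byte("abcd")))  // prints -1
	fmt.Println(firstDiff([]byte("abcd"), []byte("abXd")))  // prints 2
	fmt.Println(firstDiff([]byte("abcd"), []byte("abcde"))) // prints 4
}
```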
Honest framing: this is a Go-writer wire-format pin with a Java-
equivalent inner payload, not a strong Java cross-impl pin. The basic
envelope shape is cross-checked by TestWriterBitIdenticalWithJava, but
that test does not exercise empty Fields arrays or multi-key blob
Properties — both of which this fixture relies on — and JSON key
ordering of blob Properties is encoder-defined.
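
To illustrate why empty Fields arrays and multi-key Properties matter for a
byte-level pin, here is a small sketch using Go's encoding/json. It assumes
the footer is serialized with encoding/json or an encoder that behaves the
same way, which this commit does not state; blobMeta is a hypothetical
stand-in type, not the puffin package's own struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blobMeta is a hypothetical stand-in carrying two footer fields.
type blobMeta struct {
	Fields     []int32           `json:"fields"`
	Properties map[string]string `json:"properties"`
}

// mustJSON marshals v and panics on error, which keeps the demo short.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	props := map[string]string{
		"referenced-data-file": "data/test.parquet",
		"cardinality":          "5",
	}

	// An explicit empty slice marshals as [], and map keys are emitted sorted.
	fmt.Println(mustJSON(blobMeta{Fields: []int32{}, Properties: props}))
	// {"fields":[],"properties":{"cardinality":"5","referenced-data-file":"data/test.parquet"}}

	// A nil slice marshals as null instead.
	fmt.Println(mustJSON(blobMeta{Fields: nil, Properties: props}))
	// {"fields":null,"properties":{"cardinality":"5","referenced-data-file":"data/test.parquet"}}
}
```

encoding/json sorts map keys, so the Go output is deterministic, but another
implementation's encoder may order the same keys differently. That is why
byte equality here pins the Go writer rather than cross-implementation
behavior.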
A generator (puffin/gen_dv_fixture.go, run via `go generate ./puffin/...`)
reproduces the .puffin from the inner payload and self-validates by reading
the blob back before overwriting the on-disk file.
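
The validate-before-overwrite pattern can be sketched generically as follows.
This is an illustration under stated assumptions, not the generator's code:
writeValidated and demo are hypothetical names, and the validator stands in
for the reader round-trip.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// writeValidated overwrites path only if validate accepts data, so a bad
// regeneration never replaces a known-good fixture.
func writeValidated(path string, data []byte, validate func([]byte) error) error {
	if err := validate(data); err != nil {
		return fmt.Errorf("refusing to overwrite %s: %w", path, err)
	}
	return os.WriteFile(path, data, 0o644)
}

// demo returns the file contents after a rejected write and after an
// accepted write, using a throwaway temp directory.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "fixture-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "fixture.bin")
	if err := os.WriteFile(path, []byte("old"), 0o644); err != nil {
		return "", "", err
	}

	reject := func([]byte) error { return errors.New("round-trip mismatch") }
	accept := func([]byte) error { return nil }

	if err := writeValidated(path, []byte("bad"), reject); err == nil {
		return "", "", errors.New("expected the write to be rejected")
	}
	afterRejected, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}

	if err := writeValidated(path, []byte("new"), accept); err != nil {
		return "", "", err
	}
	afterAccepted, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	return string(afterRejected), string(afterAccepted), nil
}

func main() {
	fmt.Println(demo()) // prints "old new <nil>"
}
```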
Closes #1008.
---
puffin/dv_golden_test.go | 167 +++++++++++++++++++++++++
puffin/gen_dv_fixture.go | 110 ++++++++++++++++
puffin/testdata/README.md | 48 ++++++-
puffin/testdata/deletion-vector-v1-payload.bin | Bin 0 -> 50 bytes
puffin/testdata/deletion-vector-v1.puffin | Bin 0 -> 314 bytes
5 files changed, 324 insertions(+), 1 deletion(-)
diff --git a/puffin/dv_golden_test.go b/puffin/dv_golden_test.go
new file mode 100644
index 00000000..2464b582
--- /dev/null
+++ b/puffin/dv_golden_test.go
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package puffin_test
+
+//go:generate go run gen_dv_fixture.go
+
+import (
+ "bytes"
+ "os"
+ "path/filepath"
+ "testing"
+
+ "github.com/apache/iceberg-go/puffin"
+ "github.com/stretchr/testify/assert"
+ "github.com/stretchr/testify/require"
+)
+
+// dvFixturePayloadName is the standalone Java-produced 64-bit roaring DV
+// payload lifted from apache/iceberg's test resources. The inner shape per
+// Iceberg spec: 4-byte BE length, 4-byte 0xD1D33964 magic, serialized
+// roaring bitmap, 4-byte BE CRC32. Source:
+// https://github.com/apache/iceberg/blob/main/core/src/test/resources/org/apache/iceberg/deletes/small-alternating-values-position-index.bin
+const dvFixturePayloadName = "deletion-vector-v1-payload.bin"
+
+// dvFixturePuffinName is the complete puffin file wrapping the Java-produced
+// DV payload. The envelope is what puffin.Writer emits today; this is a
+// Go-writer wire-format pin, not a strong Java cross-impl pin. (The basic
+// puffin envelope shape is cross-checked by TestWriterBitIdenticalWithJava,
+// but that test does not exercise empty Fields arrays or multi-key blob
+// Properties, both of which this fixture relies on. JSON key ordering of
+// blob Properties is also encoder-defined.) Regenerate via
+// `go generate ./puffin/...` (driven by gen_dv_fixture.go).
+const dvFixturePuffinName = "deletion-vector-v1.puffin"
+
+// dvFixtureReferencedDataFile is the placeholder data-file path stored in
+// the blob's properties. Spec: every DV blob carries `referenced-data-file`
+// pointing at the parquet file it deletes from. The exact string is a fixture
+// choice; matching against any specific Java-emitted file would require a
+// matching string, which is not currently checked in upstream.
+const dvFixtureReferencedDataFile = "data/test.parquet"
+
+// dvFixtureCardinality is the cardinality property — the count of deleted
+// row positions encoded inside the roaring bitmap. String form because puffin
+// blob properties are stringly-typed (map[string]string). The bitmap encodes
+// 5 positions: 1, 3, 5, 7, 9.
+const dvFixtureCardinality = "5"
+
+// buildDVFixture returns a puffin envelope wrapping the given Java-produced
+// DV payload, with the canonical metadata for a deletion-vector-v1 blob.
+// gen_dv_fixture.go issues the same Writer calls with the same inputs (keep
+// the two in sync), so any drift in puffin.Writer surfaces as a byte-mismatch
+// against the checked-in fixture rather than calcifying into a regenerated golden.
+func buildDVFixture(t *testing.T, payload []byte) []byte {
+ t.Helper()
+ buf := &bytes.Buffer{}
+ w, err := puffin.NewWriter(buf)
+ require.NoError(t, err)
+ require.NoError(t, w.SetCreatedBy("iceberg-go test fixture"))
+
+ _, err = w.AddBlob(puffin.BlobMetadataInput{
+ Type: puffin.BlobTypeDeletionVector,
+ SnapshotID: -1,
+ SequenceNumber: -1,
+ Fields: []int32{},
+ Properties: map[string]string{
+ "referenced-data-file": dvFixtureReferencedDataFile,
+ "cardinality": dvFixtureCardinality,
+ },
+ }, payload)
+ require.NoError(t, err)
+ require.NoError(t, w.Finish())
+
+ return buf.Bytes()
+}
+
+// TestDeletionVectorPuffinWireFormat is a Go-writer wire-format pin for
+// puffin envelopes wrapping deletion-vector-v1 blobs, with a Java-produced
+// inner payload. Two layers of guarantee:
+//
+// - The inner roaring payload is byte-equal to a Java-produced fixture
+// lifted directly from apache/iceberg test resources. If the puffin
+// reader ever mangles blob bytes on the way out, this fails.
+//
+// - The on-disk envelope bytes are byte-equal to what puffin.Writer
+// re-emits today for the same input. Any drift in the writer (footer
+// JSON shape, key ordering, properties handling, magic placement)
+// surfaces here instead of calcifying silently into the next regen.
+//
+// Independent of #866 (the roaring decoder PR) — uses raw bytes only.
+func TestDeletionVectorPuffinWireFormat(t *testing.T) {
+ puffinBytes, err := os.ReadFile(filepath.Join("testdata", dvFixturePuffinName))
+ require.NoError(t, err, "fixture missing — regenerate with `go generate ./puffin/...`")
+
+ expectedPayload, err := os.ReadFile(filepath.Join("testdata", dvFixturePayloadName))
+ require.NoError(t, err)
+
+ // Writer-side pin: the checked-in envelope must equal what puffin.Writer
+ // produces today for the same input. Without this, writer regressions
+ // slip through the read-side assertions because both sides drift
+ // together.
+ freshBytes := buildDVFixture(t, expectedPayload)
+ if !bytes.Equal(freshBytes, puffinBytes) {
+ diffAt := -1
+ for i := 0; i < len(freshBytes) && i < len(puffinBytes); i++ {
+ if freshBytes[i] != puffinBytes[i] {
+ diffAt = i
+
+ break
+ }
+ }
+ t.Fatalf("checked-in envelope no longer matches puffin.Writer output. "+
+ "First diff at byte %d (fixture=%d bytes, fresh=%d bytes). "+
+ "Either a deliberate format change (regenerate with "+
+ "`go generate ./puffin/...` and review the diff) or a writer regression.",
+ diffAt, len(puffinBytes), len(freshBytes))
+ }
+
+ // Parse the envelope; NewReader is expected to check the magic bytes at file head and tail.
+ r, err := puffin.NewReader(bytes.NewReader(puffinBytes))
+ require.NoError(t, err)
+
+ // Read-side pin: blob count, type, spec-mandated invariants.
+ blobs := r.Blobs()
+ require.Len(t, blobs, 1, "fixture should contain exactly one DV blob")
+
+ blob := blobs[0]
+ assert.Equal(t, puffin.BlobTypeDeletionVector, blob.Type)
+ assert.Equal(t, int64(-1), blob.SnapshotID,
+ "deletion-vector-v1 spec requires snapshot-id=-1")
+ assert.Equal(t, int64(-1), blob.SequenceNumber,
+ "deletion-vector-v1 spec requires sequence-number=-1")
+ // Strict empty (not nil): Java's parser rejects "fields": null. Asserting
+ // the concrete []int32{} value catches a future regression that would
+ // emit null instead of [].
+ assert.Equal(t, []int32{}, blob.Fields,
+ "DV blob fields must be an explicit empty array per spec")
+ assert.Nil(t, blob.CompressionCodec,
+ "this fixture is uncompressed (per fixture choice, not spec)")
+ assert.Len(t, blob.Properties, 2,
+ "DV blob should carry exactly the two spec-canonical properties; "+
+ "a writer regression that emits extra keys must fail here too")
+ assert.Equal(t, dvFixtureReferencedDataFile, blob.Properties["referenced-data-file"])
+ assert.Equal(t, dvFixtureCardinality, blob.Properties["cardinality"])
+ assert.Equal(t, int64(len(expectedPayload)), blob.Length,
+ "blob length should equal the Java-produced payload")
+
+ // Reader returns the inner payload byte-for-byte equal to the Java fixture.
+ got, err := r.ReadBlob(0)
+ require.NoError(t, err)
+ assert.Equal(t, expectedPayload, got.Data,
+ "reader should round-trip the Java-produced payload bytes unmodified")
+}
diff --git a/puffin/gen_dv_fixture.go b/puffin/gen_dv_fixture.go
new file mode 100644
index 00000000..c69e2e1f
--- /dev/null
+++ b/puffin/gen_dv_fixture.go
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//go:build ignore
+
+// Regenerates puffin/testdata/deletion-vector-v1.puffin from the Java-
+// produced inner payload (deletion-vector-v1-payload.bin). Invoked via
+//
+// go generate ./puffin/...
+//
+// After regen, diff the file before committing to confirm only the
+// intended bytes changed:
+//
+// git diff -- puffin/testdata/deletion-vector-v1.puffin
+package main
+
+import (
+ "bytes"
+ "fmt"
+ "log"
+ "os"
+ "path/filepath"
+
+ "github.com/apache/iceberg-go/puffin"
+)
+
+const (
+ payloadName = "deletion-vector-v1-payload.bin"
+ puffinName = "deletion-vector-v1.puffin"
+ referencedDataFile = "data/test.parquet"
+ cardinality = "5"
+)
+
+func main() {
+ // go generate runs in the directory of the file carrying the
+ // //go:generate directive (puffin/); testdata sits alongside.
+ payloadPath := filepath.Join("testdata", payloadName)
+ outPath := filepath.Join("testdata", puffinName)
+
+ payload, err := os.ReadFile(payloadPath)
+ if err != nil {
+ log.Fatalf("read payload: %v", err)
+ }
+
+ buf := &bytes.Buffer{}
+ w, err := puffin.NewWriter(buf)
+ if err != nil {
+ log.Fatalf("new writer: %v", err)
+ }
+ if err := w.SetCreatedBy("iceberg-go test fixture"); err != nil {
+ log.Fatalf("set created-by: %v", err)
+ }
+ if _, err := w.AddBlob(puffin.BlobMetadataInput{
+ Type: puffin.BlobTypeDeletionVector,
+ SnapshotID: -1,
+ SequenceNumber: -1,
+ Fields: []int32{},
+ Properties: map[string]string{
+ "referenced-data-file": referencedDataFile,
+ "cardinality": cardinality,
+ },
+ }, payload); err != nil {
+ log.Fatalf("add blob: %v", err)
+ }
+ if err := w.Finish(); err != nil {
+ log.Fatalf("finish: %v", err)
+ }
+
+ // Self-validate before overwriting: parse what we just produced, read
+ // the blob back, and confirm both the envelope-level invariants and
+ // the inner-payload bytes survive the round-trip. Without the ReadBlob
+ // step a writer bug producing valid-but-mismatched blob offsets/lengths
+ // would still pass parse-only validation and calcify into the fixture.
+ r, err := puffin.NewReader(bytes.NewReader(buf.Bytes()))
+ if err != nil {
+ log.Fatalf("regen produced an unreadable puffin file: %v", err)
+ }
+ if n := len(r.Blobs()); n != 1 {
+ log.Fatalf("regen produced %d blobs, want 1", n)
+ }
+ if got, want := r.Blobs()[0].Type, puffin.BlobTypeDeletionVector; got != want {
+ log.Fatalf("regen produced blob type %q, want %q", got, want)
+ }
+ got, err := r.ReadBlob(0)
+ if err != nil {
+ log.Fatalf("regen produced an unreadable blob: %v", err)
+ }
+ if !bytes.Equal(payload, got.Data) {
+ log.Fatal("regen round-trip mangled the inner payload")
+ }
+
+ if err := os.WriteFile(outPath, buf.Bytes(), 0o644); err != nil {
+ log.Fatalf("write fixture: %v", err)
+ }
+ fmt.Printf("wrote %s (%d bytes)\n", outPath, buf.Len())
+}
diff --git a/puffin/testdata/README.md b/puffin/testdata/README.md
index 084df93c..b39e5d62 100644
--- a/puffin/testdata/README.md
+++ b/puffin/testdata/README.md
@@ -17,5 +17,51 @@ specific language governing permissions and limitations
under the License.
-->
-These test fixture files are canonical Puffin files from the Apache Iceberg Java implementation:
+## Canonical fixtures from apache/iceberg
+
+`empty-puffin-uncompressed.bin`, `sample-metric-data-uncompressed.bin`, and
+`sample-metric-data-compressed-zstd.bin` are canonical Puffin files from the
+Apache Iceberg Java implementation:
https://github.com/apache/iceberg/tree/main/core/src/test/resources/org/apache/iceberg/puffin/v1
+
+## Deletion-vector cross-impl fixtures
+
+`deletion-vector-v1-payload.bin` is a Java-produced 64-bit Roaring deletion
+vector payload lifted directly from apache/iceberg's test resources. 50 bytes
+total: 4-byte BE length, 4-byte 0xD1D33964 magic, serialized roaring bitmap
+(38 bytes), 4-byte BE CRC32. The bitmap encodes 5 deleted positions
+(1, 3, 5, 7, 9). Source:
+https://github.com/apache/iceberg/blob/main/core/src/test/resources/org/apache/iceberg/deletes/small-alternating-values-position-index.bin
+
+`deletion-vector-v1.puffin` wraps that payload in a complete Puffin envelope:
+blob type `deletion-vector-v1`, snapshot-id and sequence-number set to -1
+per spec, with `referenced-data-file` and `cardinality` properties. The
+envelope is what `puffin.Writer` emits today; this is a Go-writer
+wire-format pin, not a strong Java cross-impl pin. The basic envelope shape is
+cross-checked by `TestWriterBitIdenticalWithJava`, but that test does not
+exercise empty `Fields` arrays or multi-key blob `Properties` — both of
+which this fixture relies on — and JSON key ordering of blob `Properties`
+is encoder-defined. The property values
+(`referenced-data-file=data/test.parquet`, `cardinality=5`,
+`created-by="iceberg-go test fixture"`) are fixture choices, not bytes
+inherited from any specific Java-emitted file.
+
+To regenerate after a deliberate puffin-format change:
+
+```
+go generate ./puffin/...
+```
+
+The generator lives in `puffin/gen_dv_fixture.go` (built with the
+`//go:build ignore` tag and run via the `//go:generate` directive in
+`puffin/dv_golden_test.go`). It self-validates by reading the
+freshly-written envelope back before overwriting the on-disk file, so a
+writer bug producing a parseable file whose blob cannot be read back intact
+fails the regen rather than calcifying into the fixture.
+
+After regen, diff the file before committing to verify only intended bytes
+changed:
+
+```
+git diff -- puffin/testdata/deletion-vector-v1.puffin
+```
diff --git a/puffin/testdata/deletion-vector-v1-payload.bin b/puffin/testdata/deletion-vector-v1-payload.bin
new file mode 100644
index 00000000..80829fae
Binary files /dev/null and b/puffin/testdata/deletion-vector-v1-payload.bin differ
diff --git a/puffin/testdata/deletion-vector-v1.puffin b/puffin/testdata/deletion-vector-v1.puffin
new file mode 100644
index 00000000..84c6baf7
Binary files /dev/null and b/puffin/testdata/deletion-vector-v1.puffin differ