laskoviymishka commented on code in PR #876:
URL: https://github.com/apache/iceberg-go/pull/876#discussion_r3074148825
##########
manifest.go:
##########
@@ -407,7 +407,7 @@ func (m *manifestFile) FetchEntries(fs iceio.IO,
discardDeleted bool) ([]Manifes
return fetchManifestEntries(m, fs, discardDeleted)
}
-func getFieldIDMap(sc avro.Schema) (map[string]int, map[int]avro.LogicalType,
map[int]int) {
+func getFieldIDMap(sc avro.Schema) (map[string]int, map[int]avro.LogicalType,
map[int]int, map[int]int) {
Review Comment:
Four bare map return values where two of them are map[int]int with
completely different semantics (decimal scale vs fixed byte length). This is
going to be a source of bugs — callers can swap the last two and the compiler
won't catch it.
Worth refactor into a struct, something like:
```
type partitionFieldInfo struct {
nameToID map[string]int
logicalTypes map[int]avro.LogicalType
decimalScales map[int]int
fixedByteSizes map[int]int
}
```
I know the 3-return pattern was already there, but adding a 4th identical
type is where it crosses the line.
##########
manifest.go:
##########
@@ -1650,6 +1667,48 @@ func avroPartitionData(input map[int]any, logicalTypes
map[int]avro.LogicalType)
return out
}
+// convertFixedValue wraps a FixedType partition value for hamba/avro's
+// union encoder. The encoder cannot dispatch a bare [N]byte array inside
+// a union (see https://github.com/hamba/avro/issues/571), but it does
+// accept the explicit {"<fixed-name>": [N]byte} form used elsewhere for
+// UUID and decimal fixed branches. Nil (the other union arm) is passed
+// through untouched.
+func convertFixedValue(v any, size int) any {
+ if v == nil {
+ return map[string]any{"null": nil}
+ }
+
+ switch b := v.(type) {
+ case []byte:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes(b, size), size)}
+ case FixedLiteral:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes([]byte(b), size), size)}
Review Comment:
Is this branch actually reachable? Partition values in the map[int]any that
flows through avroPartitionData come from DataFile.Partition().
When does a FixedLiteral (as opposed to []byte) end up there?
If it's reachable — add a test. If it's speculative — remove it.
Dead code in serialization paths is how bugs hide.
##########
manifest.go:
##########
@@ -1650,6 +1667,48 @@ func avroPartitionData(input map[int]any, logicalTypes
map[int]avro.LogicalType)
return out
}
+// convertFixedValue wraps a FixedType partition value for hamba/avro's
+// union encoder. The encoder cannot dispatch a bare [N]byte array inside
+// a union (see https://github.com/hamba/avro/issues/571), but it does
+// accept the explicit {"<fixed-name>": [N]byte} form used elsewhere for
+// UUID and decimal fixed branches. Nil (the other union arm) is passed
+// through untouched.
+func convertFixedValue(v any, size int) any {
+ if v == nil {
+ return map[string]any{"null": nil}
+ }
+
+ switch b := v.(type) {
+ case []byte:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes(b, size), size)}
+ case FixedLiteral:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes([]byte(b), size), size)}
+ }
+
+ return v
+}
+
+// unwrapFixedValue reverses [convertFixedValue]: hamba/avro surfaces a
+// nullable fixed branch as map[string]any{"<fixed-name>": [N]byte}. We
+// flatten that back into the []byte slice shape Iceberg partition data
+// uses everywhere else.
+func unwrapFixedValue(v any, size int) any {
+ if unionMap, ok := v.(map[string]any); ok {
+ if inner, ok := unionMap[fixedSchemaName(size)]; ok {
+ v = inner
+ }
+ }
+ rv := reflect.ValueOf(v)
+ if rv.Kind() == reflect.Array && rv.Type().Elem().Kind() ==
reflect.Uint8 {
Review Comment:
This works but it's worth a comment that hamba/avro decodes fixed unions as
[N]byte (Go array, not slice), which is why we need reflect here. Without that
context the next reader will wonder why we're using reflect for a byte copy.
##########
manifest.go:
##########
@@ -1650,6 +1667,48 @@ func avroPartitionData(input map[int]any, logicalTypes
map[int]avro.LogicalType)
return out
}
+// convertFixedValue wraps a FixedType partition value for hamba/avro's
+// union encoder. The encoder cannot dispatch a bare [N]byte array inside
+// a union (see https://github.com/hamba/avro/issues/571), but it does
+// accept the explicit {"<fixed-name>": [N]byte} form used elsewhere for
+// UUID and decimal fixed branches. Nil (the other union arm) is passed
+// through untouched.
+func convertFixedValue(v any, size int) any {
+ if v == nil {
+ return map[string]any{"null": nil}
+ }
+
+ switch b := v.(type) {
+ case []byte:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes(b, size), size)}
+ case FixedLiteral:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes([]byte(b), size), size)}
+ }
+
+ return v
+}
+
+// unwrapFixedValue reverses [convertFixedValue]: hamba/avro surfaces a
+// nullable fixed branch as map[string]any{"<fixed-name>": [N]byte}. We
+// flatten that back into the []byte slice shape Iceberg partition data
+// uses everywhere else.
+func unwrapFixedValue(v any, size int) any {
Review Comment:
Is this function on data-critical path?
##########
manifest.go:
##########
@@ -1650,6 +1667,48 @@ func avroPartitionData(input map[int]any, logicalTypes
map[int]avro.LogicalType)
return out
}
+// convertFixedValue wraps a FixedType partition value for hamba/avro's
+// union encoder. The encoder cannot dispatch a bare [N]byte array inside
+// a union (see https://github.com/hamba/avro/issues/571), but it does
+// accept the explicit {"<fixed-name>": [N]byte} form used elsewhere for
+// UUID and decimal fixed branches. Nil (the other union arm) is passed
+// through untouched.
+func convertFixedValue(v any, size int) any {
+ if v == nil {
+ return map[string]any{"null": nil}
+ }
+
+ switch b := v.(type) {
+ case []byte:
+ return map[string]any{fixedSchemaName(size):
convertToFixedArray(padOrTruncateBytes(b, size), size)}
Review Comment:
padOrTruncateBytes was written for decimal two's complement encoding where
padding/truncation is semantically meaningful. For FixedType partition values
the byte slice must be exactly size bytes, if it's not, something is broken
upstream and we should fail loudly, not silently truncate partition data.
A wrong-length slice that gets silently truncated means files land in the
wrong partition and queries return wrong results.
Replace with a strict check:
```
if len(b) != size {
panic(fmt.Sprintf("FixedType partition value has length %d, expected
%d", len(b), size))
}
```
Or return an error if you prefer, but don't silently corrupt.
##########
schema_conversions_test.go:
##########
@@ -147,3 +151,52 @@ func TestPartitionTypeToAvroSchemaNullableAndNonNullable(t
*testing.T) {
assert.Empty(t, encoded)
})
}
+
+// TestFixedPartitionColumnAvroSchema exercises the full encode/decode path
+// for a FixedType partition column. Before #845 was fixed the partition
+// schema emitted Avro "bytes" (a spec violation); the round-trip here
+// pins the behaviour to a true Avro "fixed" branch.
+func TestFixedPartitionColumnAvroSchema(t *testing.T) {
Review Comment:
This test exercises raw avro.Marshal/avro.Unmarshal — it proves hamba/avro
can round-trip the value through the schema. But it doesn't touch the actual
manifest code paths at all:
- getFieldIDMap parsing a fixed schema into fixedPartSizes — not tested
- avroPartitionData dispatching to convertFixedValue — not tested
- ManifestWriter.addEntry threading fixedPartSizes — not tested
- convertAvroValueToIcebergType calling unwrapFixedValue — not tested
Need a manifest-level round-trip test: create a table with a FixedType
partition column, write a manifest entry via ManifestWriter, read it back via
ManifestReader, assert the partition value survives byte-for-byte. That's the
actual code path that was broken.
Also missing: backward compat test — reading a manifest written by old code
(bytes schema for FixedType) to make sure those still decode correctly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]