Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

via GitHub Tue, 27 May 2025 11:42:02 -0700


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2109823794



##########
parquet/variant/variant.go:
##########
@@ -0,0 +1,722 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+       "bytes"
+       "encoding/binary"
+       "encoding/json"
+       "errors"
+       "fmt"
+       "iter"
+       "maps"
+       "slices"
+       "strings"
+       "time"
+       "unsafe"
+
+       "github.com/apache/arrow-go/v18/arrow"
+       "github.com/apache/arrow-go/v18/arrow/decimal"
+       "github.com/apache/arrow-go/v18/arrow/decimal128"
+       "github.com/apache/arrow-go/v18/parquet/internal/debug"
+       "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+       BasicUndefined   BasicType = iota - 1 // Unknown
+       BasicPrimitive                        // Primitive
+       BasicShortString                      // ShortString
+       BasicObject                           // Object
+       BasicArray                            // Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+       // because we're doing hdr & 0x3, it is impossible for the result
+       // to be outside of the range of BasicType. Therefore, we don't
+       // need to perform any checks. The value will always be [0,3]
+       return BasicType(hdr & basicTypeMask)
+}
+
+// PrimitiveType represents specific primitive data types within the variant 
format.
+type PrimitiveType int
+
+const (
+       PrimitiveInvalid            PrimitiveType = iota - 1 // Unknown
+       PrimitiveNull                                        // Null
+       PrimitiveBoolTrue                                    // BoolTrue
+       PrimitiveBoolFalse                                   // BoolFalse
+       PrimitiveInt8                                        // Int8
+       PrimitiveInt16                                       // Int16
+       PrimitiveInt32                                       // Int32
+       PrimitiveInt64                                       // Int64
+       PrimitiveDouble                                      // Double
+       PrimitiveDecimal4                                    // Decimal32
+       PrimitiveDecimal8                                    // Decimal64
+       PrimitiveDecimal16                                   // Decimal128
+       PrimitiveDate                                        // Date
+       PrimitiveTimestampMicros                             // 
Timestamp(micros)
+       PrimitiveTimestampMicrosNTZ                          // 
TimestampNTZ(micros)
+       PrimitiveFloat                                       // Float
+       PrimitiveBinary                                      // Binary
+       PrimitiveString                                      // String
+       PrimitiveTimeMicrosNTZ                               // TimeNTZ(micros)
+       PrimitiveTimestampNanos                              // Timestamp(nanos)
+       PrimitiveTimestampNanosNTZ                           // 
TimestampNTZ(nanos)
+       PrimitiveUUID                                        // UUID
+)
+
+func primitiveTypeFromHeader(hdr byte) PrimitiveType {
+       return PrimitiveType((hdr >> basicTypeBits) & typeInfoMask)
+}
+
+// Type represents the high-level variant data type.
+// This is what applications typically use to identify the type of a variant 
value.
+type Type int
+
+const (
+       Object Type = iota
+       Array
+       Null
+       Bool
+       Int8
+       Int16
+       Int32
+       Int64
+       String
+       Double
+       Decimal4
+       Decimal8
+       Decimal16
+       Date
+       TimestampMicros
+       TimestampMicrosNTZ
+       Float
+       Binary
+       Time
+       TimestampNanos
+       TimestampNanosNTZ
+       UUID
+)
+
+const (
+       versionMask        uint8 = 0x0F
+       sortedStrMask      uint8 = 0b10000
+       basicTypeMask      uint8 = 0x3
+       basicTypeBits      uint8 = 2
+       typeInfoMask       uint8 = 0x3F
+       hdrSizeBytes             = 1
+       minOffsetSizeBytes       = 1
+       maxOffsetSizeBytes       = 4
+
+       // mask is applied after shift
+       offsetSizeMask     uint8 = 0b11
+       offsetSizeBitShift uint8 = 6
+       supportedVersion         = 1
+       maxShortStringSize       = 0x3F
+       maxSizeLimit             = 128 * 1024 * 1024 // 128MB

Review Comment:
   Nit: Call this `metadataMaxSize` or something similar (looks like it only 
applies to the metadata section)
   
   Also, is there _actually_ a max size for the metadata? You can have ~4.29B 
entries max per the spec- not saying that you _should_, but still, it feels 
conceivable that you can have a valid metadata that's beyond 128MB.



##########
parquet/variant/basic_type_string.go:
##########


Review Comment:
   Nit: Could you rename this to `basic_type_stringer.go` or something similar? 
My first reaction to this was that it contained utilities to parse Variant 
string types (same with `primitive_type_string.go`) :)



##########
parquet/variant/doc.go:
##########
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+//     var b variant.Builder
+//     b.Append(map[string]any{
+//         "id": 123,
+//         "name": "example",
+//         "data": []any{1, 2, 3},
+//     })
+//     value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+//     v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+//     switch v.Type() {
+//     case variant.Object:
+//         obj := v.Value().(variant.ObjectValue)
+//         field, err := obj.ValueByKey("name")
+//     case variant.Array:
+//         arr := v.Value().(variant.ArrayValue)
+//         elem, err := arr.Value(0)
+//     case variant.String:
+//         s := v.Value().(string)
+//     case variant.Int64:
+//         i := v.Value().(int64)
+//     }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+//     switch val := v.Value().(type) {
+//     case nil:
+//       // ...
+//     case int32:
+//       // ...
+//     case string:
+//       // ...
+//     case variant.ArrayValue:
+//       for i, item := range val.Values() {
+//         // item is a variant.Value
+//       }
+//     case variant.ObjectValue:
+//       for k, item := range val.Values() {
+//         // k is the field key
+//         // item is a variant.Value for that field
+//       }
+//     }
+//
+// Values can also be converted to JSON:
+//
+//     jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+//     var b variant.Builder

Review Comment:
   This interface feels a bit clunky, IMO. Keeping track of the fields and 
offsets feels like it's leaking implementation details of the Variant spec to 
the user here, plus it opens the door for the user to do something unexpected 
like deleting a field from the `fields` slice before calling `FinishObject()`. 
Tracing through the code, I don't think that's catastrophic (you'll just have 
extra data in the data section that isn't referenced by the field and key 
dict), but still doesn't feel right.
   
   While I generally prefer returning a builder for any nested object (and 
forcing the building of each nested thing when calling `Finish*()`, I'll also 
defer to you as this may be a pattern that's more common throughout this 
codebase.



##########
parquet/variant/variant.go:
##########
@@ -0,0 +1,722 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+       "bytes"
+       "encoding/binary"
+       "encoding/json"
+       "errors"
+       "fmt"
+       "iter"
+       "maps"
+       "slices"
+       "strings"
+       "time"
+       "unsafe"
+
+       "github.com/apache/arrow-go/v18/arrow"
+       "github.com/apache/arrow-go/v18/arrow/decimal"
+       "github.com/apache/arrow-go/v18/arrow/decimal128"
+       "github.com/apache/arrow-go/v18/parquet/internal/debug"
+       "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+       BasicUndefined   BasicType = iota - 1 // Unknown
+       BasicPrimitive                        // Primitive
+       BasicShortString                      // ShortString
+       BasicObject                           // Object
+       BasicArray                            // Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+       // because we're doing hdr & 0x3, it is impossible for the result
+       // to be outside of the range of BasicType. Therefore, we don't
+       // need to perform any checks. The value will always be [0,3]
+       return BasicType(hdr & basicTypeMask)
+}
+
+// PrimitiveType represents specific primitive data types within the variant 
format.
+type PrimitiveType int
+
+const (
+       PrimitiveInvalid            PrimitiveType = iota - 1 // Unknown

Review Comment:
   I generally make the zero value invalid, that way you don't accidentally 
allow for an uninitialized value to do something wonky.



##########
parquet/variant/builder.go:
##########
@@ -0,0 +1,852 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+       "bytes"
+       "cmp"
+       "encoding/binary"
+       "errors"
+       "fmt"
+       "io"
+       "math"
+       "reflect"
+       "slices"
+       "strings"
+       "time"
+       "unsafe"
+
+       "github.com/apache/arrow-go/v18/arrow"
+       "github.com/apache/arrow-go/v18/arrow/decimal"
+       "github.com/google/uuid"
+)
+
+// Builder is used to construct Variant values by appending data of various 
types.
+// It manages an internal buffer for the value data and a dictionary for field 
keys.
+type Builder struct {
+       buf             bytes.Buffer
+       dict            map[string]uint32
+       dictKeys        [][]byte
+       allowDuplicates bool
+}
+
+// SetAllowDuplicates controls whether duplicate keys are allowed in objects.
+// When true, the last value for a key is used. When false, an error is 
returned
+// if a duplicate key is detected.
+func (b *Builder) SetAllowDuplicates(allow bool) {
+       b.allowDuplicates = allow
+}
+
+// AddKey adds a key to the builder's dictionary and returns its ID.
+// If the key already exists in the dictionary, its existing ID is returned.
+func (b *Builder) AddKey(key string) (id uint32) {
+       if b.dict == nil {
+               b.dict = make(map[string]uint32)
+               b.dictKeys = make([][]byte, 0, 16)
+       }
+
+       var ok bool
+       if id, ok = b.dict[key]; ok {
+               return id
+       }
+
+       id = uint32(len(b.dictKeys))
+       b.dict[key] = id
+       b.dictKeys = append(b.dictKeys, unsafe.Slice(unsafe.StringData(key), 
len(key)))
+
+       return id
+}
+
+// AppendOpt represents options for appending time-related values. These are 
only
+// used when using the generic Append method that takes an interface{}.
+type AppendOpt int16
+
+const (
+       // OptTimestampNano specifies that timestamps should use nanosecond 
precision,
+       // otherwise microsecond precision is used.
+       OptTimestampNano AppendOpt = 1 << iota
+       // OptTimestampUTC specifies that timestamps should be in UTC timezone, 
otherwise
+       // no time zone (NTZ) is used.
+       OptTimestampUTC
+       // OptTimeAsDate specifies that time.Time values should be encoded as 
dates
+       OptTimeAsDate
+       // OptTimeAsTime specifies that time.Time values should be encoded as a 
time value
+       OptTimeAsTime
+)
+
+func extractFieldInfo(f reflect.StructField) (name string, o AppendOpt) {
+       tag := f.Tag.Get("variant")
+       if tag == "" {
+               return f.Name, 0
+       }
+
+       parts := strings.Split(tag, ",")
+       if len(parts) == 1 {
+               return parts[0], 0
+       }
+
+       name = parts[0]
+       if name == "" {
+               name = f.Name
+       }
+
+       for _, opt := range parts[1:] {
+               switch strings.ToLower(opt) {
+               case "nanos":
+                       o |= OptTimestampNano
+               case "utc":
+                       o |= OptTimestampUTC
+               case "date":
+                       o |= OptTimeAsDate
+               case "time":
+                       o |= OptTimeAsTime
+               }
+       }
+
+       return name, o
+}
+
+// Append adds a value of any supported type to the builder.
+//
+// Any basic primitive type is supported, the AppendOpt options are used to 
control how
+// timestamps are appended (e.g., as microseconds or nanoseconds and 
timezone). The options
+// also control how a [time.Time] value is appended (e.g., as a date, 
timestamp, or time).
+//
+// Appending a value with type `[]any` will construct an array appropriately, 
appending
+// each element. Calling with a map[string]any will construct an object, 
recursively calling
+// Append for each value, propagating the options.
+//
+// For other types (arbitrary slices, arrays, maps and structs), reflection is 
used to determine
+// the type and whether we can append it. A nil pointer will append a null, 
while a non-nil
+// pointer will append the value that it points to.
+//
+// For structs, field tags can be used to control the field names and options. 
Only exported
+// fields are considered, with the field name being used as the key. A struct 
tag of `variant`
+// can be used with the following format and options:
+//
+//                     type MyStruct struct {
+//                             Field1    string    `variant:"key"`           
// Use "key" instead of "Field1" as the field name
+//                             Field2    time.Time `variant:"day,date"`      
// Use "day" instead of "Field2" as the field name
+//                                                                   // append 
this value as a "date" value
+//              Time      time.Time `variant:",time"`         // Use "Time" as 
the field name, append the value as
+//                                                                   // a 
"time" value
+//              Field3    int       `variant:"-"`             // Ignore this 
field
+//              Timestamp time.Time `variant:"ts"`            // Use "ts" as 
the field name, append value as a
+//                                                                   // 
timestamp(UTC=false,MICROS)
+//                         Ts2       time.Time `variant:"ts2,nanos,utc"` // 
Use "ts2" as the field name, append value as a
+//                                                                             
                                          // timestamp(UTC=true,NANOS)
+//                     }
+//
+// There is only one case where options can conflict currently: If both 
[OptTimeAsDate] and
+// [OptTimeAsTime] are set, then [OptTimeAsDate] will take precedence.
+//
+// Options specified in the struct tags will be OR'd with any options passed 
to the original call
+// to Append. As a result, if a Struct field tag sets [OptTimeAsTime], but the 
call to Append
+// passes [OptTimeAsDate], then the value will be appended as a date since 
that option takes
+// precedence.
+func (b *Builder) Append(v any, opts ...AppendOpt) error {
+       var o AppendOpt
+       for _, opt := range opts {
+               o |= opt
+       }
+
+       return b.append(v, o)
+}
+
+func (b *Builder) append(v any, o AppendOpt) error {
+       switch v := v.(type) {
+       case nil:
+               return b.AppendNull()
+       case bool:
+               return b.AppendBool(v)
+       case int8:
+               return b.AppendInt(int64(v))
+       case uint8:
+               return b.AppendInt(int64(v))
+       case int16:
+               return b.AppendInt(int64(v))
+       case uint16:
+               return b.AppendInt(int64(v))
+       case int32:
+               return b.AppendInt(int64(v))
+       case uint32:
+               return b.AppendInt(int64(v))
+       case int64:
+               return b.AppendInt(v)
+       case int:
+               return b.AppendInt(int64(v))
+       case uint:
+               return b.AppendInt(int64(v))
+       case float32:
+               return b.AppendFloat32(v)
+       case float64:
+               return b.AppendFloat64(v)
+       case arrow.Date32:
+               return b.AppendDate(v)
+       case arrow.Time64:
+               return b.AppendTimeMicro(v)
+       case arrow.Timestamp:
+               return b.AppendTimestamp(v, o&OptTimestampNano == 0, 
o&OptTimestampUTC != 0)
+       case []byte:
+               return b.AppendBinary(v)
+       case string:
+               return b.AppendString(v)
+       case uuid.UUID:
+               return b.AppendUUID(v)
+       case time.Time:
+               switch {
+               case o&OptTimeAsDate != 0:
+                       return b.AppendDate(arrow.Date32FromTime(v))
+               case o&OptTimeAsTime != 0:
+                       t := v.Sub(v.Truncate(24 * time.Hour))
+                       return b.AppendTimeMicro(arrow.Time64(t.Microseconds()))
+               default:
+                       unit := arrow.Microsecond
+                       if o&OptTimestampNano != 0 {
+                               unit = arrow.Nanosecond
+                       }
+
+                       if o&OptTimestampUTC != 0 {
+                               v = v.UTC()
+                       }
+
+                       t, err := arrow.TimestampFromTime(v, unit)
+                       if err != nil {
+                               return err
+                       }
+
+                       return b.AppendTimestamp(t, o&OptTimestampNano == 0, 
o&OptTimestampUTC != 0)
+               }
+       case DecimalValue[decimal.Decimal32]:
+               return b.AppendDecimal4(v.Scale, v.Value.(decimal.Decimal32))
+       case DecimalValue[decimal.Decimal64]:
+               return b.AppendDecimal8(v.Scale, v.Value.(decimal.Decimal64))
+       case DecimalValue[decimal.Decimal128]:
+               return b.AppendDecimal16(v.Scale, v.Value.(decimal.Decimal128))
+       case []any:
+               start, offsets := b.Offset(), make([]int, 0, len(v))
+               for _, item := range v {
+                       offsets = append(offsets, b.NextElement(start))
+                       if err := b.append(item, o); err != nil {
+                               return err
+                       }
+               }
+               return b.FinishArray(start, offsets)
+       case map[string]any:
+               start, fields := b.Offset(), make([]FieldEntry, 0, len(v))
+               for key, item := range v {
+                       fields = append(fields, b.NextField(start, key))
+                       if err := b.append(item, o); err != nil {
+                               return err
+                       }
+               }
+               return b.FinishObject(start, fields)
+       default:
+               // attempt to use reflection before we give up!
+               val := reflect.ValueOf(v)
+               switch val.Kind() {
+               case reflect.Pointer, reflect.Interface:
+                       if val.IsNil() {
+                               return b.AppendNull()
+                       }
+                       return b.append(val.Elem().Interface(), o)
+               case reflect.Array, reflect.Slice:
+                       start, offsets := b.Offset(), make([]int, 0, val.Len())
+                       for _, item := range val.Seq2() {
+                               offsets = append(offsets, b.NextElement(start))
+                               if err := b.append(item.Interface(), o); err != 
nil {
+                                       return err
+                               }
+                       }
+                       return b.FinishArray(start, offsets)
+               case reflect.Map:
+                       if val.Type().Key().Kind() != reflect.String {
+                               return fmt.Errorf("unsupported map key type: 
%s", val.Type().Key())
+                       }
+
+                       start, fields := b.Offset(), make([]FieldEntry, 0, 
val.Len())
+                       for k, v := range val.Seq2() {
+                               fields = append(fields, b.NextField(start, 
k.String()))
+                               if err := b.append(v.Interface(), o); err != 
nil {
+                                       return err
+                               }
+                       }
+                       return b.FinishObject(start, fields)
+               case reflect.Struct:
+                       start, fields := b.Offset(), make([]FieldEntry, 0, 
val.NumField())
+
+                       typ := val.Type()
+                       for i := range typ.NumField() {
+                               f := typ.Field(i)
+                               if !f.IsExported() {
+                                       continue
+                               }
+
+                               name, opt := extractFieldInfo(f)
+                               if name == "-" {
+                                       continue
+                               }
+
+                               fields = append(fields, b.NextField(start, 
name))
+                               if err := b.append(val.Field(i).Interface(), 
o|opt); err != nil {
+                                       return err
+                               }
+                       }
+                       return b.FinishObject(start, fields)
+               }
+       }
+       return fmt.Errorf("cannot append unsupported type to variant: %T", v)
+}
+
+// AppendNull appends a null value to the builder.
+func (b *Builder) AppendNull() error {
+       return b.buf.WriteByte(primitiveHeader(PrimitiveNull))
+}
+
+// AppendBool appends a boolean value to the builder.
+func (b *Builder) AppendBool(v bool) error {
+       var t PrimitiveType
+       if v {
+               t = PrimitiveBoolTrue
+       } else {
+               t = PrimitiveBoolFalse
+       }
+
+       return b.buf.WriteByte(primitiveHeader(t))
+}
+
+type primitiveNumeric interface {
+       int8 | int16 | int32 | int64 | float32 | float64 |
+               arrow.Date32 | arrow.Time64
+}
+
+type buffer interface {
+       io.Writer
+       io.ByteWriter
+}
+
+func writeBinary[T string | []byte](w buffer, v T) error {
+       var t PrimitiveType
+       switch any(v).(type) {
+       case string:
+               t = PrimitiveString
+       case []byte:
+               t = PrimitiveBinary
+       }
+
+       if err := w.WriteByte(primitiveHeader(t)); err != nil {
+               return err
+       }
+
+       if err := binary.Write(w, binary.LittleEndian, uint32(len(v))); err != 
nil {
+               return err
+       }
+
+       _, err := w.Write([]byte(v))
+       return err
+}
+
+func writeNumeric[T primitiveNumeric](w buffer, v T) error {
+       var t PrimitiveType
+       switch any(v).(type) {
+       case int8:
+               t = PrimitiveInt8
+       case int16:
+               t = PrimitiveInt16
+       case int32:
+               t = PrimitiveInt32
+       case int64:
+               t = PrimitiveInt64
+       case float32:
+               t = PrimitiveFloat
+       case float64:
+               t = PrimitiveDouble
+       case arrow.Date32:
+               t = PrimitiveDate
+       case arrow.Time64:
+               t = PrimitiveTimeMicrosNTZ
+       }
+
+       if err := w.WriteByte(primitiveHeader(t)); err != nil {
+               return err
+       }
+
+       return binary.Write(w, binary.LittleEndian, v)
+}
+
+// AppendInt appends an integer value to the builder, using the smallest
+// possible integer representation based on the value's range.
+func (b *Builder) AppendInt(v int64) error {
+       b.buf.Grow(9)
+       switch {
+       case v >= math.MinInt8 && v <= math.MaxInt8:
+               return writeNumeric(&b.buf, int8(v))
+       case v >= math.MinInt16 && v <= math.MaxInt16:
+               return writeNumeric(&b.buf, int16(v))
+       case v >= math.MinInt32 && v <= math.MaxInt32:
+               return writeNumeric(&b.buf, int32(v))
+       default:
+               return writeNumeric(&b.buf, v)
+       }
+}
+
+// AppendFloat32 appends a 32-bit floating point value to the builder.
+func (b *Builder) AppendFloat32(v float32) error {
+       b.buf.Grow(5)
+       return writeNumeric(&b.buf, v)
+}
+
+// AppendFloat64 appends a 64-bit floating point value to the builder.
+func (b *Builder) AppendFloat64(v float64) error {
+       b.buf.Grow(9)
+       return writeNumeric(&b.buf, v)
+}
+
+// AppendDate appends a date value to the builder.
+func (b *Builder) AppendDate(v arrow.Date32) error {
+       b.buf.Grow(5)
+       return writeNumeric(&b.buf, v)
+}
+
+// AppendTimeMicro appends a time value with microsecond precision to the 
builder.
+func (b *Builder) AppendTimeMicro(v arrow.Time64) error {
+       b.buf.Grow(9)
+       return writeNumeric(&b.buf, v)
+}
+
+// AppendTimestamp appends a timestamp value to the builder.
+// The useMicros parameter controls whether microsecond or nanosecond 
precision is used.
+// The useUTC parameter controls whether the timestamp is in UTC timezone or 
has no time zone (NTZ).
+func (b *Builder) AppendTimestamp(v arrow.Timestamp, useMicros, useUTC bool) 
error {
+       b.buf.Grow(9)
+       var t PrimitiveType
+       if useMicros {
+               t = PrimitiveTimestampMicrosNTZ
+       } else {
+               t = PrimitiveTimestampNanosNTZ
+       }
+
+       if useUTC {
+               t--
+       }
+
+       if err := b.buf.WriteByte(primitiveHeader(t)); err != nil {
+               return err
+       }
+
+       return binary.Write(&b.buf, binary.LittleEndian, v)
+}
+
+// AppendBinary appends a binary value to the builder.
+func (b *Builder) AppendBinary(v []byte) error {
+       b.buf.Grow(5 + len(v))
+       return writeBinary(&b.buf, v)
+}
+
+// AppendString appends a string value to the builder.
+// Small strings are encoded using the short string representation if small 
enough.
+func (b *Builder) AppendString(v string) error {
+       if len(v) > maxShortStringSize {
+               b.buf.Grow(5 + len(v))
+               return writeBinary(&b.buf, v)
+       }
+
+       b.buf.Grow(1 + len(v))
+       if err := b.buf.WriteByte(shortStrHeader(len(v))); err != nil {
+               return err
+       }
+
+       _, err := b.buf.WriteString(v)
+       return err
+}
+
+// AppendUUID appends a UUID value to the builder.
+func (b *Builder) AppendUUID(v uuid.UUID) error {
+       b.buf.Grow(17)
+       if err := b.buf.WriteByte(primitiveHeader(PrimitiveUUID)); err != nil {
+               return err
+       }
+
+       m, _ := v.MarshalBinary()
+       _, err := b.buf.Write(m)
+       return err
+}
+
+// AppendDecimal4 appends a 4-byte decimal value with the specified scale to 
the builder.
+func (b *Builder) AppendDecimal4(scale uint8, v decimal.Decimal32) error {
+       b.buf.Grow(6)
+       if err := b.buf.WriteByte(primitiveHeader(PrimitiveDecimal4)); err != 
nil {
+               return err
+       }
+
+       if err := b.buf.WriteByte(scale); err != nil {
+               return err
+       }
+
+       return binary.Write(&b.buf, binary.LittleEndian, int32(v))
+}
+
+// AppendDecimal8 appends a 8-byte decimal value with the specified scale to 
the builder.
+func (b *Builder) AppendDecimal8(scale uint8, v decimal.Decimal64) error {
+       b.buf.Grow(10)
+       return errors.Join(
+               b.buf.WriteByte(primitiveHeader(PrimitiveDecimal8)),
+               b.buf.WriteByte(scale),
+               binary.Write(&b.buf, binary.LittleEndian, int64(v)),
+       )
+}
+
+// AppendDecimal16 appends a 16-byte decimal value with the specified scale to 
the builder.
+func (b *Builder) AppendDecimal16(scale uint8, v decimal.Decimal128) error {
+       b.buf.Grow(18)
+       return errors.Join(
+               b.buf.WriteByte(primitiveHeader(PrimitiveDecimal16)),
+               b.buf.WriteByte(scale),
+               binary.Write(&b.buf, binary.LittleEndian, v.LowBits()),
+               binary.Write(&b.buf, binary.LittleEndian, v.HighBits()),
+       )
+}
+
+// Offset returns the current offset in the builder's buffer. Generally used 
for
+// grabbing a starting point for building an array or object.
+func (b *Builder) Offset() int {
+       return b.buf.Len()
+}
+
+// NextElement returns the offset of the next element relative to the start 
position.
+// Use when building arrays to track element positions. The following creates 
a variant
+// equivalent to `[5, 10]`.
+//
+//     var b variant.Builder
+//     start, offsets := b.Offset(), make([]int, 0)
+//     offsets = append(offsets, b.NextElement(start))
+//     b.Append(5)
+//     offsets = append(offsets, b.NextElement(start))
+//     b.Append(10)
+//     b.FinishArray(start, offsets)
+//
+// The value returned by this is equivalent to `b.Offset() - start`, as 
offsets are all
+// relative to the start position. This allows for creating nested arrays, the 
following
+// creates a variant equivalent to `[5, [10, 20], 30]`.
+//
+//     var b variant.Builder
+//     start, offsets := b.Offset(), make([]int, 0)
+//     offsets = append(offsets, b.NextElement(start))
+//     b.Append(5)
+//     offsets = append(offsets, b.NextElement(start))
+//
+//     nestedStart, nestedOffsets := b.Offset(), make([]int, 0)
+//     nestedOffsets = append(nestedOffsets, b.NextElement(nestedStart))
+//     b.Append(10)
+//     nestedOffsets = append(nestedOffsets, b.NextElement(nestedStart))
+//     b.Append(20)
+//     b.FinishArray(nestedStart, nestedOffsets)
+//
+//     offsets = append(offsets, b.NextElement(start))
+//     b.Append(30)
+//     b.FinishArray(start, offsets)
+func (b *Builder) NextElement(start int) int {
+       return b.Offset() - start
+}
+
+// FinishArray finalizes an array value in the builder.
+// The start parameter is the offset where the array begins.
+// The offsets parameter contains the offsets of each element in the array. 
See [Builder.NextElement]
+// for examples of how to use this.
+func (b *Builder) FinishArray(start int, offsets []int) error {
+       var (
+               dataSize, sz = b.buf.Len() - start, len(offsets)
+               isLarge      = sz > math.MaxUint8
+               sizeBytes    = 1
+       )
+
+       if isLarge {
+               sizeBytes = 4
+       }
+
+       if dataSize < 0 {
+               return errors.New("invalid array size")
+       }
+
+       offsetSize := intSize(dataSize)
+       headerSize := 1 + sizeBytes + (sz+1)*int(offsetSize)
+
+       // shift the just written data to make room for the header section
+       b.buf.Grow(headerSize)
+       av := b.buf.AvailableBuffer()
+       if _, err := b.buf.Write(av[:headerSize]); err != nil {
+               return err
+       }
+
+       bs := b.buf.Bytes()
+       copy(bs[start+headerSize:], bs[start:start+dataSize])
+
+       // populate the header
+       bs[start] = arrayHeader(isLarge, offsetSize)
+       writeOffset(bs[start+1:], sz, uint8(sizeBytes))
+
+       offsetsStart := start + 1 + sizeBytes
+       for i, off := range offsets {
+               writeOffset(bs[offsetsStart+i*int(offsetSize):], off, 
offsetSize)
+       }
+       writeOffset(bs[offsetsStart+sz*int(offsetSize):], dataSize, offsetSize)
+
+       return nil
+}
+
+// FieldEntry represents a field in an object, with its key, ID, and offset.
+// Usually constructed by using [Builder.NextField] and then passed to 
[Builder.FinishObject].
+type FieldEntry struct {
+       Key    string
+       ID     uint32
+       Offset int
+}
+
+// NextField creates a new field entry for an object with the given key.
+// The start parameter is the offset where the object begins. The following 
example would
+// construct a variant equivalent to `{"key1": 5, "key2": 10}`.
+//
+//     var b variant.Builder
+//     start, fields := b.Offset(), make([]variant.FieldEntry, 0)
+//     fields = append(fields, b.NextField(start, "key1"))
+//     b.Append(5)
+//     fields = append(fields, b.NextField(start, "key2"))
+//     b.Append(10)
+//     b.FinishObject(start, fields)
+//
+// This allows for creating nested objects, the following example would create 
a variant
+// equivalent to `{"key1": 5, "key2": {"key3": 10, "key4": 20}, "key5": 30}`.
+//
+//     var b variant.Builder
+//     start, fields := b.Offset(), make([]variant.FieldEntry, 0)
+//     fields = append(fields, b.NextField(start, "key1"))
+//     b.Append(5)
+//     fields = append(fields, b.NextField(start, "key2"))
+//     nestedStart, nestedFields := b.Offset(), make([]variant.FieldEntry, 0)
+//     nestedFields = append(nestedFields, b.NextField(nestedStart, "key3"))
+//     b.Append(10)
+//     nestedFields = append(nestedFields, b.NextField(nestedStart, "key4"))
+//     b.Append(20)
+//     b.FinishObject(nestedStart, nestedFields)
+//     fields = append(fields, b.NextField(start, "key5"))
+//     b.Append(30)
+//     b.FinishObject(start, fields)
+//
+// The offset value returned by this is equivalent to `b.Offset() - start`, as 
offsets are all
+// relative to the start position. The key provided will be passed to the 
[Builder.AddKey] method
+// to ensure that the key is added to the dictionary and an ID is assigned. It 
will re-use existing
+// IDs if the key already exists in the dictionary.
+func (b *Builder) NextField(start int, key string) FieldEntry {
+       id := b.AddKey(key)
+       return FieldEntry{
+               Key:    key,
+               ID:     id,
+               Offset: b.Offset() - start,
+       }
+}
+
+// FinishObject finalizes an object value in the builder.
+// The start parameter is the offset where the object begins.
+// The fields parameter contains the entries for each field in the object. See 
[Builder.NextField]
+// for examples of how to use this.
+//
+// The fields are sorted by key before finalizing the object. If duplicate 
keys are found,
+// the last value for a key is kept if [Builder.SetAllowDuplicates] is set to 
true. If false,
+// an error is returned.
+func (b *Builder) FinishObject(start int, fields []FieldEntry) error {
+       slices.SortFunc(fields, func(a, b FieldEntry) int {
+               return cmp.Compare(a.Key, b.Key)
+       })
+
+       sz := len(fields)
+       var maxID uint32
+       if sz > 0 {
+               maxID = fields[0].ID
+       }
+
+       // if a duplicate key is found, one of two things happens:
+       // - if allowDuplicates is true, then the field with the greatest
+       //    offset value (the last appended field) is kept.
+       // - if allowDuplicates is false, then an error is returned
+       if b.allowDuplicates {
+               distinctPos := 0
+               // maintain a list of distinct keys in-place
+               for i := 1; i < sz; i++ {
+                       maxID = max(maxID, fields[i].ID)
+                       if fields[i].ID == fields[i-1].ID {
+                               // found a duplicate key. keep the
+                               // field with a greater offset
+                               if fields[distinctPos].Offset < 
fields[i].Offset {
+                                       fields[distinctPos].Offset = 
fields[i].Offset
+                               }
+                       } else {
+                               // found distinct key, add field to the list
+                               distinctPos++
+                               fields[distinctPos] = fields[i]
+                       }
+               }
+
+               if distinctPos+1 < len(fields) {
+                       sz = distinctPos + 1
+                       // resize fields to size
+                       fields = fields[:sz]
+                       // sort the fields by offsets so that we can move the 
value
+                       // data of each field to the new offset without 
overwriting the
+                       // fields after it.
+                       slices.SortFunc(fields, func(a, b FieldEntry) int {
+                               return cmp.Compare(a.Offset, b.Offset)
+                       })
+
+                       buf := b.buf.Bytes()
+                       curOffset := 0
+                       for i := range sz {
+                               oldOffset := fields[i].Offset
+                               fieldSize := valueSize(buf[start+oldOffset:])
+                               copy(buf[start+curOffset:], 
buf[start+oldOffset:start+oldOffset+fieldSize])
+                               fields[i].Offset = curOffset
+                               curOffset += fieldSize
+                       }
+                       b.buf.Truncate(start + curOffset)
+                       // change back to sort order by field keys to meet 
variant spec
+                       slices.SortFunc(fields, func(a, b FieldEntry) int {
+                               return cmp.Compare(a.Key, b.Key)
+                       })
+               }
+       } else {
+               for i := 1; i < sz; i++ {
+                       maxID = max(maxID, fields[i].ID)
+                       if fields[i].Key == fields[i-1].Key {
+                               return fmt.Errorf("disallowed duplicate key 
found: %s", fields[i].Key)
+                       }
+               }
+       }
+
+       var (
+               dataSize  = b.buf.Len() - start
+               isLarge   = sz > math.MaxUint8
+               sizeBytes = 1
+       )
+
+       if isLarge {
+               sizeBytes = 4
+       }
+
+       if dataSize < 0 {
+               return errors.New("invalid object size")
+       }
+
+       idSize, offsetSize := intSize(int(maxID)), intSize(dataSize)
+       headerSize := 1 + sizeBytes + sz*int(idSize) + (sz+1)*int(offsetSize)
+       // shift the just written data to make room for the header section
+       b.buf.Grow(headerSize)
+       av := b.buf.AvailableBuffer()
+       if _, err := b.buf.Write(av[:headerSize]); err != nil {
+               return err
+       }
+
+       bs := b.buf.Bytes()
+       copy(bs[start+headerSize:], bs[start:start+dataSize])
+
+       // populate the header
+       bs[start] = objectHeader(isLarge, idSize, offsetSize)
+       writeOffset(bs[start+1:], sz, uint8(sizeBytes))
+
+       idStart := start + 1 + sizeBytes
+       offsetStart := idStart + sz*int(idSize)
+       for i, field := range fields {
+               writeOffset(bs[idStart+i*int(idSize):], int(field.ID), idSize)
+               writeOffset(bs[offsetStart+i*int(offsetSize):], field.Offset, 
offsetSize)
+       }
+       writeOffset(bs[offsetStart+sz*int(offsetSize):], dataSize, offsetSize)
+       return nil
+}
+
+// Reset truncates the builder's buffer and clears the dictionary while 
re-using the
+// underlying storage where possible. This allows for reusing the builder 
while keeping
+// the total memory usage low. The caveat to this is that any variant value 
returned
+// by calling [Builder.Build] must be cloned with [Value.Clone] before calling 
this
+// method. Otherwise, the byte slice used by the value will be invalidated 
upon calling
+// this method.
+//
+// For trivial cases where the builder is not reused, this method never needs 
to be called,
+// and the variant built by the builder gets to avoid having to copy the 
buffer, just referring
+// to it directly.
+func (b *Builder) Reset() {
+       b.buf.Reset()
+       b.dict = make(map[string]uint32)
+       for i := range b.dictKeys {
+               b.dictKeys[i] = nil
+       }
+       b.dictKeys = b.dictKeys[:0]
+}
+
+// Build creates a Variant Value from the builder's current state.
+// The returned Value includes both the value data and the metadata 
(dictionary).
+//
+// Importantly, the value data is the returned variant value is not copied 
here. This will
+// return the raw buffer data owned by the builder's buffer. If you wish to 
reuse a builder,
+// then the [Value.Clone] method must be called on the returned value to copy 
the data before
+// calling [Builder.Reset]. This enables trivial cases that don't reuse the 
builder to avoid
+// performing this copy.
+func (b *Builder) Build() (Value, error) {
+       nkeys := len(b.dictKeys)
+       totalDictSize := 0
+       for _, k := range b.dictKeys {

Review Comment:
   Is there a reason you calculate the size here instead of keeping a running 
total in `AddKey()`? Not that an extra walk through the keys is going to make 
or break performance, but for large numbers of keys, this could save you a 
couple of cycles.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

Reply via email to