Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

via GitHub Tue, 27 May 2025 15:45:07 -0700


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110433238



##########
parquet/variant/doc.go:
##########
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant data type.
+//
+// The Variant type is a flexible binary format designed to represent complex nested
+// data structures with minimal overhead. It supports a wide range of primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient
+// binary representation with a separate metadata section for dictionary encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+//     var b variant.Builder
+//     b.Append(map[string]any{
+//         "id": 123,
+//         "name": "example",
+//         "data": []any{1, 2, 3},
+//     })
+//     value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+//     v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the appropriate Go type:
+//
+//     switch v.Type() {
+//     case variant.Object:
+//         obj := v.Value().(variant.ObjectValue)
+//         field, err := obj.ValueByKey("name")
+//     case variant.Array:
+//         arr := v.Value().(variant.ArrayValue)
+//         elem, err := arr.Value(0)
+//     case variant.String:
+//         s := v.Value().(string)
+//     case variant.Int64:
+//         i := v.Value().(int64)
+//     }
+//
+// You can also switch on the type of the result value from the [Value.Value] method:
+//
+//     switch val := v.Value().(type) {
+//     case nil:
+//       // ...
+//     case int32:
+//       // ...
+//     case string:
+//       // ...
+//     case variant.ArrayValue:
+//       for i, item := range val.Values() {
+//         // item is a variant.Value
+//       }
+//     case variant.ObjectValue:
+//       for k, item := range val.Values() {
+//         // k is the field key
+//         // item is a variant.Value for that field
+//       }
+//     }
+//
+// Values can also be converted to JSON:
+//
+//     jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the low-level
+// methods:
+//
+//     var b variant.Builder

Review Comment:
   >  Things also get tricky with sorting of metadata keys and building (as you 
can't know the index of the dictionary key until build time, which then changes 
an Object's index, which can then change the size of offsets, etc...).
   
   Well, that's an entirely different can of worms. Even with the nested 
object approach, sorting the metadata keys is still tricky: sorting changes 
the field indexes, so you end up having to patch, modify, or entirely rewrite 
any already-written objects. (See the 
[rust 
impl](https://github.com/apache/arrow-rs/pull/7452/files#diff-83c4572d83951bc1234fd12d3cc6537b1e025188f06c39b378b0f7b9c32aa4f3).)
 So using the nested approach with multiple buffers doesn't make sorting 
metadata keys any more efficient.
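   
   To make the patching problem concrete, here is a minimal sketch. It is not 
the arrow-go or arrow-rs code: `sortKeys` and `patchFieldIDs` are hypothetical 
names, and it assumes fixed-width field IDs held in a plain slice, so it 
ignores the size changes mentioned in the quoted comment.

```go
package main

import (
	"fmt"
	"sort"
)

// sortKeys (hypothetical helper) returns the dictionary keys in sorted order,
// plus a remap table where remap[oldID] is the ID the same key has after sorting.
func sortKeys(keys []string) (sorted []string, remap []uint32) {
	// order[newID] = oldID, sorted by the key each old ID refers to.
	order := make([]int, len(keys))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool { return keys[order[a]] < keys[order[b]] })

	sorted = make([]string, len(keys))
	remap = make([]uint32, len(keys))
	for newID, oldID := range order {
		sorted[newID] = keys[oldID]
		remap[oldID] = uint32(newID)
	}
	return sorted, remap
}

// patchFieldIDs (hypothetical helper) rewrites, in place, the field IDs of an
// object that was written before the dictionary was sorted.
func patchFieldIDs(fieldIDs []uint32, remap []uint32) {
	for i, id := range fieldIDs {
		fieldIDs[i] = remap[id]
	}
}

func main() {
	// Keys in insertion order: name=0, id=1, data=2.
	keys := []string{"name", "id", "data"}
	// An object written earlier that references all three keys by old ID.
	obj := []uint32{0, 1, 2}

	sorted, remap := sortKeys(keys)
	patchFieldIDs(obj, remap)
	fmt.Println(sorted, remap, obj)
}
```

   Every object written before the sort has to go through a pass like 
`patchFieldIDs`, whether it lives in one buffer or in several.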
   
   > I don't think it can be done
   
   Yeah, that's why the API looks the way it does right now: so that we can 
build everything in a single buffer instead of needing multiple buffers and 
copying a bunch of stuff around. 
   
   I've managed to sketch something out that uses parent/child builders to 
manage the nested building while still requiring that the object be built in 
order. It's kind of a compromise between the approaches. I'll try to clean it 
up a bit and see if it's worth pushing as an option.
   
   > Once again, up to you. At least with child builders the parent builder can 
keep track of all nested things, and either error out if its children haven't 
called Finish(), or the parent can call each child's Finish()
   
   Yeah, there are definitely usability benefits to having the parent builders 
keep track of the nested children. I'm just trying to figure out whether the 
code complexity is worth it.
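   
   As a rough illustration of that trade-off (an assumption-laden sketch, not 
the code in this PR: `NestedBuilder` and its methods are hypothetical), a 
parent can share one buffer with its child and refuse to append or finish 
while the child is still open, which also forces in-order building:

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrChildOpen = errors.New("child builder not finished")
	ErrFinished  = errors.New("builder already finished")
)

// NestedBuilder is a hypothetical builder. Parent and child write to the same
// buffer, and each parent tracks at most one open child.
type NestedBuilder struct {
	buf    *[]byte
	parent *NestedBuilder
	child  *NestedBuilder // currently open child, nil if none
	done   bool
}

func NewNestedBuilder() *NestedBuilder {
	buf := make([]byte, 0, 16)
	return &NestedBuilder{buf: &buf}
}

// check reports why this builder cannot be used right now, if it cannot.
func (b *NestedBuilder) check() error {
	if b.done {
		return ErrFinished
	}
	if b.child != nil {
		return ErrChildOpen
	}
	return nil
}

// Child opens a nested builder that writes into the shared buffer.
func (b *NestedBuilder) Child() (*NestedBuilder, error) {
	if err := b.check(); err != nil {
		return nil, err
	}
	b.child = &NestedBuilder{buf: b.buf, parent: b}
	return b.child, nil
}

// AppendByte stands in for appending an encoded value.
func (b *NestedBuilder) AppendByte(v byte) error {
	if err := b.check(); err != nil {
		return err
	}
	*b.buf = append(*b.buf, v)
	return nil
}

// Finish closes this builder and releases the parent for further writes.
func (b *NestedBuilder) Finish() error {
	if err := b.check(); err != nil {
		return err
	}
	b.done = true
	if b.parent != nil {
		b.parent.child = nil
	}
	return nil
}

func (b *NestedBuilder) Bytes() []byte { return *b.buf }

func main() {
	root := NewNestedBuilder()
	child, _ := root.Child()
	fmt.Println(root.AppendByte(1)) // rejected while the child is open
	_ = child.AppendByte(2)
	_ = child.Finish()
	_ = root.AppendByte(3)
	fmt.Println(root.Finish(), root.Bytes())
}
```

   The bookkeeping here is only the `parent`/`child` pointers and the `check` 
call in each method. The alternative mentioned in the quote, where the parent 
calls each child's Finish() itself, would need the parent to own more state.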



