zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110433238
########## parquet/variant/doc.go: ########## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: > Things also get tricky with sorting of metadata keys and building (as you can't know the index of the dictionary key until build time, which then changes an Object's index, which can then change the size of offsets, etc...). Well that's an entirely different ball of worms. Even if you use the nested object approach, things are still tricky if you want to sort the metadata keys as you end up having to patch, modify, or entirely rewrite any already written objects when you change the field index by sorting the metadata keys. (See the [rust impl](https://github.com/apache/arrow-rs/pull/7452/files#diff-83c4572d83951bc1234fd12d3cc6537b1e025188f06c39b378b0f7b9c32aa4f3)). So using the nested approach with multiple buffers doesn't help us make sorting metadata keys more efficient. > I don't think it can be done yea, that's why the API looks the way it does right now, so that we can build everything using a single buffer instead of needing multiple buffers and copying a bunch of stuff around. I've managed to sketch something out that enables using parent/child builders to manage the nested building while still requiring that the object be built in order. Kind of a compromise between the approaches, I'll try to clean it up a bit and see if it's worth pushing it as an option. > Once again, up to you. At least with child builders the parent builder can keep track of all nested things, and either error out if its children haven't called Finish(), or the parent can call each child's Finish() Yea, there's definitely benefits to having the parent builders to keep track of the nested children in terms of usability. Just trying to figure out if the code complexity is worth it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org