Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-28 Thread via GitHub


zeroshade merged PR #344:
URL: https://github.com/apache/arrow-go/pull/344


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2112493931


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   > At the end of the day, this is your codebase and I'm just tossing in my 2c.
   
   Yup, but I still value the feedback. Especially as I am often *way* too deep 
in the weeds and will miss things that can affect usability by consumers. So 
the feedback is important.
   
   >  `Append()` does the overwhelming majority of the heavy lifting for users 
and is probably going to be the most used entry point, so this PR is fully 
featured enough (ie. it's a good hole for folk to drop into).
   
   Cool. I'll push this as is for now and then we can look at creating the 
higher-level "nicer" API for simplifying object/array construction later on. 
Another reason is that I want to get the basic variant support out and my 
"compromise" version has a lot of duplicated code in there that I wasn't able 
to get to my liking yet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.o

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-28 Thread via GitHub


sfc-gh-mbojanczyk commented on PR #344:
URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2917181016

   Can't actually toss in my approval since I'm the original author, but this 
PR looks good to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2112471286


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   > Yea, Finish() errors if the children haven't had Finish() called first.
   
   Perfect!
   
   > I think @lidavidm has a good point here. The "compromise" example I put 
together is just built on top of the existing Builder API I've put together by 
creating the "higher-level" API on top of the "lower-level" one that I've 
already built. Any objections to pushing through what we've already got here 
and then following up with the higher-level builder that handles the offsets 
and fields for you as a separate PR afterwards @sfc-gh-mbojanczyk?
   
   At the end of the day, this is your codebase and I'm just tossing in my 2c. 
In my experience, users are going to fall into holes- it's the library author's 
job to lead them to the correct one :) `Append()` does the overwhelming 
majority of the heavy lifting for users and is probably going to be the most 
used entry point, so this PR is fully featured enough (ie. it's a good hole for 
folk to drop into).



-- 
This is an automated message from the Apache Git Service.
To respond to 

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2112139261


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   > I like it. I assume you'll do some handling if the parent has Finish() 
called without the children's Finish() (either panic, or just call their 
Finish() methods, right?
   
   Yea, `Finish()` errors if the children haven't had `Finish()` called first.
   
   > You can also still just directly append a map or slice instead of keeping 
track manually - would that solve your use case? (Also, presumably we can add 
the higher level builders separately/on top of this API? Does it have to be 
built into the builder for this first pass?)
   
   I think @lidavidm has a good point here. The "compromise" example I put 
together is just built on top of the existing Builder API I've put together by 
creating the "higher-level" API on top of the "lower-level" one that I've 
already built. Any objections to pushing through what we've already got here 
and then following up with the higher-level builder that handles the offsets 
and fields for you as a separate PR afterwards @sfc-gh-mbojanczyk?



-- 
This is an automated message from the Apache Git

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


lidavidm commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110573898


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   You can also still just directly append a map or slice instead of keeping 
track manually - would that solve your use case? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


lidavidm commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110574589


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   (Also, presumably we can add the higher level builders separately/on top of 
this API? Does it have to be built into the builder for this first pass?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110554283


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   I like it. I assume you'll do some handling if the parent has `Finish()` 
called without the children's `Finish()` (either panic, or just call their 
`Finish()` methods, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110486245


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   Okay, what do you think of something like this:
   
   ```go
var b variant.Builder
   
obj, err := b.NewObject()
   // handle err
   
obj.AppendInt("id", 1)
   
observe, err := obj.AppendObject("observation")
   // handle err
   
observe.AppendString("location", "In the Volcano")
observe.AppendString("time", "12:34:56")
   
value, err := observe.AppendObject("value")
// handle err
   
value.AppendInt("humidity", 456)
value.AppendInt("temperature", 123)
value.Finish()
observe.Finish()
   
species, err := obj.AppendObject("species")
// handle err
species.AppendString("name", "lava monster")
species.AppendInt("population", 6789)
species.Finish()
   
obj.Finish()
   v, err := b.Build()
// handle err
out, err := json.Marshal(v) 
/* result is this `{
"id": 1,
"observation": {
"location": "In the V

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110433238


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   >  Things also get tricky with sorting of metadata keys and building (as you 
can't know the index of the dictionary key until build time, which then changes 
an Object's index, which can then change the size of offsets, etc...).
   
   Well that's an entirely different ball of worms. Even if you use the nested 
object approach, things are still tricky if you want to sort the metadata keys 
as you end up having to patch, modify, or entirely rewrite any already written 
objects when you change the field index by sorting the metadata keys. (See the 
[rust 
impl](https://github.com/apache/arrow-rs/pull/7452/files#diff-83c4572d83951bc1234fd12d3cc6537b1e025188f06c39b378b0f7b9c32aa4f3)).
 So using the nested approach with multiple buffers doesn't help us make 
sorting metadata keys more efficient.
   
   > I don't think it can be done
   
   yea, that's why the API looks the way it does right now, so that we can 
build everything using a single buffer instead of needing multiple buffers and 
copying a bunch of stuff around. 
   
   I've managed

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110236886


##
parquet/variant/variant.go:
##
@@ -0,0 +1,722 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "encoding/binary"
+   "encoding/json"
+   "errors"
+   "fmt"
+   "iter"
+   "maps"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/apache/arrow-go/v18/arrow/decimal128"
+   "github.com/apache/arrow-go/v18/parquet/internal/debug"
+   "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+   BasicUndefined   BasicType = iota - 1 // Unknown
+   BasicPrimitive// Primitive
+   BasicShortString  // ShortString
+   BasicObject   // Object
+   BasicArray// Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+   // because we're doing hdr & 0x3, it is impossible for the result
+   // to be outside of the range of BasicType. Therefore, we don't
+   // need to perform any checks. The value will always be [0,3]
+   return BasicType(hdr & basicTypeMask)
+}
+
+// PrimitiveType represents specific primitive data types within the variant 
format.
+type PrimitiveType int
+
+const (
+   PrimitiveInvalidPrimitiveType = iota - 1 // Unknown

Review Comment:
   Oh right! I think I ended up explicitly defining each primitive value (ie. 
`primitiveNull = 0`, `primitiveFalse = 1`, etc...) because I've been bitten by 
relying on `iota` in the past (a well meaning coworker reordered my enum to be 
in alphabetical order and a bunch of tests blew up). IMO, if it's defined in 
spec, it's safer to be explicit instead of using an enum.



##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Ar

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110077872


##
parquet/variant/variant.go:
##
@@ -0,0 +1,722 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "encoding/binary"
+   "encoding/json"
+   "errors"
+   "fmt"
+   "iter"
+   "maps"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/apache/arrow-go/v18/arrow/decimal128"
+   "github.com/apache/arrow-go/v18/parquet/internal/debug"
+   "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+   BasicUndefined   BasicType = iota - 1 // Unknown
+   BasicPrimitive// Primitive
+   BasicShortString  // ShortString
+   BasicObject   // Object
+   BasicArray// Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+   // because we're doing hdr & 0x3, it is impossible for the result
+   // to be outside of the range of BasicType. Therefore, we don't
+   // need to perform any checks. The value will always be [0,3]
+   return BasicType(hdr & basicTypeMask)
+}
+
+// PrimitiveType represents specific primitive data types within the variant 
format.
+type PrimitiveType int
+
+const (
+   PrimitiveInvalidPrimitiveType = iota - 1 // Unknown

Review Comment:
   These constants have specific values defined by the spec, so the zero value 
needs to be Null as the constant for "Null" type is 0. I don't really have 
wiggle room to change that. Though, technically if all variant building is done 
through the builder, we could probably get away without exporting this enum and 
these constants.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110087873


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   If we instead use a `StartObject`/`StartArray` pattern to construct child 
builders, then the user ends up having to keep track of the child builders 
instead of keeping track of the offests, so I'm not sure the added complexity 
of having to check and enforce that all the nested builders are completed 
before calling `Finish`/`Build` is worth it or saves that much. 
   
   I also couldn't think of a clean way to keep the ability to build the whole 
object with a single buffer (instead of multiple buffers) while handling the 
nested builders and enforcing all the checks without a ton of repetitive code. 
I'll think about this a bit more and see if I can come up with something I like.
   
   @lidavid @mapleFU  what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110087873


##
parquet/variant/doc.go:
##
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package variant provides an implementation of the Apache Parquet Variant 
data type.
+//
+// The Variant type is a flexible binary format designed to represent complex 
nested
+// data structures with minimal overhead. It supports a wide range of 
primitive types
+// as well as nested arrays and objects (similar to JSON). The format uses a 
memory-efficient
+// binary representation with a separate metadata section for dictionary 
encoding of keys.
+//
+// # Key Components
+//
+// - [Value]: The primary type representing a variant value
+// - [Metadata]: Contains information about the dictionary of keys
+// - [Builder]: Used to construct variant values
+//
+// # Format Overview
+//
+// The variant format consists of two parts:
+//
+//  1. Metadata: A dictionary of keys used in objects
+//  2. Value: The actual data payload
+//
+// Values can be one of the following types:
+//
+//   - Primitive values (null, bool, int8/16/32/64, float32/64, etc.)
+//   - Short strings (less than 64 bytes)
+//   - Long strings and binary data
+//   - Date, time and timestamp values
+//   - Decimal values (4, 8, or 16 bytes)
+//   - Arrays of any variant value
+//   - Objects with key-value pairs
+//
+// # Working with Variants
+//
+// To create a variant value, use the Builder:
+//
+// var b variant.Builder
+// b.Append(map[string]any{
+// "id": 123,
+// "name": "example",
+// "data": []any{1, 2, 3},
+// })
+// value, err := b.Build()
+//
+// To parse an existing variant value:
+//
+// v, err := variant.New(metadataBytes, valueBytes)
+//
+// You can access the data using the [Value.Value] method which returns the 
appropriate Go type:
+//
+// switch v.Type() {
+// case variant.Object:
+// obj := v.Value().(variant.ObjectValue)
+// field, err := obj.ValueByKey("name")
+// case variant.Array:
+// arr := v.Value().(variant.ArrayValue)
+// elem, err := arr.Value(0)
+// case variant.String:
+// s := v.Value().(string)
+// case variant.Int64:
+// i := v.Value().(int64)
+// }
+//
+// You can also switch on the type of the result value from the [Value.Value] 
method:
+//
+// switch val := v.Value().(type) {
+// case nil:
+//   // ...
+// case int32:
+//   // ...
+// case string:
+//   // ...
+// case variant.ArrayValue:
+//   for i, item := range val.Values() {
+// // item is a variant.Value
+//   }
+// case variant.ObjectValue:
+//   for k, item := range val.Values() {
+// // k is the field key
+// // item is a variant.Value for that field
+//   }
+// }
+//
+// Values can also be converted to JSON:
+//
+// jsonBytes, err := json.Marshal(v)
+//
+// # Low-level Construction
+//
+// For direct construction of complex nested structures, you can use the 
low-level
+// methods:
+//
+// var b variant.Builder

Review Comment:
   If we instead use a `StartObject`/`StartArray` pattern to construct child 
builders, then the user ends up having to keep track of the child builders 
instead of keeping track of the offests, so I'm not sure the added complexity 
of having to check and enforce that all the nested builders are completed 
before calling `Finish`/`Build` is worth it or saves that much. 
   
   I also couldn't think of a clean way to keep the ability to build the whole 
object with a single buffer (instead of multiple buffers) while handling the 
nested builders and enforcing all the checks without a ton of repetitive code. 
I'll think about this a bit more and see if I can come up with something I like.
   
   @lidavid @wgtmac what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.or

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110080419


##
parquet/variant/basic_type_string.go:
##


Review Comment:
   sure, i'll update the go-generate lines and have it rename the files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110079492


##
parquet/variant/builder.go:
##
@@ -0,0 +1,852 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "cmp"
+   "encoding/binary"
+   "errors"
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/google/uuid"
+)
+
+// Builder is used to construct Variant values by appending data of various 
types.
+// It manages an internal buffer for the value data and a dictionary for field 
keys.
+type Builder struct {
+   buf bytes.Buffer
+   dictmap[string]uint32
+   dictKeys[][]byte
+   allowDuplicates bool
+}
+
+// SetAllowDuplicates controls whether duplicate keys are allowed in objects.
+// When true, the last value for a key is used. When false, an error is 
returned
+// if a duplicate key is detected.
+func (b *Builder) SetAllowDuplicates(allow bool) {
+   b.allowDuplicates = allow
+}
+
+// AddKey adds a key to the builder's dictionary and returns its ID.
+// If the key already exists in the dictionary, its existing ID is returned.
+func (b *Builder) AddKey(key string) (id uint32) {
+   if b.dict == nil {
+   b.dict = make(map[string]uint32)
+   b.dictKeys = make([][]byte, 0, 16)
+   }
+
+   var ok bool
+   if id, ok = b.dict[key]; ok {
+   return id
+   }
+
+   id = uint32(len(b.dictKeys))
+   b.dict[key] = id
+   b.dictKeys = append(b.dictKeys, unsafe.Slice(unsafe.StringData(key), 
len(key)))
+
+   return id
+}
+
+// AppendOpt represents options for appending time-related values. These are 
only
+// used when using the generic Append method that takes an interface{}.
+type AppendOpt int16
+
+const (
+   // OptTimestampNano specifies that timestamps should use nanosecond 
precision,
+   // otherwise microsecond precision is used.
+   OptTimestampNano AppendOpt = 1 << iota
+   // OptTimestampUTC specifies that timestamps should be in UTC timezone, 
otherwise
+   // no time zone (NTZ) is used.
+   OptTimestampUTC
+   // OptTimeAsDate specifies that time.Time values should be encoded as 
dates
+   OptTimeAsDate
+   // OptTimeAsTime specifies that time.Time values should be encoded as a 
time value
+   OptTimeAsTime
+)
+
+func extractFieldInfo(f reflect.StructField) (name string, o AppendOpt) {
+   tag := f.Tag.Get("variant")
+   if tag == "" {
+   return f.Name, 0
+   }
+
+   parts := strings.Split(tag, ",")
+   if len(parts) == 1 {
+   return parts[0], 0
+   }
+
+   name = parts[0]
+   if name == "" {
+   name = f.Name
+   }
+
+   for _, opt := range parts[1:] {
+   switch strings.ToLower(opt) {
+   case "nanos":
+   o |= OptTimestampNano
+   case "utc":
+   o |= OptTimestampUTC
+   case "date":
+   o |= OptTimeAsDate
+   case "time":
+   o |= OptTimeAsTime
+   }
+   }
+
+   return name, o
+}
+
+// Append adds a value of any supported type to the builder.
+//
+// Any basic primitive type is supported, the AppendOpt options are used to 
control how
+// timestamps are appended (e.g., as microseconds or nanoseconds and 
timezone). The options
+// also control how a [time.Time] value is appended (e.g., as a date, 
timestamp, or time).
+//
+// Appending a value with type `[]any` will construct an array appropriately, 
appending
+// each element. Calling with a map[string]any will construct an object, 
recursively calling
+// Append for each value, propagating the options.
+//
+// For other types (arbitrary slices, arrays, maps and structs), reflection is 
used to determine
+// the type and whether we can append it. A nil pointer will append a null, 
while a non-nil
+// pointer will append the va

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2109823794


##
parquet/variant/variant.go:
##
@@ -0,0 +1,722 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "encoding/binary"
+   "encoding/json"
+   "errors"
+   "fmt"
+   "iter"
+   "maps"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/apache/arrow-go/v18/arrow/decimal128"
+   "github.com/apache/arrow-go/v18/parquet/internal/debug"
+   "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+   BasicUndefined   BasicType = iota - 1 // Unknown
+   BasicPrimitive// Primitive
+   BasicShortString  // ShortString
+   BasicObject   // Object
+   BasicArray// Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+   // because we're doing hdr & 0x3, it is impossible for the result
+   // to be outside of the range of BasicType. Therefore, we don't
+   // need to perform any checks. The value will always be [0,3]
+   return BasicType(hdr & basicTypeMask)
+}
+
+// PrimitiveType represents specific primitive data types within the variant 
format.
+type PrimitiveType int
+
+const (
+   PrimitiveInvalidPrimitiveType = iota - 1 // Unknown
+   PrimitiveNull// Null
+   PrimitiveBoolTrue// BoolTrue
+   PrimitiveBoolFalse   // BoolFalse
+   PrimitiveInt8// Int8
+   PrimitiveInt16   // Int16
+   PrimitiveInt32   // Int32
+   PrimitiveInt64   // Int64
+   PrimitiveDouble  // Double
+   PrimitiveDecimal4// Decimal32
+   PrimitiveDecimal8// Decimal64
+   PrimitiveDecimal16   // Decimal128
+   PrimitiveDate// Date
+   PrimitiveTimestampMicros // 
Timestamp(micros)
+   PrimitiveTimestampMicrosNTZ  // 
TimestampNTZ(micros)
+   PrimitiveFloat   // Float
+   PrimitiveBinary  // Binary
+   PrimitiveString  // String
+   PrimitiveTimeMicrosNTZ   // TimeNTZ(micros)
+   PrimitiveTimestampNanos  // Timestamp(nanos)
+   PrimitiveTimestampNanosNTZ   // 
TimestampNTZ(nanos)
+   PrimitiveUUID// UUID
+)
+
+func primitiveTypeFromHeader(hdr byte) PrimitiveType {
+   return PrimitiveType((hdr >> basicTypeBits) & typeInfoMask)
+}
+
+// Type represents the high-level variant data type.
+// This is what applications typically use to identify the type of a variant 
value.
+type Type int
+
+const (
+   Object Type = iota
+   Array
+   Null
+   Bool
+   Int8
+   Int16
+   Int32
+   Int64
+   String
+   Double
+   Decimal4
+   Decimal8
+   Decimal16
+   Date
+   TimestampMicros
+   TimestampMicrosNTZ
+   Float
+   Binary
+   Time
+   TimestampNanos
+   TimestampNanosNTZ
+   UUID
+)
+
+const (
+   versionMaskuint8 = 0x0F
+   sortedStrMask  uint8 = 0b1
+   basicTypeMask  uint8 = 0x3
+   basicTypeBits  uint8 = 2
+   t

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-27 Thread via GitHub


zeroshade commented on PR #344:
URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2913095614

   I'll give @sfc-gh-mbojanczyk a chance to comment and respond here before I 
merge this just to make sure I get all the feedback I can.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


lidavidm commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2106054776


##
parquet/variant/variant_test.go:
##


Review Comment:
   Thanks. I wonder if fuzzing the variants specifically would be a more 
manageable case to start with.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105906717


##
parquet/variant/variant.go:
##
@@ -0,0 +1,725 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "encoding/binary"
+   "encoding/json"
+   "errors"
+   "fmt"
+   "iter"
+   "maps"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/apache/arrow-go/v18/arrow/decimal128"
+   "github.com/apache/arrow-go/v18/parquet/internal/debug"
+   "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+   BasicUndefined   BasicType = iota - 1 // Unknown
+   BasicPrimitive// Primitive
+   BasicShortString  // ShortString
+   BasicObject   // Object
+   BasicArray// Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+   return BasicType(hdr & basicTypeMask)

Review Comment:
   turns out there's no need for a `debug.Assert` here, because we're doing `& 
basicTypeMask` we're guaranteed the result is going to be in the correct range 
for the `BasicType` enum: `[0,3]`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105891795


##
parquet/variant/variant_test.go:
##


Review Comment:
   Currently the Go implementation doesn't have any fuzzing set up. Go has a 
whole infrastructure for setting up fuzz testing 
(https://go.dev/doc/security/fuzz/) I just haven't gotten around to setting it 
up. It just hasn't been on a high priority given everything else unless people 
think I should prioritize it.
   
   That said, I think it makes sense to test some examples of invalid variants. 
I'll add some tests for that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105891543


##
parquet/variant/builder.go:
##
@@ -0,0 +1,847 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "cmp"
+   "encoding/binary"
+   "errors"
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/google/uuid"
+)
+
+// Builder is used to construct Variant values by appending data of various 
types.
+// It manages an internal buffer for the value data and a dictionary for field 
keys.
+type Builder struct {
+   buf bytes.Buffer
+   dictmap[string]uint32
+   dictKeys[][]byte
+   allowDuplicates bool
+}
+
+// SetAllowDuplicates controls whether duplicate keys are allowed in objects.
+// When true, the last value for a key is used. When false, an error is 
returned
+// if a duplicate key is detected.
+func (b *Builder) SetAllowDuplicates(allow bool) {
+   b.allowDuplicates = allow
+}

Review Comment:
   I don't think so. I think we probably want to see where the whole ecosystem 
falls on this and then see if we can get all the implementations to align. I 
took this behavior from the spark implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105891187


##
parquet/variant/variant.go:
##
@@ -0,0 +1,725 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "encoding/binary"
+   "encoding/json"
+   "errors"
+   "fmt"
+   "iter"
+   "maps"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/apache/arrow-go/v18/arrow/decimal128"
+   "github.com/apache/arrow-go/v18/parquet/internal/debug"
+   "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+   BasicUndefined   BasicType = iota - 1 // Unknown
+   BasicPrimitive// Primitive
+   BasicShortString  // ShortString
+   BasicObject   // Object
+   BasicArray// Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+   return BasicType(hdr & basicTypeMask)

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105890998


##
parquet/variant/utils.go:
##


Review Comment:
   I don't think so. All of the functions in utils.go are already covered by 
the existing unit test coverage, so I don't see any need for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-24 Thread via GitHub


lidavidm commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105717305


##
parquet/variant/variant.go:
##
@@ -0,0 +1,725 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "encoding/binary"
+   "encoding/json"
+   "errors"
+   "fmt"
+   "iter"
+   "maps"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/apache/arrow-go/v18/arrow/decimal128"
+   "github.com/apache/arrow-go/v18/parquet/internal/debug"
+   "github.com/google/uuid"
+)
+
+//go:generate go tool stringer -type=BasicType -linecomment 
-output=basic_type_string.go
+//go:generate go tool stringer -type=PrimitiveType -linecomment 
-output=primitive_type_string.go
+
+// BasicType represents the fundamental type category of a variant value.
+type BasicType int
+
+const (
+   BasicUndefined   BasicType = iota - 1 // Unknown
+   BasicPrimitive// Primitive
+   BasicShortString  // ShortString
+   BasicObject   // Object
+   BasicArray// Array
+)
+
+func basicTypeFromHeader(hdr byte) BasicType {
+   return BasicType(hdr & basicTypeMask)

Review Comment:
   worth having a debug.Assert perhaps?



##
parquet/variant/builder.go:
##
@@ -0,0 +1,847 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variant
+
+import (
+   "bytes"
+   "cmp"
+   "encoding/binary"
+   "errors"
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "slices"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/apache/arrow-go/v18/arrow"
+   "github.com/apache/arrow-go/v18/arrow/decimal"
+   "github.com/google/uuid"
+)
+
+// Builder is used to construct Variant values by appending data of various 
types.
+// It manages an internal buffer for the value data and a dictionary for field 
keys.
+type Builder struct {
+   buf bytes.Buffer
+   dictmap[string]uint32
+   dictKeys[][]byte
+   allowDuplicates bool
+}
+
+// SetAllowDuplicates controls whether duplicate keys are allowed in objects.
+// When true, the last value for a key is used. When false, an error is 
returned
+// if a duplicate key is detected.
+func (b *Builder) SetAllowDuplicates(allow bool) {
+   b.allowDuplicates = allow
+}
+
+// AddKey adds a key to the builder's dictionary and returns its ID.
+// If the key already exists in the dictionary, its existing ID is returned.
+func (b *Builder) AddKey(key string) (id uint32) {
+   if b.dict == nil {
+   b.dict = make(map[string]uint32)
+   b.dictKeys = make([][]byte, 0, 16)
+   }
+
+   var ok bool
+   if id, ok = b.dict[key]; ok {
+   return id
+   }
+
+   id = uint32(len(b.dictKeys))
+   b.dict[key] = id
+   b.dictKeys = append(b.dictKeys, unsafe.Slice(unsafe.StringData(key), 
len(key)))
+
+   return id
+}
+
+// AppendOpt represents options for appending time-related values. These are 
only
+// used when using the generic Append method that takes an interface{}.
+type AppendOpt int16
+
+const (
+   // OptTimestampNano specifies that timestamps should use nanosecond 
precision,
+ 

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-23 Thread via GitHub


sfc-gh-mbojanczyk commented on PR #344:
URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2906037583

   > @sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through 
multiple iterations and discussions with people, plus pulling inspiration from 
the C++, Spark, and parquet-java implementations to figure out a good design. 
Please let me know what you think of the updated version! Thanks!
   
   No worries- I got caught in a whirlwind over here myself and was about to 
dust this off too :) Lemme take a peek here after the long weekend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-05-23 Thread via GitHub


zeroshade commented on PR #344:
URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2905641862

   @sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through 
multiple iterations and discussions with people, plus pulling inspiration from 
the C++, Spark, and parquet-java implementations to figure out a good design. 
Please let me know what you think of the updated version! Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064964717


##
parquet/variants/primitive.go:
##
@@ -0,0 +1,678 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "strings"
+   "time"
+)
+
+// Variant primitive type IDs.
+type primitiveType int
+
+const (
+   primitiveInvalidprimitiveType = -1
+   primitiveNull   primitiveType = 0
+   primitiveTrue   primitiveType = 1
+   primitiveFalse  primitiveType = 2
+   primitiveInt8   primitiveType = 3
+   primitiveInt16  primitiveType = 4
+   primitiveInt32  primitiveType = 5
+   primitiveInt64  primitiveType = 6
+   primitiveDouble primitiveType = 7
+   primitiveDecimal4   primitiveType = 8  // TODO
+   primitiveDecimal8   primitiveType = 9  // TODO
+   primitiveDecimal16  primitiveType = 10 // TODO
+   primitiveDate   primitiveType = 11
+   primitiveTimestampMicrosprimitiveType = 12
+   primitiveTimestampNTZMicros primitiveType = 13
+   primitiveFloat  primitiveType = 14
+   primitiveBinary primitiveType = 15
+   primitiveString primitiveType = 16
+   primitiveTimeNTZprimitiveType = 17
+   primitiveTimestampNanos primitiveType = 18
+   primitiveTimestampNTZNanos  primitiveType = 19
+   primitiveUUID   primitiveType = 20
+)
+
+func (pt primitiveType) String() string {
+   switch pt {
+   case primitiveNull:
+   return "Null"
+   case primitiveFalse, primitiveTrue:
+   return "Boolean"
+   case primitiveInt8:
+   return "Int8"
+   case primitiveInt16:
+   return "Int16"
+   case primitiveInt32:
+   return "Int32"
+   case primitiveInt64:
+   return "Int64"
+   case primitiveDouble:
+   return "Double"
+   case primitiveDecimal4:
+   return "Decimal4"
+   case primitiveDecimal8:
+   return "Decimal8"
+   case primitiveDecimal16:
+   return "Decimal16"
+   case primitiveDate:
+   return "Date"
+   case primitiveTimestampMicros:
+   return "Timestamp(micros)"
+   case primitiveTimestampNTZMicros:
+   return "TimestampNTZ(micros)"
+   case primitiveFloat:
+   return "Float"
+   case primitiveBinary:
+   return "Binary"
+   case primitiveString:
+   return "String"
+   case primitiveTimeNTZ:
+   return "TimeNTZ"
+   case primitiveTimestampNanos:
+   return "Timestamp(nanos)"
+   case primitiveTimestampNTZNanos:
+   return "TimestampNTZ(nanos)"
+   case primitiveUUID:
+   return "UUID"
+   }
+   return "Invalid"
+}
+
+func validPrimitiveValue(prim primitiveType) error {
+   if prim < primitiveNull || prim > primitiveUUID {
+   return fmt.Errorf("invalid primitive type: %d", prim)
+   }
+   return nil
+}
+
+func primitiveFromHeader(hdr byte) (primitiveType, error) {
+   // Special case the basic type of Short String and call it a Primitive 
String.
+   bt := BasicTypeFromHeader(hdr)
+   if bt == BasicShortString {
+   return primitiveString, nil
+   } else if bt == BasicPrimitive {
+   prim := primitiveType(hdr >> 2)
+   if err := validPrimitiveValue(prim); err != nil {
+   return primitiveInvalid, err
+   }
+   return prim, nil
+   }
+   return primitiveInvalid, fmt.Errorf("header is not of a primitive or 
short string basic type: %s", bt)
+}
+
+func primitiveHeader(prim primitiveType) (byte, error) {
+   if err := validPrimitiveValue(prim); err != nil {
+   return 0, err
+   }
+   hdr := byte(prim << 2)
+   hdr |= byte(BasicPrimitive)
+   return hdr, nil
+}
+
+// marshalPri

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064947485


##
parquet/variants/primitive.go:
##
@@ -0,0 +1,678 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "strings"
+   "time"
+)
+
+// Variant primitive type IDs.
+type primitiveType int
+
+const (
+   primitiveInvalidprimitiveType = -1
+   primitiveNull   primitiveType = 0
+   primitiveTrue   primitiveType = 1
+   primitiveFalse  primitiveType = 2
+   primitiveInt8   primitiveType = 3
+   primitiveInt16  primitiveType = 4
+   primitiveInt32  primitiveType = 5
+   primitiveInt64  primitiveType = 6
+   primitiveDouble primitiveType = 7
+   primitiveDecimal4   primitiveType = 8  // TODO
+   primitiveDecimal8   primitiveType = 9  // TODO
+   primitiveDecimal16  primitiveType = 10 // TODO
+   primitiveDate   primitiveType = 11
+   primitiveTimestampMicrosprimitiveType = 12
+   primitiveTimestampNTZMicros primitiveType = 13
+   primitiveFloat  primitiveType = 14
+   primitiveBinary primitiveType = 15
+   primitiveString primitiveType = 16
+   primitiveTimeNTZprimitiveType = 17
+   primitiveTimestampNanos primitiveType = 18
+   primitiveTimestampNTZNanos  primitiveType = 19
+   primitiveUUID   primitiveType = 20
+)
+
+func (pt primitiveType) String() string {
+   switch pt {
+   case primitiveNull:
+   return "Null"
+   case primitiveFalse, primitiveTrue:
+   return "Boolean"
+   case primitiveInt8:
+   return "Int8"
+   case primitiveInt16:
+   return "Int16"
+   case primitiveInt32:
+   return "Int32"
+   case primitiveInt64:
+   return "Int64"
+   case primitiveDouble:
+   return "Double"
+   case primitiveDecimal4:
+   return "Decimal4"
+   case primitiveDecimal8:
+   return "Decimal8"
+   case primitiveDecimal16:
+   return "Decimal16"
+   case primitiveDate:
+   return "Date"
+   case primitiveTimestampMicros:
+   return "Timestamp(micros)"
+   case primitiveTimestampNTZMicros:
+   return "TimestampNTZ(micros)"
+   case primitiveFloat:
+   return "Float"
+   case primitiveBinary:
+   return "Binary"
+   case primitiveString:
+   return "String"
+   case primitiveTimeNTZ:
+   return "TimeNTZ"
+   case primitiveTimestampNanos:
+   return "Timestamp(nanos)"
+   case primitiveTimestampNTZNanos:
+   return "TimestampNTZ(nanos)"
+   case primitiveUUID:
+   return "UUID"
+   }
+   return "Invalid"
+}
+
+func validPrimitiveValue(prim primitiveType) error {
+   if prim < primitiveNull || prim > primitiveUUID {
+   return fmt.Errorf("invalid primitive type: %d", prim)
+   }
+   return nil
+}
+
+func primitiveFromHeader(hdr byte) (primitiveType, error) {
+   // Special case the basic type of Short String and call it a Primitive 
String.
+   bt := BasicTypeFromHeader(hdr)
+   if bt == BasicShortString {
+   return primitiveString, nil
+   } else if bt == BasicPrimitive {
+   prim := primitiveType(hdr >> 2)
+   if err := validPrimitiveValue(prim); err != nil {
+   return primitiveInvalid, err
+   }
+   return prim, nil
+   }
+   return primitiveInvalid, fmt.Errorf("header is not of a primitive or 
short string basic type: %s", bt)
+}
+
+func primitiveHeader(prim primitiveType) (byte, error) {
+   if err := validPrimitiveValue(prim); err != nil {
+   return 0, err
+   }
+   hdr := byte(prim << 2)
+   hdr |= byte(BasicPrimitive)
+   return hdr, nil
+}
+
+// ma

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064947485


##
parquet/variants/primitive.go:
##
@@ -0,0 +1,678 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "strings"
+   "time"
+)
+
+// Variant primitive type IDs.
+type primitiveType int
+
+const (
+   primitiveInvalidprimitiveType = -1
+   primitiveNull   primitiveType = 0
+   primitiveTrue   primitiveType = 1
+   primitiveFalse  primitiveType = 2
+   primitiveInt8   primitiveType = 3
+   primitiveInt16  primitiveType = 4
+   primitiveInt32  primitiveType = 5
+   primitiveInt64  primitiveType = 6
+   primitiveDouble primitiveType = 7
+   primitiveDecimal4   primitiveType = 8  // TODO
+   primitiveDecimal8   primitiveType = 9  // TODO
+   primitiveDecimal16  primitiveType = 10 // TODO
+   primitiveDate   primitiveType = 11
+   primitiveTimestampMicrosprimitiveType = 12
+   primitiveTimestampNTZMicros primitiveType = 13
+   primitiveFloat  primitiveType = 14
+   primitiveBinary primitiveType = 15
+   primitiveString primitiveType = 16
+   primitiveTimeNTZprimitiveType = 17
+   primitiveTimestampNanos primitiveType = 18
+   primitiveTimestampNTZNanos  primitiveType = 19
+   primitiveUUID   primitiveType = 20
+)
+
+func (pt primitiveType) String() string {
+   switch pt {
+   case primitiveNull:
+   return "Null"
+   case primitiveFalse, primitiveTrue:
+   return "Boolean"
+   case primitiveInt8:
+   return "Int8"
+   case primitiveInt16:
+   return "Int16"
+   case primitiveInt32:
+   return "Int32"
+   case primitiveInt64:
+   return "Int64"
+   case primitiveDouble:
+   return "Double"
+   case primitiveDecimal4:
+   return "Decimal4"
+   case primitiveDecimal8:
+   return "Decimal8"
+   case primitiveDecimal16:
+   return "Decimal16"
+   case primitiveDate:
+   return "Date"
+   case primitiveTimestampMicros:
+   return "Timestamp(micros)"
+   case primitiveTimestampNTZMicros:
+   return "TimestampNTZ(micros)"
+   case primitiveFloat:
+   return "Float"
+   case primitiveBinary:
+   return "Binary"
+   case primitiveString:
+   return "String"
+   case primitiveTimeNTZ:
+   return "TimeNTZ"
+   case primitiveTimestampNanos:
+   return "Timestamp(nanos)"
+   case primitiveTimestampNTZNanos:
+   return "TimestampNTZ(nanos)"
+   case primitiveUUID:
+   return "UUID"
+   }
+   return "Invalid"
+}
+
+func validPrimitiveValue(prim primitiveType) error {
+   if prim < primitiveNull || prim > primitiveUUID {
+   return fmt.Errorf("invalid primitive type: %d", prim)
+   }
+   return nil
+}
+
+func primitiveFromHeader(hdr byte) (primitiveType, error) {
+   // Special case the basic type of Short String and call it a Primitive 
String.
+   bt := BasicTypeFromHeader(hdr)
+   if bt == BasicShortString {
+   return primitiveString, nil
+   } else if bt == BasicPrimitive {
+   prim := primitiveType(hdr >> 2)
+   if err := validPrimitiveValue(prim); err != nil {
+   return primitiveInvalid, err
+   }
+   return prim, nil
+   }
+   return primitiveInvalid, fmt.Errorf("header is not of a primitive or 
short string basic type: %s", bt)
+}
+
+func primitiveHeader(prim primitiveType) (byte, error) {
+   if err := validPrimitiveValue(prim); err != nil {
+   return 0, err
+   }
+   hdr := byte(prim << 2)
+   hdr |= byte(BasicPrimitive)
+   return hdr, nil
+}
+
+// ma

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064915823


##
parquet/variants/primitive.go:
##
@@ -0,0 +1,678 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "strings"
+   "time"
+)
+
+// Variant primitive type IDs.
+type primitiveType int
+
+const (
+   primitiveInvalidprimitiveType = -1
+   primitiveNull   primitiveType = 0
+   primitiveTrue   primitiveType = 1
+   primitiveFalse  primitiveType = 2
+   primitiveInt8   primitiveType = 3
+   primitiveInt16  primitiveType = 4
+   primitiveInt32  primitiveType = 5
+   primitiveInt64  primitiveType = 6
+   primitiveDouble primitiveType = 7
+   primitiveDecimal4   primitiveType = 8  // TODO
+   primitiveDecimal8   primitiveType = 9  // TODO
+   primitiveDecimal16  primitiveType = 10 // TODO
+   primitiveDate   primitiveType = 11
+   primitiveTimestampMicrosprimitiveType = 12
+   primitiveTimestampNTZMicros primitiveType = 13
+   primitiveFloat  primitiveType = 14
+   primitiveBinary primitiveType = 15
+   primitiveString primitiveType = 16
+   primitiveTimeNTZprimitiveType = 17
+   primitiveTimestampNanos primitiveType = 18
+   primitiveTimestampNTZNanos  primitiveType = 19
+   primitiveUUID   primitiveType = 20
+)
+
+func (pt primitiveType) String() string {
+   switch pt {
+   case primitiveNull:
+   return "Null"
+   case primitiveFalse, primitiveTrue:
+   return "Boolean"
+   case primitiveInt8:
+   return "Int8"
+   case primitiveInt16:
+   return "Int16"
+   case primitiveInt32:
+   return "Int32"
+   case primitiveInt64:
+   return "Int64"
+   case primitiveDouble:
+   return "Double"
+   case primitiveDecimal4:
+   return "Decimal4"
+   case primitiveDecimal8:
+   return "Decimal8"
+   case primitiveDecimal16:
+   return "Decimal16"
+   case primitiveDate:
+   return "Date"
+   case primitiveTimestampMicros:
+   return "Timestamp(micros)"
+   case primitiveTimestampNTZMicros:
+   return "TimestampNTZ(micros)"
+   case primitiveFloat:
+   return "Float"
+   case primitiveBinary:
+   return "Binary"
+   case primitiveString:
+   return "String"
+   case primitiveTimeNTZ:
+   return "TimeNTZ"
+   case primitiveTimestampNanos:
+   return "Timestamp(nanos)"
+   case primitiveTimestampNTZNanos:
+   return "TimestampNTZ(nanos)"
+   case primitiveUUID:
+   return "UUID"
+   }
+   return "Invalid"
+}
+
+func validPrimitiveValue(prim primitiveType) error {
+   if prim < primitiveNull || prim > primitiveUUID {
+   return fmt.Errorf("invalid primitive type: %d", prim)
+   }
+   return nil
+}
+
+func primitiveFromHeader(hdr byte) (primitiveType, error) {
+   // Special case the basic type of Short String and call it a Primitive 
String.
+   bt := BasicTypeFromHeader(hdr)
+   if bt == BasicShortString {
+   return primitiveString, nil
+   } else if bt == BasicPrimitive {
+   prim := primitiveType(hdr >> 2)
+   if err := validPrimitiveValue(prim); err != nil {
+   return primitiveInvalid, err
+   }
+   return prim, nil
+   }
+   return primitiveInvalid, fmt.Errorf("header is not of a primitive or 
short string basic type: %s", bt)
+}
+
+func primitiveHeader(prim primitiveType) (byte, error) {
+   if err := validPrimitiveValue(prim); err != nil {
+   return 0, err
+   }
+   hdr := byte(prim << 2)
+   hdr |= byte(BasicPrimitive)
+   return hdr, nil
+}
+
+// marshalPri

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064853352


##
parquet/variants/primitive.go:
##
@@ -0,0 +1,674 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "strings"
+   "time"
+   "unsafe"
+
+   "github.com/google/uuid"
+)
+
+// Variant primitive type IDs.
+type primitiveType int
+
+const (
+   primitiveInvalidprimitiveType = -1
+   primitiveNull   primitiveType = 0
+   primitiveTrue   primitiveType = 1
+   primitiveFalse  primitiveType = 2
+   primitiveInt8   primitiveType = 3
+   primitiveInt16  primitiveType = 4
+   primitiveInt32  primitiveType = 5
+   primitiveInt64  primitiveType = 6
+   primitiveDouble primitiveType = 7
+   primitiveDecimal4   primitiveType = 8  // TODO
+   primitiveDecimal8   primitiveType = 9  // TODO
+   primitiveDecimal16  primitiveType = 10 // TODO
+   primitiveDate   primitiveType = 11
+   primitiveTimestampMicrosprimitiveType = 12
+   primitiveTimestampNTZMicros primitiveType = 13
+   primitiveFloat  primitiveType = 14
+   primitiveBinary primitiveType = 15
+   primitiveString primitiveType = 16
+   primitiveTimeNTZprimitiveType = 17
+   primitiveTimestampNanos primitiveType = 18
+   primitiveTimestampNTZNanos  primitiveType = 19
+   primitiveUUID   primitiveType = 20
+)
+
+func (pt primitiveType) String() string {
+   switch pt {
+   case primitiveNull:
+   return "Null"
+   case primitiveFalse, primitiveTrue:
+   return "Boolean"
+   case primitiveInt8:
+   return "Int8"
+   case primitiveInt16:
+   return "Int16"
+   case primitiveInt32:
+   return "Int32"
+   case primitiveInt64:
+   return "Int64"
+   case primitiveDouble:
+   return "Double"
+   case primitiveDecimal4:
+   return "Decimal4"
+   case primitiveDecimal8:
+   return "Decimal8"
+   case primitiveDecimal16:
+   return "Decimal16"
+   case primitiveDate:
+   return "Date"
+   case primitiveTimestampMicros:
+   return "Timestamp(micros)"
+   case primitiveTimestampNTZMicros:
+   return "TimestampNTZ(micros)"
+   case primitiveFloat:
+   return "Float"
+   case primitiveBinary:
+   return "Binary"
+   case primitiveString:
+   return "String"
+   case primitiveTimeNTZ:
+   return "TimeNTZ"
+   case primitiveTimestampNanos:
+   return "Timestamp(nanos)"
+   case primitiveTimestampNTZNanos:
+   return "TimestampNTZ(nanos)"
+   case primitiveUUID:
+   return "UUID"
+   }
+   return "Invalid"
+}
+
+func validPrimitiveValue(prim primitiveType) error {
+   if prim < primitiveNull || prim > primitiveUUID {
+   return fmt.Errorf("invalid primitive type: %d", prim)
+   }
+   return nil
+}
+
+func primitiveFromHeader(hdr byte) (primitiveType, error) {
+   // Special case the basic type of Short String and call it a Primitive 
String.
+   bt := BasicTypeFromHeader(hdr)
+   if bt == BasicShortString {
+   return primitiveString, nil
+   } else if bt == BasicPrimitive {
+   prim := primitiveType(hdr >> 2)
+   if err := validPrimitiveValue(prim); err != nil {
+   return primitiveInvalid, err
+   }
+   return prim, nil
+   }
+   return primitiveInvalid, fmt.Errorf("header is not of a primitive or 
short string basic type: %s", bt)
+}
+
+func primitiveHeader(prim primitiveType) (byte, error) {
+   if err := validPrimitiveValue(prim); err != nil {
+   return 0, err
+   }
+   hdr := byte(prim << 2)
+   hdr |= byte(BasicPr

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064852563


##
parquet/variants/primitive.go:
##
@@ -0,0 +1,678 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "math"
+   "reflect"
+   "strings"
+   "time"
+)
+
+// Variant primitive type IDs.
+type primitiveType int
+
+const (
+   primitiveInvalidprimitiveType = -1
+   primitiveNull   primitiveType = 0
+   primitiveTrue   primitiveType = 1
+   primitiveFalse  primitiveType = 2
+   primitiveInt8   primitiveType = 3
+   primitiveInt16  primitiveType = 4
+   primitiveInt32  primitiveType = 5
+   primitiveInt64  primitiveType = 6
+   primitiveDouble primitiveType = 7
+   primitiveDecimal4   primitiveType = 8  // TODO
+   primitiveDecimal8   primitiveType = 9  // TODO
+   primitiveDecimal16  primitiveType = 10 // TODO
+   primitiveDate   primitiveType = 11
+   primitiveTimestampMicrosprimitiveType = 12
+   primitiveTimestampNTZMicros primitiveType = 13
+   primitiveFloat  primitiveType = 14
+   primitiveBinary primitiveType = 15
+   primitiveString primitiveType = 16
+   primitiveTimeNTZprimitiveType = 17
+   primitiveTimestampNanos primitiveType = 18
+   primitiveTimestampNTZNanos  primitiveType = 19
+   primitiveUUID   primitiveType = 20
+)
+
+func (pt primitiveType) String() string {
+   switch pt {
+   case primitiveNull:
+   return "Null"
+   case primitiveFalse, primitiveTrue:
+   return "Boolean"
+   case primitiveInt8:
+   return "Int8"
+   case primitiveInt16:
+   return "Int16"
+   case primitiveInt32:
+   return "Int32"
+   case primitiveInt64:
+   return "Int64"
+   case primitiveDouble:
+   return "Double"
+   case primitiveDecimal4:
+   return "Decimal4"
+   case primitiveDecimal8:
+   return "Decimal8"
+   case primitiveDecimal16:
+   return "Decimal16"
+   case primitiveDate:
+   return "Date"
+   case primitiveTimestampMicros:
+   return "Timestamp(micros)"
+   case primitiveTimestampNTZMicros:
+   return "TimestampNTZ(micros)"
+   case primitiveFloat:
+   return "Float"
+   case primitiveBinary:
+   return "Binary"
+   case primitiveString:
+   return "String"
+   case primitiveTimeNTZ:
+   return "TimeNTZ"
+   case primitiveTimestampNanos:
+   return "Timestamp(nanos)"
+   case primitiveTimestampNTZNanos:
+   return "TimestampNTZ(nanos)"
+   case primitiveUUID:
+   return "UUID"
+   }
+   return "Invalid"
+}
+
+func validPrimitiveValue(prim primitiveType) error {
+   if prim < primitiveNull || prim > primitiveUUID {
+   return fmt.Errorf("invalid primitive type: %d", prim)
+   }
+   return nil
+}
+
+func primitiveFromHeader(hdr byte) (primitiveType, error) {
+   // Special case the basic type of Short String and call it a Primitive 
String.
+   bt := BasicTypeFromHeader(hdr)
+   if bt == BasicShortString {
+   return primitiveString, nil
+   } else if bt == BasicPrimitive {
+   prim := primitiveType(hdr >> 2)
+   if err := validPrimitiveValue(prim); err != nil {
+   return primitiveInvalid, err
+   }
+   return prim, nil
+   }
+   return primitiveInvalid, fmt.Errorf("header is not of a primitive or 
short string basic type: %s", bt)
+}
+
+func primitiveHeader(prim primitiveType) (byte, error) {
+   if err := validPrimitiveValue(prim); err != nil {
+   return 0, err
+   }
+   hdr := byte(prim << 2)
+   hdr |= byte(BasicPrimitive)
+   return hdr, nil
+}
+
+// marshalPri

Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064846393


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   (Sorry, misread the Short String header, it's six bits. Removed that example)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064842943


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   Ah, I see where your confusion is coming from (the spec doesn't make it easy 
to find all these in one place).
   
   There are _so many places_ where widths are set, `is_large` is solely one 
place. In this case `is_large` indicates whether `num_elements` in the [Object 
Data](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2)
 is 1 or 4 bytes. I'm talking about `field_id` and `field_offset` sizes. From 
the header documentation:
   
   > `field_offset_size_minus_one` and `field_id_size_minus_one` are 2-bit 
values that represent the number of bytes used to encode the field offsets and 
field ids.
   
   In this case, `field_id` and `field_offset` can be 1-4 bytes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064842943


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   Ah, I see where your confusion is coming from (the spec doesn't make it easy 
to find all these in one place).
   
   There are _so many places_ where widths are set, `is_large` is solely one 
place. In this case `is_large` indicates whether `num_elements` in the [Object 
Data](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2)
 is 1 or 4 bytes. I'm talking about `field_id` and `field_offset` sizes. From 
the header documentation:
   
   > `field_offset_size_minus_one` and `field_id_size_minus_one` are 2-bit 
values that represent the number of bytes used to encode the field offsets and 
field ids.
   
   In this case, `field_id` and `field_offset` can be 1-4 bytes.
   
   Perhaps a better example here is in the 
[header](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-header-for-short-string-basic_type1)
 for `Short String`:
   
   > When `basic_type` is 1, value_header is a 6-bit `short_string_header` ... 
The `short_string_header` value is the length of the string.
   
   So a short string's length can be represented by 1-6 bytes inclusive.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064842449


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}
+
+// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a 
raw buffer at a specified offset.
+func readInt(raw []byte, offset, size int) (int64, error) {
+   u, err := readUint(raw, offset, size)
+   if err != nil {
+   return -1, err
+   }
+   return int64(u), nil
+}
+
+func fieldOffsetSize(maxSize int32) int {
+   if maxSize < 0xFF {
+   return 1
+   } else if maxSize < 0x {
+   return 2
+   } else if maxSize < 0xFF {
+   return 3
+   }
+   return 4
+}
+
+// Checks that a given range is in the provided raw buffer.
+func checkBounds(raw []byte, low, high int) error {
+   maxPos := len(raw)
+   if low >= maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", low, maxPos)
+   }
+   if high > maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", high, maxPos)
+   }
+   return nil
+}
+
+// Encodes a number of a specified width in little-endian format and writes to 
a writer.
+func encodeNumber(val int64, size int, w io.Writer) {
+   buf := make([]byte, size)
+   for i := range size {
+   buf[i] = byte(val)
+   val >>= 8
+   }
+   w.Write(buf)

Review Comment:
   The problem with rolling our own here is that we're going to need to manage 
the big/little endian logic ourselves then so that this runs properly on 
big-endian systems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064830536


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   Ah, now I see it. thats weird and annoying



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064826002


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   Where do you see that it could be 2 or 3 bytes?
   
   > If `is_large` is 0, 1 byte is used, and if `is_large` is 1, 4 bytes are 
used.
   
   According to the documentation it is **either** 1 byte or 4 bytes. Never 2 
or 3.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064778633


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}
+
+// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a 
raw buffer at a specified offset.
+func readInt(raw []byte, offset, size int) (int64, error) {
+   u, err := readUint(raw, offset, size)
+   if err != nil {
+   return -1, err
+   }
+   return int64(u), nil
+}
+
+func fieldOffsetSize(maxSize int32) int {
+   if maxSize < 0xFF {
+   return 1
+   } else if maxSize < 0x {
+   return 2
+   } else if maxSize < 0xFF {
+   return 3
+   }
+   return 4
+}
+
+// Checks that a given range is in the provided raw buffer.
+func checkBounds(raw []byte, low, high int) error {
+   maxPos := len(raw)
+   if low >= maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", low, maxPos)
+   }
+   if high > maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", high, maxPos)
+   }
+   return nil
+}
+
+// Encodes a number of a specified width in little-endian format and writes to 
a writer.
+func encodeNumber(val int64, size int, w io.Writer) {
+   buf := make([]byte, size)
+   for i := range size {
+   buf[i] = byte(val)
+   val >>= 8
+   }
+   w.Write(buf)

Review Comment:
   Similar vein to `Decode()`- this will work for the powers-of-two widths, but 
if we've got a width that's not (ie. 3) we'd encode with additional padding 
since we've got to encode to the next power of two width.
   
   This isn't the end of the world (encoding over the minimal necessary width 
_is_ within spec), so I have fewer objections here. I'd argue with `Decode()` 
rolling our own is more of a necessity due to not being in control of what 
comes in (eg. the Java library _can_ encode 3-byte wide numbers, so it's gotta 
be handled). This keeps the encode and decode logic fairly similar. Also, FWIW, 
I feel like the intention of the spec authors is to minimize the number of 
bytes used to encode things, see the existence of `Short String` even though a 
primitive `String` type exists.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064760350


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   The subtlety here: you need to be able to decode things of widths 1-8 
_inclusive_. For example, the [field offset 
size](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-header-for-object-basic_type2)
 of an object can be 1, 2, 3, or 4 bytes. `binary.Decode` can handle the 1, 2, 
and 4 case, but cannot handle the 3-byte width case (see [this 
example](https://go.dev/play/p/5QKZ1ove5Ab) in the Go playground).
   
   For the odd-width cases I suppose we could pad the buffer with additional 
zeros, but that's getting gnarly just to make `binary.Decode` work IMO



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064332295


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}
+
+// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a 
raw buffer at a specified offset.
+func readInt(raw []byte, offset, size int) (int64, error) {
+   u, err := readUint(raw, offset, size)
+   if err != nil {
+   return -1, err
+   }
+   return int64(u), nil
+}
+
+func fieldOffsetSize(maxSize int32) int {
+   if maxSize < 0xFF {
+   return 1
+   } else if maxSize < 0x {
+   return 2
+   } else if maxSize < 0xFF {
+   return 3
+   }
+   return 4
+}
+
+// Checks that a given range is in the provided raw buffer.
+func checkBounds(raw []byte, low, high int) error {
+   maxPos := len(raw)
+   if low >= maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", low, maxPos)
+   }
+   if high > maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", high, maxPos)
+   }
+   return nil
+}
+
+// Encodes a number of a specified width in little-endian format and writes to 
a writer.
+func encodeNumber(val int64, size int, w io.Writer) {
+   buf := make([]byte, size)
+   for i := range size {
+   buf[i] = byte(val)
+   val >>= 8
+   }
+   w.Write(buf)

Review Comment:
   you can use `binary.Encode(buf, binary.LittleEndian, val)` which will handle 
the whole range for you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-28 Thread via GitHub


zeroshade commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064330741


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   Then you could probably just do something like:
   
   ```go
   switch size {
   case 1:
   var val uint8
   _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val)
   return val, err
   case 2:
   var val uint16
   _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val)
   return val, err
   case 4:
   var val uint32
   _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val)
   return val, err
   case 8:
   var val uint64
   _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val)
   return val, err
   }
   ```
   
   right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-22 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2054836925


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}
+
+// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a 
raw buffer at a specified offset.
+func readInt(raw []byte, offset, size int) (int64, error) {
+   u, err := readUint(raw, offset, size)
+   if err != nil {
+   return -1, err
+   }
+   return int64(u), nil
+}
+
+func fieldOffsetSize(maxSize int32) int {
+   if maxSize < 0xFF {
+   return 1
+   } else if maxSize < 0x {
+   return 2
+   } else if maxSize < 0xFF {
+   return 3
+   }
+   return 4
+}
+
+// Checks that a given range is in the provided raw buffer.
+func checkBounds(raw []byte, low, high int) error {
+   maxPos := len(raw)
+   if low >= maxPos {

Review Comment:
   Good call. Done, plus I realized that I forgot tests for this function, so I 
added that as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-22 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2054827281


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}
+
+// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a 
raw buffer at a specified offset.
+func readInt(raw []byte, offset, size int) (int64, error) {
+   u, err := readUint(raw, offset, size)
+   if err != nil {
+   return -1, err
+   }
+   return int64(u), nil
+}
+
+func fieldOffsetSize(maxSize int32) int {
+   if maxSize < 0xFF {
+   return 1
+   } else if maxSize < 0x {
+   return 2
+   } else if maxSize < 0xFF {
+   return 3
+   }
+   return 4
+}
+
+// Checks that a given range is in the provided raw buffer.
+func checkBounds(raw []byte, low, high int) error {
+   maxPos := len(raw)
+   if low >= maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", low, maxPos)
+   }
+   if high > maxPos {
+   return fmt.Errorf("out of bounds: trying to access position %d, 
max is %d", high, maxPos)
+   }
+   return nil
+}
+
+// Encodes a number of a specified width in little-endian format and writes to 
a writer.
+func encodeNumber(val int64, size int, w io.Writer) {
+   buf := make([]byte, size)
+   for i := range size {
+   buf[i] = byte(val)
+   val >>= 8
+   }
+   w.Write(buf)

Review Comment:
   Similar to the comment in `readUint()`, the `binary` package only provides 
Put/Append functionality for `uint{16,32,64}`, but we need it for the whole 
range from 1-8 bytes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-22 Thread via GitHub


sfc-gh-mbojanczyk commented on code in PR #344:
URL: https://github.com/apache/arrow-go/pull/344#discussion_r2054810924


##
parquet/variants/util.go:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package variants
+
+import (
+   "fmt"
+   "io"
+   "reflect"
+   "time"
+)
+
+// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw 
buffer at a specified
+// offset and returns its value. If any part of the read would be out of 
bounds, this returns an error.
+func readUint(raw []byte, offset, size int) (uint64, error) {
+   if size < 1 || size > 8 {
+   return 0, fmt.Errorf("invalid size, must be in range [1,8]: 
%d", size)
+   }
+   if maxPos := offset + size; maxPos > len(raw) {
+   return 0, fmt.Errorf("out of bounds: trying to access position 
%d, max position is %d", maxPos, len(raw))
+   }
+   var ret uint64
+   for i := range size {
+   ret |= uint64(raw[i+offset]) << (8 * i)
+   }
+   return ret, nil
+}

Review Comment:
   The unfortunate bit here: we need to handle uints of varying widths from 1-8 
bytes, but the `binary` package expects things of a set width (eg. `Uint64()` 
wants to [read 8 
bytes](https://cs.opensource.google/go/go/+/refs/tags/go1.24.2:src/encoding/binary/binary.go;l=115-119)
 always, so that would panic for anything smaller).
   
   So as-is, `binary` can't really handle all the different widths we need to 
here, which is why I rolled my own.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]

2025-04-16 Thread via GitHub


sfc-gh-mbojanczyk commented on PR #344:
URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2811034930

   > Made an initial pass on this and left a bunch of comments. I'll try to 
take another look later on
   
   Thanks so much! I'll get to addressing the comments here shortly (was out 
for a few days)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org