Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade merged PR #344: URL: https://github.com/apache/arrow-go/pull/344 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2112493931 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: > At the end of the day, this is your codebase and I'm just tossing in my 2c. Yup, but I still value the feedback. Especially as I am often *way* too deep in the weeds and will miss things that can affect usability by consumers. So the feedback is important. > `Append()` does the overwhelming majority of the heavy lifting for users and is probably going to be the most used entry point, so this PR is fully featured enough (ie. it's a good hole for folk to drop into). Cool. I'll push this as is for now and then we can look at creating the higher-level "nicer" API for simplifying object/array construction later on. Another reason is that I want to get the basic variant support out and my "compromise" version has a lot of duplicated code in there that I wasn't able to get to my liking yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.o
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on PR #344: URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2917181016 Can't actually toss in my approval since I'm the original author, but this PR looks good to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2112471286 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: > Yea, Finish() errors if the children haven't had Finish() called first. Perfect! > I think @lidavidm has a good point here. The "compromise" example I put together is just built on top of the existing Builder API I've put together by creating the "higher-level" API on top of the "lower-level" one that I've already built. Any objections to pushing through what we've already got here and then following up with the higher-level builder that handles the offsets and fields for you as a separate PR afterwards @sfc-gh-mbojanczyk? At the end of the day, this is your codebase and I'm just tossing in my 2c. In my experience, users are going to fall into holes- it's the library author's job to lead them to the correct one :) `Append()` does the overwhelming majority of the heavy lifting for users and is probably going to be the most used entry point, so this PR is fully featured enough (ie. it's a good hole for folk to drop into). -- This is an automated message from the Apache Git Service. To respond to
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2112139261 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: > I like it. I assume you'll do some handling if the parent has Finish() called without the children's Finish() (either panic, or just call their Finish() methods, right? Yea, `Finish()` errors if the children haven't had `Finish()` called first. > You can also still just directly append a map or slice instead of keeping track manually - would that solve your use case? (Also, presumably we can add the higher level builders separately/on top of this API? Does it have to be built into the builder for this first pass?) I think @lidavidm has a good point here. The "compromise" example I put together is just built on top of the existing Builder API I've put together by creating the "higher-level" API on top of the "lower-level" one that I've already built. Any objections to pushing through what we've already got here and then following up with the higher-level builder that handles the offsets and fields for you as a separate PR afterwards @sfc-gh-mbojanczyk? -- This is an automated message from the Apache Git
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
lidavidm commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110573898 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: You can also still just directly append a map or slice instead of keeping track manually - would that solve your use case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
lidavidm commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110574589 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: (Also, presumably we can add the higher level builders separately/on top of this API? Does it have to be built into the builder for this first pass?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110554283 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: I like it. I assume you'll do some handling if the parent has `Finish()` called without the children's `Finish()` (either panic, or just call their `Finish()` methods, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110486245 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: Okay, what do you think of something like this: ```go var b variant.Builder obj, err := b.NewObject() // handle err obj.AppendInt("id", 1) observe, err := obj.AppendObject("observation") // handle err observe.AppendString("location", "In the Volcano") observe.AppendString("time", "12:34:56") value, err := observe.AppendObject("value") // handle err value.AppendInt("humidity", 456) value.AppendInt("temperature", 123) value.Finish() observe.Finish() species, err := obj.AppendObject("species") // handle err species.AppendString("name", "lava monster") species.AppendInt("population", 6789) species.Finish() obj.Finish() v, err := b.Build() // handle err out, err := json.Marshal(v) /* result is this `{ "id": 1, "observation": { "location": "In the V
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110433238 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: > Things also get tricky with sorting of metadata keys and building (as you can't know the index of the dictionary key until build time, which then changes an Object's index, which can then change the size of offsets, etc...). Well that's an entirely different ball of worms. Even if you use the nested object approach, things are still tricky if you want to sort the metadata keys as you end up having to patch, modify, or entirely rewrite any already written objects when you change the field index by sorting the metadata keys. (See the [rust impl](https://github.com/apache/arrow-rs/pull/7452/files#diff-83c4572d83951bc1234fd12d3cc6537b1e025188f06c39b378b0f7b9c32aa4f3)). So using the nested approach with multiple buffers doesn't help us make sorting metadata keys more efficient. > I don't think it can be done yea, that's why the API looks the way it does right now, so that we can build everything using a single buffer instead of needing multiple buffers and copying a bunch of stuff around. I've managed
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110236886 ## parquet/variant/variant.go: ## @@ -0,0 +1,722 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "encoding/binary" + "encoding/json" + "errors" + "fmt" + "iter" + "maps" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/apache/arrow-go/v18/arrow/decimal128" + "github.com/apache/arrow-go/v18/parquet/internal/debug" + "github.com/google/uuid" +) + +//go:generate go tool stringer -type=BasicType -linecomment -output=basic_type_string.go +//go:generate go tool stringer -type=PrimitiveType -linecomment -output=primitive_type_string.go + +// BasicType represents the fundamental type category of a variant value. +type BasicType int + +const ( + BasicUndefined BasicType = iota - 1 // Unknown + BasicPrimitive// Primitive + BasicShortString // ShortString + BasicObject // Object + BasicArray// Array +) + +func basicTypeFromHeader(hdr byte) BasicType { + // because we're doing hdr & 0x3, it is impossible for the result + // to be outside of the range of BasicType. Therefore, we don't + // need to perform any checks. The value will always be [0,3] + return BasicType(hdr & basicTypeMask) +} + +// PrimitiveType represents specific primitive data types within the variant format. +type PrimitiveType int + +const ( + PrimitiveInvalidPrimitiveType = iota - 1 // Unknown Review Comment: Oh right! I think I ended up explicitly defining each primitive value (ie. `primitiveNull = 0`, `primitiveFalse = 1`, etc...) because I've been bitten by relying on `iota` in the past (a well meaning coworker reordered my enum to be in alphabetical order and a bunch of tests blew up). IMO, if it's defined in spec, it's safer to be explicit instead of using an enum. ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Ar
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110077872 ## parquet/variant/variant.go: ## @@ -0,0 +1,722 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "encoding/binary" + "encoding/json" + "errors" + "fmt" + "iter" + "maps" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/apache/arrow-go/v18/arrow/decimal128" + "github.com/apache/arrow-go/v18/parquet/internal/debug" + "github.com/google/uuid" +) + +//go:generate go tool stringer -type=BasicType -linecomment -output=basic_type_string.go +//go:generate go tool stringer -type=PrimitiveType -linecomment -output=primitive_type_string.go + +// BasicType represents the fundamental type category of a variant value. +type BasicType int + +const ( + BasicUndefined BasicType = iota - 1 // Unknown + BasicPrimitive// Primitive + BasicShortString // ShortString + BasicObject // Object + BasicArray// Array +) + +func basicTypeFromHeader(hdr byte) BasicType { + // because we're doing hdr & 0x3, it is impossible for the result + // to be outside of the range of BasicType. Therefore, we don't + // need to perform any checks. The value will always be [0,3] + return BasicType(hdr & basicTypeMask) +} + +// PrimitiveType represents specific primitive data types within the variant format. +type PrimitiveType int + +const ( + PrimitiveInvalidPrimitiveType = iota - 1 // Unknown Review Comment: These constants have specific values defined by the spec, so the zero value needs to be Null as the constant for "Null" type is 0. I don't really have wiggle room to change that. Though, technically if all variant building is done through the builder, we could probably get away without exporting this enum and these constants. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110087873 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: If we instead use a `StartObject`/`StartArray` pattern to construct child builders, then the user ends up having to keep track of the child builders instead of keeping track of the offests, so I'm not sure the added complexity of having to check and enforce that all the nested builders are completed before calling `Finish`/`Build` is worth it or saves that much. I also couldn't think of a clean way to keep the ability to build the whole object with a single buffer (instead of multiple buffers) while handling the nested builders and enforcing all the checks without a ton of repetitive code. I'll think about this a bit more and see if I can come up with something I like. @lidavid @mapleFU what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110087873 ## parquet/variant/doc.go: ## @@ -0,0 +1,142 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Package variant provides an implementation of the Apache Parquet Variant data type. +// +// The Variant type is a flexible binary format designed to represent complex nested +// data structures with minimal overhead. It supports a wide range of primitive types +// as well as nested arrays and objects (similar to JSON). The format uses a memory-efficient +// binary representation with a separate metadata section for dictionary encoding of keys. +// +// # Key Components +// +// - [Value]: The primary type representing a variant value +// - [Metadata]: Contains information about the dictionary of keys +// - [Builder]: Used to construct variant values +// +// # Format Overview +// +// The variant format consists of two parts: +// +// 1. Metadata: A dictionary of keys used in objects +// 2. Value: The actual data payload +// +// Values can be one of the following types: +// +// - Primitive values (null, bool, int8/16/32/64, float32/64, etc.) +// - Short strings (less than 64 bytes) +// - Long strings and binary data +// - Date, time and timestamp values +// - Decimal values (4, 8, or 16 bytes) +// - Arrays of any variant value +// - Objects with key-value pairs +// +// # Working with Variants +// +// To create a variant value, use the Builder: +// +// var b variant.Builder +// b.Append(map[string]any{ +// "id": 123, +// "name": "example", +// "data": []any{1, 2, 3}, +// }) +// value, err := b.Build() +// +// To parse an existing variant value: +// +// v, err := variant.New(metadataBytes, valueBytes) +// +// You can access the data using the [Value.Value] method which returns the appropriate Go type: +// +// switch v.Type() { +// case variant.Object: +// obj := v.Value().(variant.ObjectValue) +// field, err := obj.ValueByKey("name") +// case variant.Array: +// arr := v.Value().(variant.ArrayValue) +// elem, err := arr.Value(0) +// case variant.String: +// s := v.Value().(string) +// case variant.Int64: +// i := v.Value().(int64) +// } +// +// You can also switch on the type of the result value from the [Value.Value] method: +// +// switch val := v.Value().(type) { +// case nil: +// // ... +// case int32: +// // ... +// case string: +// // ... +// case variant.ArrayValue: +// for i, item := range val.Values() { +// // item is a variant.Value +// } +// case variant.ObjectValue: +// for k, item := range val.Values() { +// // k is the field key +// // item is a variant.Value for that field +// } +// } +// +// Values can also be converted to JSON: +// +// jsonBytes, err := json.Marshal(v) +// +// # Low-level Construction +// +// For direct construction of complex nested structures, you can use the low-level +// methods: +// +// var b variant.Builder Review Comment: If we instead use a `StartObject`/`StartArray` pattern to construct child builders, then the user ends up having to keep track of the child builders instead of keeping track of the offests, so I'm not sure the added complexity of having to check and enforce that all the nested builders are completed before calling `Finish`/`Build` is worth it or saves that much. I also couldn't think of a clean way to keep the ability to build the whole object with a single buffer (instead of multiple buffers) while handling the nested builders and enforcing all the checks without a ton of repetitive code. I'll think about this a bit more and see if I can come up with something I like. @lidavid @wgtmac what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.or
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110080419 ## parquet/variant/basic_type_string.go: ## Review Comment: sure, i'll update the go-generate lines and have it rename the files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2110079492 ## parquet/variant/builder.go: ## @@ -0,0 +1,852 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "cmp" + "encoding/binary" + "errors" + "fmt" + "io" + "math" + "reflect" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/google/uuid" +) + +// Builder is used to construct Variant values by appending data of various types. +// It manages an internal buffer for the value data and a dictionary for field keys. +type Builder struct { + buf bytes.Buffer + dictmap[string]uint32 + dictKeys[][]byte + allowDuplicates bool +} + +// SetAllowDuplicates controls whether duplicate keys are allowed in objects. +// When true, the last value for a key is used. When false, an error is returned +// if a duplicate key is detected. +func (b *Builder) SetAllowDuplicates(allow bool) { + b.allowDuplicates = allow +} + +// AddKey adds a key to the builder's dictionary and returns its ID. +// If the key already exists in the dictionary, its existing ID is returned. +func (b *Builder) AddKey(key string) (id uint32) { + if b.dict == nil { + b.dict = make(map[string]uint32) + b.dictKeys = make([][]byte, 0, 16) + } + + var ok bool + if id, ok = b.dict[key]; ok { + return id + } + + id = uint32(len(b.dictKeys)) + b.dict[key] = id + b.dictKeys = append(b.dictKeys, unsafe.Slice(unsafe.StringData(key), len(key))) + + return id +} + +// AppendOpt represents options for appending time-related values. These are only +// used when using the generic Append method that takes an interface{}. +type AppendOpt int16 + +const ( + // OptTimestampNano specifies that timestamps should use nanosecond precision, + // otherwise microsecond precision is used. + OptTimestampNano AppendOpt = 1 << iota + // OptTimestampUTC specifies that timestamps should be in UTC timezone, otherwise + // no time zone (NTZ) is used. + OptTimestampUTC + // OptTimeAsDate specifies that time.Time values should be encoded as dates + OptTimeAsDate + // OptTimeAsTime specifies that time.Time values should be encoded as a time value + OptTimeAsTime +) + +func extractFieldInfo(f reflect.StructField) (name string, o AppendOpt) { + tag := f.Tag.Get("variant") + if tag == "" { + return f.Name, 0 + } + + parts := strings.Split(tag, ",") + if len(parts) == 1 { + return parts[0], 0 + } + + name = parts[0] + if name == "" { + name = f.Name + } + + for _, opt := range parts[1:] { + switch strings.ToLower(opt) { + case "nanos": + o |= OptTimestampNano + case "utc": + o |= OptTimestampUTC + case "date": + o |= OptTimeAsDate + case "time": + o |= OptTimeAsTime + } + } + + return name, o +} + +// Append adds a value of any supported type to the builder. +// +// Any basic primitive type is supported, the AppendOpt options are used to control how +// timestamps are appended (e.g., as microseconds or nanoseconds and timezone). The options +// also control how a [time.Time] value is appended (e.g., as a date, timestamp, or time). +// +// Appending a value with type `[]any` will construct an array appropriately, appending +// each element. Calling with a map[string]any will construct an object, recursively calling +// Append for each value, propagating the options. +// +// For other types (arbitrary slices, arrays, maps and structs), reflection is used to determine +// the type and whether we can append it. A nil pointer will append a null, while a non-nil +// pointer will append the va
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2109823794 ## parquet/variant/variant.go: ## @@ -0,0 +1,722 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "encoding/binary" + "encoding/json" + "errors" + "fmt" + "iter" + "maps" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/apache/arrow-go/v18/arrow/decimal128" + "github.com/apache/arrow-go/v18/parquet/internal/debug" + "github.com/google/uuid" +) + +//go:generate go tool stringer -type=BasicType -linecomment -output=basic_type_string.go +//go:generate go tool stringer -type=PrimitiveType -linecomment -output=primitive_type_string.go + +// BasicType represents the fundamental type category of a variant value. +type BasicType int + +const ( + BasicUndefined BasicType = iota - 1 // Unknown + BasicPrimitive// Primitive + BasicShortString // ShortString + BasicObject // Object + BasicArray// Array +) + +func basicTypeFromHeader(hdr byte) BasicType { + // because we're doing hdr & 0x3, it is impossible for the result + // to be outside of the range of BasicType. Therefore, we don't + // need to perform any checks. The value will always be [0,3] + return BasicType(hdr & basicTypeMask) +} + +// PrimitiveType represents specific primitive data types within the variant format. +type PrimitiveType int + +const ( + PrimitiveInvalidPrimitiveType = iota - 1 // Unknown + PrimitiveNull// Null + PrimitiveBoolTrue// BoolTrue + PrimitiveBoolFalse // BoolFalse + PrimitiveInt8// Int8 + PrimitiveInt16 // Int16 + PrimitiveInt32 // Int32 + PrimitiveInt64 // Int64 + PrimitiveDouble // Double + PrimitiveDecimal4// Decimal32 + PrimitiveDecimal8// Decimal64 + PrimitiveDecimal16 // Decimal128 + PrimitiveDate// Date + PrimitiveTimestampMicros // Timestamp(micros) + PrimitiveTimestampMicrosNTZ // TimestampNTZ(micros) + PrimitiveFloat // Float + PrimitiveBinary // Binary + PrimitiveString // String + PrimitiveTimeMicrosNTZ // TimeNTZ(micros) + PrimitiveTimestampNanos // Timestamp(nanos) + PrimitiveTimestampNanosNTZ // TimestampNTZ(nanos) + PrimitiveUUID// UUID +) + +func primitiveTypeFromHeader(hdr byte) PrimitiveType { + return PrimitiveType((hdr >> basicTypeBits) & typeInfoMask) +} + +// Type represents the high-level variant data type. +// This is what applications typically use to identify the type of a variant value. +type Type int + +const ( + Object Type = iota + Array + Null + Bool + Int8 + Int16 + Int32 + Int64 + String + Double + Decimal4 + Decimal8 + Decimal16 + Date + TimestampMicros + TimestampMicrosNTZ + Float + Binary + Time + TimestampNanos + TimestampNanosNTZ + UUID +) + +const ( + versionMaskuint8 = 0x0F + sortedStrMask uint8 = 0b1 + basicTypeMask uint8 = 0x3 + basicTypeBits uint8 = 2 + t
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on PR #344: URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2913095614 I'll give @sfc-gh-mbojanczyk a chance to comment and respond here before I merge this just to make sure I get all the feedback I can. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
lidavidm commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2106054776 ## parquet/variant/variant_test.go: ## Review Comment: Thanks. I wonder if fuzzing the variants specifically would be a more manageable case to start with. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105906717 ## parquet/variant/variant.go: ## @@ -0,0 +1,725 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "encoding/binary" + "encoding/json" + "errors" + "fmt" + "iter" + "maps" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/apache/arrow-go/v18/arrow/decimal128" + "github.com/apache/arrow-go/v18/parquet/internal/debug" + "github.com/google/uuid" +) + +//go:generate go tool stringer -type=BasicType -linecomment -output=basic_type_string.go +//go:generate go tool stringer -type=PrimitiveType -linecomment -output=primitive_type_string.go + +// BasicType represents the fundamental type category of a variant value. +type BasicType int + +const ( + BasicUndefined BasicType = iota - 1 // Unknown + BasicPrimitive// Primitive + BasicShortString // ShortString + BasicObject // Object + BasicArray// Array +) + +func basicTypeFromHeader(hdr byte) BasicType { + return BasicType(hdr & basicTypeMask) Review Comment: turns out there's no need for a `debug.Assert` here, because we're doing `& basicTypeMask` we're guaranteed the result is going to be in the correct range for the `BasicType` enum: `[0,3]` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105891795 ## parquet/variant/variant_test.go: ## Review Comment: Currently the Go implementation doesn't have any fuzzing set up. Go has a whole infrastructure for setting up fuzz testing (https://go.dev/doc/security/fuzz/) I just haven't gotten around to setting it up. It just hasn't been on a high priority given everything else unless people think I should prioritize it. That said, I think it makes sense to test some examples of invalid variants. I'll add some tests for that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105891543 ## parquet/variant/builder.go: ## @@ -0,0 +1,847 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "cmp" + "encoding/binary" + "errors" + "fmt" + "io" + "math" + "reflect" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/google/uuid" +) + +// Builder is used to construct Variant values by appending data of various types. +// It manages an internal buffer for the value data and a dictionary for field keys. +type Builder struct { + buf bytes.Buffer + dictmap[string]uint32 + dictKeys[][]byte + allowDuplicates bool +} + +// SetAllowDuplicates controls whether duplicate keys are allowed in objects. +// When true, the last value for a key is used. When false, an error is returned +// if a duplicate key is detected. +func (b *Builder) SetAllowDuplicates(allow bool) { + b.allowDuplicates = allow +} Review Comment: I don't think so. I think we probably want to see where the whole ecosystem falls on this and then see if we can get all the implementations to align. I took this behavior from the spark implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105891187 ## parquet/variant/variant.go: ## @@ -0,0 +1,725 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "encoding/binary" + "encoding/json" + "errors" + "fmt" + "iter" + "maps" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/apache/arrow-go/v18/arrow/decimal128" + "github.com/apache/arrow-go/v18/parquet/internal/debug" + "github.com/google/uuid" +) + +//go:generate go tool stringer -type=BasicType -linecomment -output=basic_type_string.go +//go:generate go tool stringer -type=PrimitiveType -linecomment -output=primitive_type_string.go + +// BasicType represents the fundamental type category of a variant value. +type BasicType int + +const ( + BasicUndefined BasicType = iota - 1 // Unknown + BasicPrimitive// Primitive + BasicShortString // ShortString + BasicObject // Object + BasicArray// Array +) + +func basicTypeFromHeader(hdr byte) BasicType { + return BasicType(hdr & basicTypeMask) Review Comment: sure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105890998 ## parquet/variant/utils.go: ## Review Comment: I don't think so. All of the functions in utils.go are already covered by the existing unit test coverage, so I don't see any need for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
lidavidm commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2105717305 ## parquet/variant/variant.go: ## @@ -0,0 +1,725 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "encoding/binary" + "encoding/json" + "errors" + "fmt" + "iter" + "maps" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/apache/arrow-go/v18/arrow/decimal128" + "github.com/apache/arrow-go/v18/parquet/internal/debug" + "github.com/google/uuid" +) + +//go:generate go tool stringer -type=BasicType -linecomment -output=basic_type_string.go +//go:generate go tool stringer -type=PrimitiveType -linecomment -output=primitive_type_string.go + +// BasicType represents the fundamental type category of a variant value. +type BasicType int + +const ( + BasicUndefined BasicType = iota - 1 // Unknown + BasicPrimitive// Primitive + BasicShortString // ShortString + BasicObject // Object + BasicArray// Array +) + +func basicTypeFromHeader(hdr byte) BasicType { + return BasicType(hdr & basicTypeMask) Review Comment: worth having a debug.Assert perhaps? ## parquet/variant/builder.go: ## @@ -0,0 +1,847 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variant + +import ( + "bytes" + "cmp" + "encoding/binary" + "errors" + "fmt" + "io" + "math" + "reflect" + "slices" + "strings" + "time" + "unsafe" + + "github.com/apache/arrow-go/v18/arrow" + "github.com/apache/arrow-go/v18/arrow/decimal" + "github.com/google/uuid" +) + +// Builder is used to construct Variant values by appending data of various types. +// It manages an internal buffer for the value data and a dictionary for field keys. +type Builder struct { + buf bytes.Buffer + dictmap[string]uint32 + dictKeys[][]byte + allowDuplicates bool +} + +// SetAllowDuplicates controls whether duplicate keys are allowed in objects. +// When true, the last value for a key is used. When false, an error is returned +// if a duplicate key is detected. +func (b *Builder) SetAllowDuplicates(allow bool) { + b.allowDuplicates = allow +} + +// AddKey adds a key to the builder's dictionary and returns its ID. +// If the key already exists in the dictionary, its existing ID is returned. +func (b *Builder) AddKey(key string) (id uint32) { + if b.dict == nil { + b.dict = make(map[string]uint32) + b.dictKeys = make([][]byte, 0, 16) + } + + var ok bool + if id, ok = b.dict[key]; ok { + return id + } + + id = uint32(len(b.dictKeys)) + b.dict[key] = id + b.dictKeys = append(b.dictKeys, unsafe.Slice(unsafe.StringData(key), len(key))) + + return id +} + +// AppendOpt represents options for appending time-related values. These are only +// used when using the generic Append method that takes an interface{}. +type AppendOpt int16 + +const ( + // OptTimestampNano specifies that timestamps should use nanosecond precision, +
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on PR #344: URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2906037583 > @sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks! No worries- I got caught in a whirlwind over here myself and was about to dust this off too :) Lemme take a peek here after the long weekend. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on PR #344: URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2905641862 @sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064964717 ## parquet/variants/primitive.go: ## @@ -0,0 +1,678 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "math" + "reflect" + "strings" + "time" +) + +// Variant primitive type IDs. +type primitiveType int + +const ( + primitiveInvalidprimitiveType = -1 + primitiveNull primitiveType = 0 + primitiveTrue primitiveType = 1 + primitiveFalse primitiveType = 2 + primitiveInt8 primitiveType = 3 + primitiveInt16 primitiveType = 4 + primitiveInt32 primitiveType = 5 + primitiveInt64 primitiveType = 6 + primitiveDouble primitiveType = 7 + primitiveDecimal4 primitiveType = 8 // TODO + primitiveDecimal8 primitiveType = 9 // TODO + primitiveDecimal16 primitiveType = 10 // TODO + primitiveDate primitiveType = 11 + primitiveTimestampMicrosprimitiveType = 12 + primitiveTimestampNTZMicros primitiveType = 13 + primitiveFloat primitiveType = 14 + primitiveBinary primitiveType = 15 + primitiveString primitiveType = 16 + primitiveTimeNTZprimitiveType = 17 + primitiveTimestampNanos primitiveType = 18 + primitiveTimestampNTZNanos primitiveType = 19 + primitiveUUID primitiveType = 20 +) + +func (pt primitiveType) String() string { + switch pt { + case primitiveNull: + return "Null" + case primitiveFalse, primitiveTrue: + return "Boolean" + case primitiveInt8: + return "Int8" + case primitiveInt16: + return "Int16" + case primitiveInt32: + return "Int32" + case primitiveInt64: + return "Int64" + case primitiveDouble: + return "Double" + case primitiveDecimal4: + return "Decimal4" + case primitiveDecimal8: + return "Decimal8" + case primitiveDecimal16: + return "Decimal16" + case primitiveDate: + return "Date" + case primitiveTimestampMicros: + return "Timestamp(micros)" + case primitiveTimestampNTZMicros: + return "TimestampNTZ(micros)" + case primitiveFloat: + return "Float" + case primitiveBinary: + return "Binary" + case primitiveString: + return "String" + case primitiveTimeNTZ: + return "TimeNTZ" + case primitiveTimestampNanos: + return "Timestamp(nanos)" + case primitiveTimestampNTZNanos: + return "TimestampNTZ(nanos)" + case primitiveUUID: + return "UUID" + } + return "Invalid" +} + +func validPrimitiveValue(prim primitiveType) error { + if prim < primitiveNull || prim > primitiveUUID { + return fmt.Errorf("invalid primitive type: %d", prim) + } + return nil +} + +func primitiveFromHeader(hdr byte) (primitiveType, error) { + // Special case the basic type of Short String and call it a Primitive String. + bt := BasicTypeFromHeader(hdr) + if bt == BasicShortString { + return primitiveString, nil + } else if bt == BasicPrimitive { + prim := primitiveType(hdr >> 2) + if err := validPrimitiveValue(prim); err != nil { + return primitiveInvalid, err + } + return prim, nil + } + return primitiveInvalid, fmt.Errorf("header is not of a primitive or short string basic type: %s", bt) +} + +func primitiveHeader(prim primitiveType) (byte, error) { + if err := validPrimitiveValue(prim); err != nil { + return 0, err + } + hdr := byte(prim << 2) + hdr |= byte(BasicPrimitive) + return hdr, nil +} + +// marshalPri
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064947485 ## parquet/variants/primitive.go: ## @@ -0,0 +1,678 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "math" + "reflect" + "strings" + "time" +) + +// Variant primitive type IDs. +type primitiveType int + +const ( + primitiveInvalidprimitiveType = -1 + primitiveNull primitiveType = 0 + primitiveTrue primitiveType = 1 + primitiveFalse primitiveType = 2 + primitiveInt8 primitiveType = 3 + primitiveInt16 primitiveType = 4 + primitiveInt32 primitiveType = 5 + primitiveInt64 primitiveType = 6 + primitiveDouble primitiveType = 7 + primitiveDecimal4 primitiveType = 8 // TODO + primitiveDecimal8 primitiveType = 9 // TODO + primitiveDecimal16 primitiveType = 10 // TODO + primitiveDate primitiveType = 11 + primitiveTimestampMicrosprimitiveType = 12 + primitiveTimestampNTZMicros primitiveType = 13 + primitiveFloat primitiveType = 14 + primitiveBinary primitiveType = 15 + primitiveString primitiveType = 16 + primitiveTimeNTZprimitiveType = 17 + primitiveTimestampNanos primitiveType = 18 + primitiveTimestampNTZNanos primitiveType = 19 + primitiveUUID primitiveType = 20 +) + +func (pt primitiveType) String() string { + switch pt { + case primitiveNull: + return "Null" + case primitiveFalse, primitiveTrue: + return "Boolean" + case primitiveInt8: + return "Int8" + case primitiveInt16: + return "Int16" + case primitiveInt32: + return "Int32" + case primitiveInt64: + return "Int64" + case primitiveDouble: + return "Double" + case primitiveDecimal4: + return "Decimal4" + case primitiveDecimal8: + return "Decimal8" + case primitiveDecimal16: + return "Decimal16" + case primitiveDate: + return "Date" + case primitiveTimestampMicros: + return "Timestamp(micros)" + case primitiveTimestampNTZMicros: + return "TimestampNTZ(micros)" + case primitiveFloat: + return "Float" + case primitiveBinary: + return "Binary" + case primitiveString: + return "String" + case primitiveTimeNTZ: + return "TimeNTZ" + case primitiveTimestampNanos: + return "Timestamp(nanos)" + case primitiveTimestampNTZNanos: + return "TimestampNTZ(nanos)" + case primitiveUUID: + return "UUID" + } + return "Invalid" +} + +func validPrimitiveValue(prim primitiveType) error { + if prim < primitiveNull || prim > primitiveUUID { + return fmt.Errorf("invalid primitive type: %d", prim) + } + return nil +} + +func primitiveFromHeader(hdr byte) (primitiveType, error) { + // Special case the basic type of Short String and call it a Primitive String. + bt := BasicTypeFromHeader(hdr) + if bt == BasicShortString { + return primitiveString, nil + } else if bt == BasicPrimitive { + prim := primitiveType(hdr >> 2) + if err := validPrimitiveValue(prim); err != nil { + return primitiveInvalid, err + } + return prim, nil + } + return primitiveInvalid, fmt.Errorf("header is not of a primitive or short string basic type: %s", bt) +} + +func primitiveHeader(prim primitiveType) (byte, error) { + if err := validPrimitiveValue(prim); err != nil { + return 0, err + } + hdr := byte(prim << 2) + hdr |= byte(BasicPrimitive) + return hdr, nil +} + +// ma
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064947485 ## parquet/variants/primitive.go: ## @@ -0,0 +1,678 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "math" + "reflect" + "strings" + "time" +) + +// Variant primitive type IDs. +type primitiveType int + +const ( + primitiveInvalidprimitiveType = -1 + primitiveNull primitiveType = 0 + primitiveTrue primitiveType = 1 + primitiveFalse primitiveType = 2 + primitiveInt8 primitiveType = 3 + primitiveInt16 primitiveType = 4 + primitiveInt32 primitiveType = 5 + primitiveInt64 primitiveType = 6 + primitiveDouble primitiveType = 7 + primitiveDecimal4 primitiveType = 8 // TODO + primitiveDecimal8 primitiveType = 9 // TODO + primitiveDecimal16 primitiveType = 10 // TODO + primitiveDate primitiveType = 11 + primitiveTimestampMicrosprimitiveType = 12 + primitiveTimestampNTZMicros primitiveType = 13 + primitiveFloat primitiveType = 14 + primitiveBinary primitiveType = 15 + primitiveString primitiveType = 16 + primitiveTimeNTZprimitiveType = 17 + primitiveTimestampNanos primitiveType = 18 + primitiveTimestampNTZNanos primitiveType = 19 + primitiveUUID primitiveType = 20 +) + +func (pt primitiveType) String() string { + switch pt { + case primitiveNull: + return "Null" + case primitiveFalse, primitiveTrue: + return "Boolean" + case primitiveInt8: + return "Int8" + case primitiveInt16: + return "Int16" + case primitiveInt32: + return "Int32" + case primitiveInt64: + return "Int64" + case primitiveDouble: + return "Double" + case primitiveDecimal4: + return "Decimal4" + case primitiveDecimal8: + return "Decimal8" + case primitiveDecimal16: + return "Decimal16" + case primitiveDate: + return "Date" + case primitiveTimestampMicros: + return "Timestamp(micros)" + case primitiveTimestampNTZMicros: + return "TimestampNTZ(micros)" + case primitiveFloat: + return "Float" + case primitiveBinary: + return "Binary" + case primitiveString: + return "String" + case primitiveTimeNTZ: + return "TimeNTZ" + case primitiveTimestampNanos: + return "Timestamp(nanos)" + case primitiveTimestampNTZNanos: + return "TimestampNTZ(nanos)" + case primitiveUUID: + return "UUID" + } + return "Invalid" +} + +func validPrimitiveValue(prim primitiveType) error { + if prim < primitiveNull || prim > primitiveUUID { + return fmt.Errorf("invalid primitive type: %d", prim) + } + return nil +} + +func primitiveFromHeader(hdr byte) (primitiveType, error) { + // Special case the basic type of Short String and call it a Primitive String. + bt := BasicTypeFromHeader(hdr) + if bt == BasicShortString { + return primitiveString, nil + } else if bt == BasicPrimitive { + prim := primitiveType(hdr >> 2) + if err := validPrimitiveValue(prim); err != nil { + return primitiveInvalid, err + } + return prim, nil + } + return primitiveInvalid, fmt.Errorf("header is not of a primitive or short string basic type: %s", bt) +} + +func primitiveHeader(prim primitiveType) (byte, error) { + if err := validPrimitiveValue(prim); err != nil { + return 0, err + } + hdr := byte(prim << 2) + hdr |= byte(BasicPrimitive) + return hdr, nil +} + +// ma
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064915823 ## parquet/variants/primitive.go: ## @@ -0,0 +1,678 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "math" + "reflect" + "strings" + "time" +) + +// Variant primitive type IDs. +type primitiveType int + +const ( + primitiveInvalidprimitiveType = -1 + primitiveNull primitiveType = 0 + primitiveTrue primitiveType = 1 + primitiveFalse primitiveType = 2 + primitiveInt8 primitiveType = 3 + primitiveInt16 primitiveType = 4 + primitiveInt32 primitiveType = 5 + primitiveInt64 primitiveType = 6 + primitiveDouble primitiveType = 7 + primitiveDecimal4 primitiveType = 8 // TODO + primitiveDecimal8 primitiveType = 9 // TODO + primitiveDecimal16 primitiveType = 10 // TODO + primitiveDate primitiveType = 11 + primitiveTimestampMicrosprimitiveType = 12 + primitiveTimestampNTZMicros primitiveType = 13 + primitiveFloat primitiveType = 14 + primitiveBinary primitiveType = 15 + primitiveString primitiveType = 16 + primitiveTimeNTZprimitiveType = 17 + primitiveTimestampNanos primitiveType = 18 + primitiveTimestampNTZNanos primitiveType = 19 + primitiveUUID primitiveType = 20 +) + +func (pt primitiveType) String() string { + switch pt { + case primitiveNull: + return "Null" + case primitiveFalse, primitiveTrue: + return "Boolean" + case primitiveInt8: + return "Int8" + case primitiveInt16: + return "Int16" + case primitiveInt32: + return "Int32" + case primitiveInt64: + return "Int64" + case primitiveDouble: + return "Double" + case primitiveDecimal4: + return "Decimal4" + case primitiveDecimal8: + return "Decimal8" + case primitiveDecimal16: + return "Decimal16" + case primitiveDate: + return "Date" + case primitiveTimestampMicros: + return "Timestamp(micros)" + case primitiveTimestampNTZMicros: + return "TimestampNTZ(micros)" + case primitiveFloat: + return "Float" + case primitiveBinary: + return "Binary" + case primitiveString: + return "String" + case primitiveTimeNTZ: + return "TimeNTZ" + case primitiveTimestampNanos: + return "Timestamp(nanos)" + case primitiveTimestampNTZNanos: + return "TimestampNTZ(nanos)" + case primitiveUUID: + return "UUID" + } + return "Invalid" +} + +func validPrimitiveValue(prim primitiveType) error { + if prim < primitiveNull || prim > primitiveUUID { + return fmt.Errorf("invalid primitive type: %d", prim) + } + return nil +} + +func primitiveFromHeader(hdr byte) (primitiveType, error) { + // Special case the basic type of Short String and call it a Primitive String. + bt := BasicTypeFromHeader(hdr) + if bt == BasicShortString { + return primitiveString, nil + } else if bt == BasicPrimitive { + prim := primitiveType(hdr >> 2) + if err := validPrimitiveValue(prim); err != nil { + return primitiveInvalid, err + } + return prim, nil + } + return primitiveInvalid, fmt.Errorf("header is not of a primitive or short string basic type: %s", bt) +} + +func primitiveHeader(prim primitiveType) (byte, error) { + if err := validPrimitiveValue(prim); err != nil { + return 0, err + } + hdr := byte(prim << 2) + hdr |= byte(BasicPrimitive) + return hdr, nil +} + +// marshalPri
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064853352 ## parquet/variants/primitive.go: ## @@ -0,0 +1,674 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "math" + "reflect" + "strings" + "time" + "unsafe" + + "github.com/google/uuid" +) + +// Variant primitive type IDs. +type primitiveType int + +const ( + primitiveInvalidprimitiveType = -1 + primitiveNull primitiveType = 0 + primitiveTrue primitiveType = 1 + primitiveFalse primitiveType = 2 + primitiveInt8 primitiveType = 3 + primitiveInt16 primitiveType = 4 + primitiveInt32 primitiveType = 5 + primitiveInt64 primitiveType = 6 + primitiveDouble primitiveType = 7 + primitiveDecimal4 primitiveType = 8 // TODO + primitiveDecimal8 primitiveType = 9 // TODO + primitiveDecimal16 primitiveType = 10 // TODO + primitiveDate primitiveType = 11 + primitiveTimestampMicrosprimitiveType = 12 + primitiveTimestampNTZMicros primitiveType = 13 + primitiveFloat primitiveType = 14 + primitiveBinary primitiveType = 15 + primitiveString primitiveType = 16 + primitiveTimeNTZprimitiveType = 17 + primitiveTimestampNanos primitiveType = 18 + primitiveTimestampNTZNanos primitiveType = 19 + primitiveUUID primitiveType = 20 +) + +func (pt primitiveType) String() string { + switch pt { + case primitiveNull: + return "Null" + case primitiveFalse, primitiveTrue: + return "Boolean" + case primitiveInt8: + return "Int8" + case primitiveInt16: + return "Int16" + case primitiveInt32: + return "Int32" + case primitiveInt64: + return "Int64" + case primitiveDouble: + return "Double" + case primitiveDecimal4: + return "Decimal4" + case primitiveDecimal8: + return "Decimal8" + case primitiveDecimal16: + return "Decimal16" + case primitiveDate: + return "Date" + case primitiveTimestampMicros: + return "Timestamp(micros)" + case primitiveTimestampNTZMicros: + return "TimestampNTZ(micros)" + case primitiveFloat: + return "Float" + case primitiveBinary: + return "Binary" + case primitiveString: + return "String" + case primitiveTimeNTZ: + return "TimeNTZ" + case primitiveTimestampNanos: + return "Timestamp(nanos)" + case primitiveTimestampNTZNanos: + return "TimestampNTZ(nanos)" + case primitiveUUID: + return "UUID" + } + return "Invalid" +} + +func validPrimitiveValue(prim primitiveType) error { + if prim < primitiveNull || prim > primitiveUUID { + return fmt.Errorf("invalid primitive type: %d", prim) + } + return nil +} + +func primitiveFromHeader(hdr byte) (primitiveType, error) { + // Special case the basic type of Short String and call it a Primitive String. + bt := BasicTypeFromHeader(hdr) + if bt == BasicShortString { + return primitiveString, nil + } else if bt == BasicPrimitive { + prim := primitiveType(hdr >> 2) + if err := validPrimitiveValue(prim); err != nil { + return primitiveInvalid, err + } + return prim, nil + } + return primitiveInvalid, fmt.Errorf("header is not of a primitive or short string basic type: %s", bt) +} + +func primitiveHeader(prim primitiveType) (byte, error) { + if err := validPrimitiveValue(prim); err != nil { + return 0, err + } + hdr := byte(prim << 2) + hdr |= byte(BasicPr
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064852563 ## parquet/variants/primitive.go: ## @@ -0,0 +1,678 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "math" + "reflect" + "strings" + "time" +) + +// Variant primitive type IDs. +type primitiveType int + +const ( + primitiveInvalidprimitiveType = -1 + primitiveNull primitiveType = 0 + primitiveTrue primitiveType = 1 + primitiveFalse primitiveType = 2 + primitiveInt8 primitiveType = 3 + primitiveInt16 primitiveType = 4 + primitiveInt32 primitiveType = 5 + primitiveInt64 primitiveType = 6 + primitiveDouble primitiveType = 7 + primitiveDecimal4 primitiveType = 8 // TODO + primitiveDecimal8 primitiveType = 9 // TODO + primitiveDecimal16 primitiveType = 10 // TODO + primitiveDate primitiveType = 11 + primitiveTimestampMicrosprimitiveType = 12 + primitiveTimestampNTZMicros primitiveType = 13 + primitiveFloat primitiveType = 14 + primitiveBinary primitiveType = 15 + primitiveString primitiveType = 16 + primitiveTimeNTZprimitiveType = 17 + primitiveTimestampNanos primitiveType = 18 + primitiveTimestampNTZNanos primitiveType = 19 + primitiveUUID primitiveType = 20 +) + +func (pt primitiveType) String() string { + switch pt { + case primitiveNull: + return "Null" + case primitiveFalse, primitiveTrue: + return "Boolean" + case primitiveInt8: + return "Int8" + case primitiveInt16: + return "Int16" + case primitiveInt32: + return "Int32" + case primitiveInt64: + return "Int64" + case primitiveDouble: + return "Double" + case primitiveDecimal4: + return "Decimal4" + case primitiveDecimal8: + return "Decimal8" + case primitiveDecimal16: + return "Decimal16" + case primitiveDate: + return "Date" + case primitiveTimestampMicros: + return "Timestamp(micros)" + case primitiveTimestampNTZMicros: + return "TimestampNTZ(micros)" + case primitiveFloat: + return "Float" + case primitiveBinary: + return "Binary" + case primitiveString: + return "String" + case primitiveTimeNTZ: + return "TimeNTZ" + case primitiveTimestampNanos: + return "Timestamp(nanos)" + case primitiveTimestampNTZNanos: + return "TimestampNTZ(nanos)" + case primitiveUUID: + return "UUID" + } + return "Invalid" +} + +func validPrimitiveValue(prim primitiveType) error { + if prim < primitiveNull || prim > primitiveUUID { + return fmt.Errorf("invalid primitive type: %d", prim) + } + return nil +} + +func primitiveFromHeader(hdr byte) (primitiveType, error) { + // Special case the basic type of Short String and call it a Primitive String. + bt := BasicTypeFromHeader(hdr) + if bt == BasicShortString { + return primitiveString, nil + } else if bt == BasicPrimitive { + prim := primitiveType(hdr >> 2) + if err := validPrimitiveValue(prim); err != nil { + return primitiveInvalid, err + } + return prim, nil + } + return primitiveInvalid, fmt.Errorf("header is not of a primitive or short string basic type: %s", bt) +} + +func primitiveHeader(prim primitiveType) (byte, error) { + if err := validPrimitiveValue(prim); err != nil { + return 0, err + } + hdr := byte(prim << 2) + hdr |= byte(BasicPrimitive) + return hdr, nil +} + +// marshalPri
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064846393 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: (Sorry, misread the Short String header, it's six bits. Removed that example) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064842943 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: Ah, I see where your confusion is coming from (the spec doesn't make it easy to find all these in one place). There are _so many places_ where widths are set, `is_large` is solely one place. In this case `is_large` indicates whether `num_elements` in the [Object Data](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2) is 1 or 4 bytes. I'm talking about `field_id` and `field_offset` sizes. From the header documentation: > `field_offset_size_minus_one` and `field_id_size_minus_one` are 2-bit values that represent the number of bytes used to encode the field offsets and field ids. In this case, `field_id` and `field_offset` can be 1-4 bytes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064842943 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: Ah, I see where your confusion is coming from (the spec doesn't make it easy to find all these in one place). There are _so many places_ where widths are set, `is_large` is solely one place. In this case `is_large` indicates whether `num_elements` in the [Object Data](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-object-basic_type2) is 1 or 4 bytes. I'm talking about `field_id` and `field_offset` sizes. From the header documentation: > `field_offset_size_minus_one` and `field_id_size_minus_one` are 2-bit values that represent the number of bytes used to encode the field offsets and field ids. In this case, `field_id` and `field_offset` can be 1-4 bytes. Perhaps a better example here is in the [header](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-header-for-short-string-basic_type1) for `Short String`: > When `basic_type` is 1, value_header is a 6-bit `short_string_header` ... The `short_string_header` value is the length of the string. So a short string's length can be represented by 1-6 bytes inclusive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064842449 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} + +// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a raw buffer at a specified offset. +func readInt(raw []byte, offset, size int) (int64, error) { + u, err := readUint(raw, offset, size) + if err != nil { + return -1, err + } + return int64(u), nil +} + +func fieldOffsetSize(maxSize int32) int { + if maxSize < 0xFF { + return 1 + } else if maxSize < 0x { + return 2 + } else if maxSize < 0xFF { + return 3 + } + return 4 +} + +// Checks that a given range is in the provided raw buffer. +func checkBounds(raw []byte, low, high int) error { + maxPos := len(raw) + if low >= maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", low, maxPos) + } + if high > maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", high, maxPos) + } + return nil +} + +// Encodes a number of a specified width in little-endian format and writes to a writer. +func encodeNumber(val int64, size int, w io.Writer) { + buf := make([]byte, size) + for i := range size { + buf[i] = byte(val) + val >>= 8 + } + w.Write(buf) Review Comment: The problem with rolling our own here is that we're going to need to manage the big/little endian logic ourselves then so that this runs properly on big-endian systems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064830536 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: Ah, now I see it. thats weird and annoying -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064826002 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: Where do you see that it could be 2 or 3 bytes? > If `is_large` is 0, 1 byte is used, and if `is_large` is 1, 4 bytes are used. According to the documentation it is **either** 1 byte or 4 bytes. Never 2 or 3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064778633 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} + +// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a raw buffer at a specified offset. +func readInt(raw []byte, offset, size int) (int64, error) { + u, err := readUint(raw, offset, size) + if err != nil { + return -1, err + } + return int64(u), nil +} + +func fieldOffsetSize(maxSize int32) int { + if maxSize < 0xFF { + return 1 + } else if maxSize < 0x { + return 2 + } else if maxSize < 0xFF { + return 3 + } + return 4 +} + +// Checks that a given range is in the provided raw buffer. +func checkBounds(raw []byte, low, high int) error { + maxPos := len(raw) + if low >= maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", low, maxPos) + } + if high > maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", high, maxPos) + } + return nil +} + +// Encodes a number of a specified width in little-endian format and writes to a writer. +func encodeNumber(val int64, size int, w io.Writer) { + buf := make([]byte, size) + for i := range size { + buf[i] = byte(val) + val >>= 8 + } + w.Write(buf) Review Comment: Similar vein to `Decode()`- this will work for the powers-of-two widths, but if we've got a width that's not (ie. 3) we'd encode with additional padding since we've got to encode to the next power of two width. This isn't the end of the world (encoding over the minimal necessary width _is_ within spec), so I have fewer objections here. I'd argue with `Decode()` rolling our own is more of a necessity due to not being in control of what comes in (eg. the Java library _can_ encode 3-byte wide numbers, so it's gotta be handled). This keeps the encode and decode logic fairly similar. Also, FWIW, I feel like the intention of the spec authors is to minimize the number of bytes used to encode things, see the existence of `Short String` even though a primitive `String` type exists. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064760350 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: The subtlety here: you need to be able to decode things of widths 1-8 _inclusive_. For example, the [field offset size](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-header-for-object-basic_type2) of an object can be 1, 2, 3, or 4 bytes. `binary.Decode` can handle the 1, 2, and 4 case, but cannot handle the 3-byte width case (see [this example](https://go.dev/play/p/5QKZ1ove5Ab) in the Go playground). For the odd-width cases I suppose we could pad the buffer with additional zeros, but that's getting gnarly just to make `binary.Decode` work IMO -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064332295 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} + +// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a raw buffer at a specified offset. +func readInt(raw []byte, offset, size int) (int64, error) { + u, err := readUint(raw, offset, size) + if err != nil { + return -1, err + } + return int64(u), nil +} + +func fieldOffsetSize(maxSize int32) int { + if maxSize < 0xFF { + return 1 + } else if maxSize < 0x { + return 2 + } else if maxSize < 0xFF { + return 3 + } + return 4 +} + +// Checks that a given range is in the provided raw buffer. +func checkBounds(raw []byte, low, high int) error { + maxPos := len(raw) + if low >= maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", low, maxPos) + } + if high > maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", high, maxPos) + } + return nil +} + +// Encodes a number of a specified width in little-endian format and writes to a writer. +func encodeNumber(val int64, size int, w io.Writer) { + buf := make([]byte, size) + for i := range size { + buf[i] = byte(val) + val >>= 8 + } + w.Write(buf) Review Comment: you can use `binary.Encode(buf, binary.LittleEndian, val)` which will handle the whole range for you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
zeroshade commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2064330741 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: Then you could probably just do something like: ```go switch size { case 1: var val uint8 _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val) return val, err case 2: var val uint16 _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val) return val, err case 4: var val uint32 _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val) return val, err case 8: var val uint64 _, err := binary.Decode(raw[offset:size], binary.LittleEndian, &val) return val, err } ``` right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2054836925 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} + +// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a raw buffer at a specified offset. +func readInt(raw []byte, offset, size int) (int64, error) { + u, err := readUint(raw, offset, size) + if err != nil { + return -1, err + } + return int64(u), nil +} + +func fieldOffsetSize(maxSize int32) int { + if maxSize < 0xFF { + return 1 + } else if maxSize < 0x { + return 2 + } else if maxSize < 0xFF { + return 3 + } + return 4 +} + +// Checks that a given range is in the provided raw buffer. +func checkBounds(raw []byte, low, high int) error { + maxPos := len(raw) + if low >= maxPos { Review Comment: Good call. Done, plus I realized that I forgot tests for this function, so I added that as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2054827281 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} + +// Reads a little-endian encoded integer (between 1 and 8 bytes wide) from a raw buffer at a specified offset. +func readInt(raw []byte, offset, size int) (int64, error) { + u, err := readUint(raw, offset, size) + if err != nil { + return -1, err + } + return int64(u), nil +} + +func fieldOffsetSize(maxSize int32) int { + if maxSize < 0xFF { + return 1 + } else if maxSize < 0x { + return 2 + } else if maxSize < 0xFF { + return 3 + } + return 4 +} + +// Checks that a given range is in the provided raw buffer. +func checkBounds(raw []byte, low, high int) error { + maxPos := len(raw) + if low >= maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", low, maxPos) + } + if high > maxPos { + return fmt.Errorf("out of bounds: trying to access position %d, max is %d", high, maxPos) + } + return nil +} + +// Encodes a number of a specified width in little-endian format and writes to a writer. +func encodeNumber(val int64, size int, w io.Writer) { + buf := make([]byte, size) + for i := range size { + buf[i] = byte(val) + val >>= 8 + } + w.Write(buf) Review Comment: Similar to the comment in `readUint()`, the `binary` package only provides Put/Append functionality for `uint{16,32,64}`, but we need it for the whole range from 1-8 bytes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on code in PR #344: URL: https://github.com/apache/arrow-go/pull/344#discussion_r2054810924 ## parquet/variants/util.go: ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package variants + +import ( + "fmt" + "io" + "reflect" + "time" +) + +// Reads a little-endian encoded uint (betwen 1 and 8 bytes wide) from a raw buffer at a specified +// offset and returns its value. If any part of the read would be out of bounds, this returns an error. +func readUint(raw []byte, offset, size int) (uint64, error) { + if size < 1 || size > 8 { + return 0, fmt.Errorf("invalid size, must be in range [1,8]: %d", size) + } + if maxPos := offset + size; maxPos > len(raw) { + return 0, fmt.Errorf("out of bounds: trying to access position %d, max position is %d", maxPos, len(raw)) + } + var ret uint64 + for i := range size { + ret |= uint64(raw[i+offset]) << (8 * i) + } + return ret, nil +} Review Comment: The unfortunate bit here: we need to handle uints of varying widths from 1-8 bytes, but the `binary` package expects things of a set width (eg. `Uint64()` wants to [read 8 bytes](https://cs.opensource.google/go/go/+/refs/tags/go1.24.2:src/encoding/binary/binary.go;l=115-119) always, so that would panic for anything smaller). So as-is, `binary` can't really handle all the different widths we need to here, which is why I rolled my own. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(parquet): add variant encoder/decoder [arrow-go]
sfc-gh-mbojanczyk commented on PR #344: URL: https://github.com/apache/arrow-go/pull/344#issuecomment-2811034930 > Made an initial pass on this and left a bunch of comments. I'll try to take another look later on Thanks so much! I'll get to addressing the comments here shortly (was out for a few days) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org