This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 6c0c3cd082 GH-46908: [Docs][Format] Add variant extension type docs
(#47456)
6c0c3cd082 is described below
commit 6c0c3cd082b52660ae28b7bce6507a9f2d9f1091
Author: Matt Topol <[email protected]>
AuthorDate: Tue Sep 16 16:49:11 2025 -0500
GH-46908: [Docs][Format] Add variant extension type docs (#47456)
### Rationale for this change
To support the addition of the Parquet Variant data type and the Iceberg
adoption of the variant type, we need a defined way to pass this data
through Arrow-compatible systems. As such, we need a specification for a
canonical Arrow extension type to represent Variant data.
### What changes are included in this PR?
Updates to the docs which define the Arrow Variant Extension type
* GitHub Issue: #46908
---------
Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: David Li <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Yan Tingwang <[email protected]>
---
docs/source/format/CanonicalExtensions.rst | 128 ++++-
.../source/format/CanonicalExtensions/Examples.rst | 555 +++++++++++++++++++++
docs/source/status.rst | 2 +
3 files changed, 684 insertions(+), 1 deletion(-)
diff --git a/docs/source/format/CanonicalExtensions.rst
b/docs/source/format/CanonicalExtensions.rst
index 22f2efb16f..8608a6388e 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -417,7 +417,133 @@ better zero-copy compatibility with various systems that
also store booleans usi
Metadata is an empty string.
-=========================
+.. _parquet_variant_extension:
+
+Parquet Variant
+===============
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is
stored as a
+`Parquet Variant
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. The canonical extension type allows systems to pass Variant
encoded data around without special handling unless
+they want to directly interact with the encoded variant data. See the Parquet
format specification for details on what the actual
+binary values look like.
+
+* Extension name: ``arrow.parquet.variant``.
+
+* The storage type of this extension is a ``Struct`` that obeys the following
rules:
+
+ * A *non-nullable* field named ``metadata`` which is of type ``Binary``,
``LargeBinary``, or ``BinaryView``.
+
+ * At least one (or both) of the following:
+
+ * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or
``BinaryView``.
+ (unshredded variants consist of just the ``metadata`` and ``value``
fields only)
+
+ * A field named ``typed_value`` which can be a
:ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``,
``ListView`` or ``Struct``
+
+ * If the ``typed_value`` field is a ``List``, ``LargeList`` or
``ListView`` its elements **must** be *non-nullable* and **must**
+ be a ``Struct`` consisting of at least one (or both) of the following:
+
+ * A field named ``value`` which is of type ``Binary``,
``LargeBinary``, or ``BinaryView``.
+
+ * A field named ``typed_value`` which follows the rules outlined above
(this allows for arbitrarily nested data).
+
+ * If the ``typed_value`` field is a ``Struct``, then its fields **must**
be *non-nullable*, representing the fields being shredded
+ from the objects, and **must** be a ``Struct`` consisting of at least
one (or both) of the following:
+
+ * A field named ``value`` which is of type ``Binary``,
``LargeBinary``, or ``BinaryView``.
+
+ * A field named ``typed_value`` which follows the rules outlined above
(this allows for arbitrarily nested data).
+
+* Extension type parameters:
+
+ This type does not have any parameters.
+
+* Description of the serialization:
+
+ Extension metadata is an empty string.
+
+.. note::
+
+ It is also *permissible* for the ``metadata`` field to be
dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
+ or run-end-encoded with a preferred (*but not required*) runs type of
``int8``.
+
+.. note::
+
+ The fields may be in any order, and thus must be accessed by **name** not
by *position*. The field names are case sensitive.
+
+.. _variant_primitive_type_mapping:
+
+Primitive Type Mappings
+-----------------------
+
++----------------------+------------------------+
+| Arrow Primitive Type | Variant Primitive Type |
++======================+========================+
+| Null | Null |
++----------------------+------------------------+
+| Boolean | Boolean (true/false) |
++----------------------+------------------------+
+| Int8 | Int8 |
++----------------------+------------------------+
+| Uint8 | Int16 |
++----------------------+------------------------+
+| Int16 | Int16 |
++----------------------+------------------------+
+| Uint16 | Int32 |
++----------------------+------------------------+
+| Int32 | Int32 |
++----------------------+------------------------+
+| Uint32 | Int64 |
++----------------------+------------------------+
+| Int64 | Int64 |
++----------------------+------------------------+
+| Float | Float |
++----------------------+------------------------+
+| Double | Double |
++----------------------+------------------------+
+| Decimal32 | decimal4 |
++----------------------+------------------------+
+| Decimal64 | decimal8 |
++----------------------+------------------------+
+| Decimal128 | decimal16 |
++----------------------+------------------------+
+| Date32 | Date |
++----------------------+------------------------+
+| Time64 | TimeNTZ |
++----------------------+------------------------+
+| Timestamp(us, UTC) | Timestamp (micro) |
++----------------------+------------------------+
+| Timestamp(us) | TimestampNTZ (micro) |
++----------------------+------------------------+
+| Timestamp(ns, UTC) | Timestamp (nano) |
++----------------------+------------------------+
+| Timestamp(ns) | TimestampNTZ (nano) |
++----------------------+------------------------+
+| Binary | Binary |
++----------------------+------------------------+
+| LargeBinary | Binary |
++----------------------+------------------------+
+| BinaryView | Binary |
++----------------------+------------------------+
+| String | String |
++----------------------+------------------------+
+| LargeString | String |
++----------------------+------------------------+
+| StringView | String |
++----------------------+------------------------+
+| UUID extension type | UUID |
++----------------------+------------------------+
+
Community Extension Types
=========================
diff --git a/docs/source/format/CanonicalExtensions/Examples.rst
b/docs/source/format/CanonicalExtensions/Examples.rst
new file mode 100644
index 0000000000..1d3e0d79d8
--- /dev/null
+++ b/docs/source/format/CanonicalExtensions/Examples.rst
@@ -0,0 +1,555 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_canonical_extension_examples:
+
+****************************
+Canonical Extension Examples
+****************************
+
+=========================
+Parquet Variant Extension
+=========================
+
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two
fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct<metadata: binary non-nullable, value: binary nullable>``
+* ``struct<value: binary nullable, metadata: binary non-nullable>``
+* ``struct<metadata: dictionary<int8, binary> non-nullable, value: binary_view
nullable>``
+
+Simple Shredding
+''''''''''''''''
+
+Suppose we have a Variant field named *measurement* and we want to shred the
``int64`` values into a separate column for efficiency.
+In Parquet, this could be represented as::
+
+ required group measurement (VARIANT) {
+ required binary metadata;
+ optional binary value;
+ optional int64 typed_value;
+ }
+
+Thus the corresponding storage type for the ``arrow.parquet.variant`` Arrow
extension type would be::
+
+ struct<
+ metadata: binary non-nullable,
+ value: binary nullable,
+ typed_value: int64 nullable
+ >
+
+If we suppose a series of measurements consisting of::
+
+ 34, null, "n/a", 100
+
+The data should be stored/represented in Arrow as::
+
+ * Length: 4, Null count: 1
+ * Validity Bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|---------------|
+ | 00001011 | 0 (padding) |
+
+ * Children arrays:
+ * field-0 array (`VarBinary`)
+ * Length: 4, Null count: 0
+ * Offsets buffer:
+
+ | Bytes 0-19 | Bytes 20-63 |
+ |------------------|--------------------------|
+ | 0, 2, 4, 6, 8 | unspecified (padding) |
+
+ * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+ | Bytes 0-7 | Bytes 8-63 |
+ |-------------------------|--------------------------|
+ | 01 00 01 00 01 00 01 00 | unspecified (padding) |
+
+ * field-1 array (`VarBinary`)
+ * Length: 4, Null count: 2
+ * Validity Bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|---------------|
+ | 00000110 | 0 (padding) |
+
+ * Offsets buffer:
+
+ | Bytes 0-19 | Bytes 20-63 |
+ |------------------|--------------------------|
+ | 0, 0, 1, 5, 5 | unspecified (padding) |
+
+ * Value buffer: (`00` -> null,
+ `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a")
+
+ | Bytes 0-4 | Bytes 5-63 |
+ |------------------------|--------------------------|
+ | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding) |
+
+ * field-2 array (int64 array)
+ * Length: 4, Null count: 2
+ * Validity Bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|---------------|
+ | 00001001 | 0 (padding) |
+
+ * Value buffer:
+
+ | Bytes 0-31 | Bytes 32-63 |
+ |---------------------|--------------------------|
+ | 34, 00, 00, 100 | unspecified (padding) |
+
+.. note::
+
+ Notice that there is a variant ``literal null`` in the ``value`` array,
this is due to the
+ `shredding specification
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__
+ so that a consumer can tell the difference between a *missing* field and a
*null* field. A null
+ element must be encoded as a Variant null: *basic type* ``0`` (primitive)
and *physical type* ``0`` (null).
+
+Shredding an Array
+''''''''''''''''''
+
+For our next example, we will represent a shredded array of strings. Let's
consider a column that looks like: ::
+
+ ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null
+
+Representing this shredded variant in Parquet could look like::
+
+ optional group tags (VARIANT) {
+ required binary metadata;
+ optional binary value;
+ optional group typed_value (LIST) { # optional to allow null lists
+ repeated group list {
+ required group element { # shredded element
+ optional binary value;
+ optional binary typed_value (STRING);
+ }
+ }
+ }
+ }
+
+The array structure for Variant encoding does not allow missing elements, so
all elements of the array must
+be *non-nullable*. As such, either **typed_value** or **value** (*but not
both!*) must be *non-null*.
+
+The storage type to represent this in Arrow as a Variant extension type would
be::
+
+ struct<
+ metadata: binary non-nullable,
+ value: binary nullable,
+ typed_value: list<element: struct<
+ value: binary nullable,
+ typed_value: string nullable
+ > non-nullable> nullable
+ >
+
+.. note::
+
+ As usual, **Binary** could also be **LargeBinary** or **BinaryView**,
**String** could also be **LargeString** or **StringView**,
+ and **List** could also be **LargeList** or **ListView**.
+
+The data would then be stored in Arrow as follows::
+
+ * Length: 4, Null count: 1
+ * Validity Bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|---------------|
+ | 00000111 | 0 (padding) |
+
+ * Children arrays:
+ * field-0 array (`VarBinary` metadata)
+ * Length: 4, Null count: 0
+ * Offsets buffer:
+
+ | Bytes 0-19 | Bytes 20-63 |
+ |------------------|--------------------------|
+ | 0, 2, 4, 6, 8 | unspecified (padding) |
+
+ * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+ | Bytes 0-7 | Bytes 8-63 |
+ |-------------------------|--------------------------|
+ | 01 00 01 00 01 00 01 00 | unspecified (padding) |
+
+ * field-1 array (`VarBinary` value)
+ * Length: 4, Null count: 1
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|---------------|
+ | 00001000 | 0 (padding) |
+
+ * Offsets buffer:
+
+ | Bytes 0-19 | Bytes 20-63 |
+ |------------------|--------------------------|
+ | 0, 0, 0, 0, 1 | unspecified (padding) |
+
+ * Value buffer: (00 -> variant null)
+
+ | Bytes 0 | Bytes 1-63 |
+ |--------------------|--------------------------|
+ | 00 | unspecified (padding) |
+
+ * field-2 array (`List<Struct<VarBinary, String>>` typed_value)
+ * Length: 4, Null count: 1
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|-------------|
+ | 00000111 | 0 (padding) |
+
+ * Offsets buffer (int32)
+
+ | Bytes 0-19 | Bytes 20-63 |
+ |-------------------|-----------------------|
+ | 0, 2, 4, 7, 7 | unspecified (padding) |
+
+ * Values array (`Struct<VarBinary, String>` element):
+ * Length: 7, Null count: 0
+ * Validity bitmap buffer: Not required
+
+ * Children arrays:
+ * field-0 array (`VarBinary` value)
+ * Length: 7, Null count: 6
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|-------------|
+ | 00001000 | 0 (padding) |
+
+ * Offsets buffer (int32):
+
+ | Bytes 0-31 | Bytes 32-63 |
+ |---------------------------|--------------------------|
+ | 0, 0, 0, 0, 1, 1, 1, 1 | unspecified (padding) |
+
+ * Values buffer (`00` -> variant null):
+
+ | Bytes 0 | Bytes 1-63 |
+ |--------------------|--------------------------|
+ | 00 | unspecified (padding) |
+
+ * field-1 array (`String` typed_value)
+ * Length: 7, Null count: 1
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
+ |--------------------------|-------------|
+ | 01110111 | 0 (padding) |
+
+ * Offsets buffer (int32):
+
+ | Bytes 0-31 | Bytes 32-63 |
+ |---------------------------------|--------------------------|
+ | 0, 6, 11, 17, 17, 23, 28, 35 | unspecified (padding) |
+
+ * Values buffer:
+
+ | Bytes 0-35 | Bytes 36-63
|
+
|--------------------------------------|--------------------------|
+ | comedydramahorrorcomedydramaromance | unspecified (padding)
|
+
+Shredding an Object
+'''''''''''''''''''
+
+Let's consider a JSON column of "events" which contain a field named
``event_type`` (a string)
+and a field named ``event_ts`` (a timestamp) that we wish to shred into
separate columns, In Parquet,
+it could look something like this::
+
+ optional group event (VARIANT) {
+ required binary metadata;
+ optional binary value; # variant, remaining fields/values
+ optional group typed_value { # shredded fields for variant object
+ required group event_type { # event_type shredded field
+ optional binary value;
+ optional binary typed_value (STRING);
+ }
+ required group event_ts { # event_ts shredded field
+ optional binary value;
+ optional int64 typed_value (TIMESTAMP(true, MICROS))
+ }
+ }
+ }
+
+We can then translate this into the expected extension storage type::
+
+ struct<
+ metadata: binary non-nullable,
+ value: binary nullable,
+ typed_value: struct<
+ event_type: struct<
+ value: binary nullable,
+ typed_value: string nullable
+ > non-nullable,
+ event_ts: struct<
+ value: binary nullable,
+ typed_value: timestamp(us, UTC) nullable
+ > non-nullable
+ > nullable
+ >
+
+If a field *does not exist* in the variant object value, then both the
**value** and **typed_value** columns for that row
+will be null. If a field is *present*, but the value is null, then **value**
must contain a Variant null.
+
+It is *invalid* for both **value** and **typed_value** to be non-null for a
given index. A reader can choose not to error
+in this scenario, but if so it **must** use the value in the **typed_value**
column for that index.
+
+Let's consider the following series of objects::
+
+ {"event_type": "noop", "event_ts": 1729794114937}
+
+ {"event_type": "login", "event_ts": 1729794146402, "email":
"[email protected]"}
+
+ {"error_msg": "malformed..."}
+
+ "malformed: not an object"
+
+ {"event_ts": 1729794240241, "click": "_button"}
+
+ {"event_ts": null, "event_ts": 1729794954163}
+
+ {"event_type": "noop", "event_ts": "2024-10-24"}
+
+ {}
+
+ null
+
+ *Entirely missing*
+
+To represent those values as a column of Variant values using the Variant
extension type we get the following::
+
+ * Length: 10, Null count: 1
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+ |--------------------------|-----------|-----------------------|
+ | 11111111 | 00000001 | 0 (padding) |
+
+ * Children arrays
+ * field-0 array (`VarBinary` Metadata)
+ * Length: 10, Null count: 0
+ * Offsets buffer:
+
+ | Bytes 0-43 (int32) | Bytes 44-63 |
+ |------------------------------------------|-------------------------|
+ | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding) |
+
+ * Value buffer: (01 00 -> version 1 empty metadata,
+ 01 01 00 XX ... -> Version 1, metadata with 1 elem,
offset 0, offset XX == len(string), ... is dict string bytes)
+
+ | Bytes 0-1 | Bytes 2-10 | Bytes 11-23 | Bytes 24-25
| Bytes 26-34 |
+
|-------------------------------|-----------------------|-------------|-------------------|
+ | 01 00 | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00
| 01 01 00 05 click |
+
+ | Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44
| Bytes 45-63 |
+
|-------------|-------------|-------------|-------------|-------------|-----------------------|
+ | 01 00 | 01 00 | 01 00 | 01 00 | 01 00
| unspecified (padding) |
+
+ * field-1 array (`VarBinary` Value)
+ * Length: 10, Null count: 5
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+ |---------------------------|-----------|-----------------------|
+ | 00011110 | 00000001 | 0 (padding) |
+
+ * Offsets buffer (filled in based on lengths of encoded variants):
+
+ | ... |
+
+ * Value buffer:
+
+ | VariantEncode({"email": "[email protected]"}) |
VariantEncode({"error_msg": "malformed..."}) |
+ | VariantEncode("malformed: not an object") | VariantEncode({"click":
"_button"}) | 00 (null) |
+
+ * field-2 array (`Struct<...>` typed_value)
+ * Length: 10, Null count: 3
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+ |--------------------------|-----------|-----------------------|
+ | 11110111 | 00000000 | 0 (padding) |
+
+ * Children arrays:
+ * field-0 array (`Struct<VarBinary, String>` event_type)
+ * Length: 10, Null count: 0
+ * Validity bitmap buffer: not required
+
+ * Children arrays
+ * field-0 array (`VarBinary` value)
+ * Length: 10, Null count: 9
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63
|
+
|--------------------------|-----------|-----------------------|
+ | 01000000 | 00000000 | 0 (padding)
|
+
+ * Offsets buffer (int32)
+
+ | Bytes 0-43 (int32) | Bytes 44-63 |
+ |---------------------------------|-------------------------|
+ | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding) |
+
+ * Value buffer:
+
+ | Byte 0 | Bytes 1-63 |
+ |--------|------------------------|
+ | 00 | unspecified (padding) |
+
+ * field-1 array (`String` typed_value)
+ * Length: 10, Null count: 7
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63
|
+
|--------------------------|-----------|-----------------------|
+ | 01000011 | 00000000 | 0 (padding)
|
+
+ * Offsets buffer (int32)
+
+ | Byte 0-43 | Bytes 44-63
|
+
|-------------------------------------|------------------------|
+ | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding)
|
+
+ * Value buffer:
+
+ | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63 |
+ |-----------|-----------|------------|------------------------|
+ | noop | login | noop | unspecified (padding) |
+
+
+ * field-1 array (`Struct<VarBinary, Timestamp>` event_ts)
+ * Length: 10, Null count: 0
+ * Validity bitmap buffer: not required
+
+ * Children arrays
+ * field-0 array (`VarBinary` value)
+ * Length: 10, Null count: 9
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63
|
+
|--------------------------|-----------|-----------------------|
+ | 01000000 | 00000000 | 0 (padding)
|
+
+ * Offsets buffer (int32)
+
+ | Bytes 0-43 (int32) | Bytes 44-63 |
+ |---------------------------------|-------------------------|
+ | ... | unspecified (padding) |
+
+ * Value buffer:
+
+ | VariantEncode("2024-10-24") |
+
+ * field-1 array (`Timestamp(us, UTC)` typed_value)
+ * Length: 10, Null count: 6
+ * Validity bitmap buffer:
+
+ | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63
|
+
|--------------------------|-----------|-----------------------|
+ | 00110011 | 00000000 | 0 (padding)
|
+
+ * Value buffer:
+
+ | Bytes 0-7 | Bytes 8-15 | Bytes 16-31 | Bytes 32-39
| Bytes 40-47 | Bytes 48-63 |
+
|---------------|---------------|--------------|---------------|---------------|------------------------|
+ | 1729794114937 | 1729794146402 | unspecified | 1729794240241
| 1729794954163 | unspecified (padding) |
+
+
+Putting it all together
+'''''''''''''''''''''''
+
+As mentioned, the **typed_value** field associated with a Variant **value**
can be of any shredded type. As a result,
+as long as we follow the original rules we can have an arbitrary number of
nested levels based on how you want to
+shred the object. For example, we might have a few more fields alongside
**event_type** to shred out. Possibly an object
+that looks like this::
+
+ {
+ "event_type": "login",
+ "event_ts": 1729794114937,
+ "location”: {"longitude": 1.5, "latitude": 5.5},
+ "tags": ["foo", "bar", "baz"]
+ }
+
+If we shred the extra fields out and represent it as Parquet it looks like::
+
+ optional group event (VARIANT) {
+ required binary metadata;
+ optional binary value; # variant, remaining fields/values
+ optional group typed_value { # shredded fields for variant object
+ required group event_type { # event_type shredded field
+ optional binary value;
+ optional binary typed_value (STRING);
+ }
+ required group event_ts { # event_ts shredded field
+ optional binary value;
+ optional int64 typed_value (TIMESTAMP(true, MICROS))
+ }
+ required group location { # location shredded field
+ optional binary value;
+ optional group typed_value {
+ required group longitude {
+ optional binary value;
+ optional float64 typed_value;
+ }
+ required group latitude {
+ optional binary value;
+ optional float64 typed_value;
+ }
+ }
+ }
+ required group tags { # tags shredded field
+ optional binary value;
+ optional group typed_value (LIST) {
+ repeated group list {
+ required group element {
+ optional binary value;
+ optional binary typed_value (STRING);
+ }
+ }
+ }
+ }
+ }
+ }
+
+Finally, following the rules we set forth on constructing the Variant
Extension Type storage type, we end up with::
+
+ struct<
+ metadata: binary non-nullable,
+ value: binary nullable,
+ typed_value: struct<
+ event_type: struct<value: binary nullable, typed_value: string nullable>
non-nullable,
+ event_ts: struct<value: binary nullable, typed_value: timestamp(us, UTC)
nullable> non-nullable,
+ location: struct<
+ value: binary nullable,
+ typed_value: struct<
+ longitude: struct<value: binary nullable, typed_value: double
nullable> non-nullable,
+ latitude: struct<value: binary nullable, typed_value: double
nullable> non-nullable
+ > nullable> non-nullable,
+ tags: struct<
+ value: binary nullable,
+ typed_value: list<struct<value: binary nullable, typed_value: string
nullable> non-nullable> nullable
+ > non-nullable
+ > nullable
+ >
+
diff --git a/docs/source/status.rst b/docs/source/status.rst
index 0b124bbbeb..6aff1a8c12 100644
--- a/docs/source/status.rst
+++ b/docs/source/status.rst
@@ -129,6 +129,8 @@ Data Types
+-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
| 8-bit Boolean | ✓ | | ✓ | | | |
| |
+-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
+| Parquet Variant | | | ✓ | | | |
| |
++-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
Notes: