[jira] [Updated] (HIVE-29183) Integrating Variant Type into Hive

Denys Kuzmenko (Jira) Tue, 09 Sep 2025 01:50:04 -0700


     [ https://issues.apache.org/jira/browse/HIVE-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Denys Kuzmenko updated HIVE-29183:
----------------------------------
    Description: 
A variant is a value that stores semi-structured data. The structure and data 
types in a variant are not necessarily consistent across rows in a table or 
data file. The variant type and binary encoding are defined in the Parquet 
project, with support currently available for V1. Support for Variant is added 
in Iceberg v3.

Variants are similar to JSON, but support a wider set of primitive values, 
including date, timestamp, timestamptz, binary, and decimal.

Variant values may contain nested types:

* An array is an ordered collection of variant values.
* An object is a collection of fields, each consisting of a string key and a 
variant value.

As a semi-structured type, variant differs from Iceberg's other types in 
important ways:

* Variant arrays are similar to lists, but may contain any variant value rather 
than a fixed element type.
* Variant objects are similar to structs, but may contain a variable set of 
fields identified by name, and each field value may be any variant value rather 
than a fixed field type.
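
To make the nesting rules above concrete, here is a minimal Python sketch of a 
check for "variant-like" values. It is only a conceptual model: the function 
name is_variant_like, the chosen primitive set, and the sample rows are 
assumptions made for illustration, and it says nothing about the actual binary 
encoding defined by the Parquet project.

```python
from datetime import date, datetime
from decimal import Decimal

# Primitive kinds used for this illustration only; the Parquet Variant
# spec defines the authoritative list and its binary layout.
PRIMITIVES = (type(None), bool, int, float, str, bytes, Decimal, date, datetime)


def is_variant_like(value):
    """Return True if value fits the informal variant model described above."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, list):
        # Array: ordered, and elements may be of different kinds.
        return all(is_variant_like(v) for v in value)
    if isinstance(value, dict):
        # Object: string keys, each value is itself a variant.
        return all(isinstance(k, str) and is_variant_like(v)
                   for k, v in value.items())
    return False


# Two hypothetical rows with different shapes; a single struct type
# with fixed fields could not describe both.
row1 = {"dealer": {"ship": "by-truck"}, "tags": ["suv", 4, None]}
row2 = {"price": Decimal("19999.99"), "sold_on": date(2025, 9, 9)}
```

Both rows pass the check even though they share no fields, which is the 
flexibility a fixed list or struct type does not offer.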

Variant data types allow for the efficient binary encoding of dynamic 
semi-structured data such as JSON, Avro, Parquet, etc. By encoding 
semi-structured data as a variant column, we retain the flexibility of the 
source data, while allowing query engines to more efficiently operate on the 
data.

With Variant type support, such data can be stored internally in an efficient 
binary representation for better performance. Without it, the data must be 
parsed from its source format at query time, which is inefficient.

This will allow the following use cases:

* Create an Iceberg table with a Variant column:
    CREATE TABLE car_sales(record Variant);
* Insert semi-structured data into the Variant column:
    INSERT INTO car_sales SELECT PARSE_JSON(<json_string>);
* Query the semi-structured data:
    SELECT VARIANT_GET(record, '$.dealer.ship', 'string') FROM car_sales;
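
As a rough illustration of what a path lookup such as '$.dealer.ship' is 
expected to do, the following Python sketch walks a dotted path through parsed 
JSON. It is not Hive's implementation: the helper variant_get, its handling of 
missing fields (returning None), the 'string' cast behaviour, and the sample 
record are all assumptions, and it handles only dotted field names, not array 
indexes.

```python
import json


def variant_get(value, path, as_type=None):
    """Walk a '$.a.b' style path through parsed JSON; return None if missing."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = value
    for key in [p for p in path[1:].split(".") if p]:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    if as_type == "string" and current is not None:
        # Strings are returned as-is; other values are rendered as JSON text.
        return current if isinstance(current, str) else json.dumps(current)
    return current


# Hypothetical record, standing in for PARSE_JSON(<json_string>).
record = json.loads('{"dealer": {"ship": "by-truck", "id": 7}}')
print(variant_get(record, "$.dealer.ship", "string"))  # prints: by-truck
```

This sketch walks a structure that has already been parsed; a real variant 
column would store the binary encoding so that the source text does not need to 
be parsed again for each query.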

Iceberg's Variant type proposal:
https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8

  was:
A variant is a value that stores semi-structured data. The structure and data 
types in a variant are not necessarily consistent across rows in a table or 
data file. The variant type and binary encoding are defined in the Parquet 
project, with support currently available for V1. Support for Variant is added 
in Iceberg v3.

Variants are similar to JSON with a wider set of primitive values including 
date, timestamp, timestamptz, binary, and decimals.

Variant values may contain nested types:

* An array is an ordered collection of variant values.
* An object is a collection of fields that are a string key and a variant value.
As a semi-structured type, there are important differences between variant and 
Iceberg's other types:

* Variant arrays are similar to lists, but may contain any variant value rather 
than a fixed element type.
* Variant objects are similar to structs, but may contain variable fields 
identified by name and field values may be any variant value rather than a 
fixed field type.

Variant data types allow for the efficient binary encoding of dynamic 
semi-structured data such as JSON, Avro, Parquet, etc. By encoding 
semi-structured data as a variant column, we retain the flexibility of the 
source data, while allowing query engines to more efficiently operate on the 
data.

With the support of Variant type, such data can be encoded in an efficient 
binary representation internally for better performance. Without that, we need 
to parse the data in its format inefficiently.

This will allow the following use cases:

* Create an Iceberg table with a Variant column
CREATE TABLE car_sales(record Variant);
* Insert semi-structured data into the Variant column
INSERT INTO car_sales SELECT PARSE_JSON(<json_string>)
* Query against the semi-structured data
SELECT VARIANT_GET(record, '$.dealer.ship', 'string') FROM car_sales

Variant type Iceberg's proposal
https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8


> Integrating Variant Type into Hive
> ----------------------------------
>
>                 Key: HIVE-29183
>                 URL: https://issues.apache.org/jira/browse/HIVE-29183
>             Project: Hive
>          Issue Type: New Feature
>          Components: Hive, Iceberg integration, SQL
>            Reporter: Denys Kuzmenko
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
