[
https://issues.apache.org/jira/browse/HIVE-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko updated HIVE-29183:
----------------------------------
Summary: Add basic Variant Type support in Hive (was: Integrating Variant
Type into Hive)
> Add basic Variant Type support in Hive
> --------------------------------------
>
> Key: HIVE-29183
> URL: https://issues.apache.org/jira/browse/HIVE-29183
> Project: Hive
> Issue Type: New Feature
> Components: Hive, Iceberg integration, SQL
> Reporter: Denys Kuzmenko
> Priority: Major
> Labels: pull-request-available
> Attachments: Iceberg Variant DataType in Hive.pdf
>
>
> A variant is a value that stores semi-structured data. The structure and data
> types in a variant are not necessarily consistent across rows in a table or
> data file. The variant type and binary encoding are defined in the Parquet
> project, with support currently available for V1. Iceberg added Variant
> support in v3.
> Variants are similar to JSON, but support a wider set of primitive types,
> including date, timestamp, timestamptz, binary, and decimal.
> Variant values may contain nested types:
> * An array is an ordered collection of variant values.
> * An object is a collection of fields, each consisting of a string key and a
> variant value.
> As a semi-structured type, there are important differences between variant
> and Iceberg's other types:
> * Variant arrays are similar to lists, but may contain any variant value
> rather than a fixed element type.
> * Variant objects are similar to structs, but their set of fields may vary
> from value to value; fields are identified by name, and each field value may
> be any variant value rather than a fixed field type.
> Variant data types allow for the efficient binary encoding of dynamic
> semi-structured data such as JSON, Avro, Parquet, etc. By encoding
> semi-structured data as a variant column, we retain the flexibility of the
> source data, while allowing query engines to more efficiently operate on the
> data.
> With Variant type support, such data can be stored internally in an efficient
> binary representation for better performance. Without it, we have to parse
> the data from its original format, which is inefficient.
> This will enable the following use cases:
> * Create an Iceberg table with a Variant column:
>   CREATE TABLE IF NOT EXISTS car_sales(record Variant);
> * Insert semi-structured data into the Variant column:
>   INSERT INTO car_sales SELECT PARSE_JSON(<json_string>);
> * Query the semi-structured data:
>   SELECT VARIANT_GET(record, '$.dealer.ship', 'string') FROM car_sales;
> Variant Binary Encoding:
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> Iceberg's Variant type proposal:
> https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8
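
For readers new to the data model quoted above, here is a rough Python sketch of the idea, not of Hive's or Iceberg's actual implementation. It treats a variant as a primitive, a list of variants (array), or a dict with string keys (object), and walks a path such as '$.dealer.ship'. The function name variant_get, the use of json.loads as a stand-in for PARSE_JSON, and the sample rows are all assumptions made up for illustration; the real type uses the Parquet binary encoding and also supports primitives that JSON lacks.

```python
import json
import re

# Hypothetical helper, loosely modelled on the VARIANT_GET call in the issue.
# It ignores the third argument (the target type) and the binary encoding.
_PATH = re.compile(r"\$(?:\.[A-Za-z_]\w*|\[\d+\])*")
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]")


def variant_get(value, path):
    """Follow a '$.field[0]' style path; return None if a step is missing."""
    if not _PATH.fullmatch(path):
        raise ValueError(f"unsupported path: {path!r}")
    current = value
    for field, index in _STEP.findall(path):
        if field:
            # Object step: fields are looked up by name, not by position.
            if not isinstance(current, dict) or field not in current:
                return None
            current = current[field]
        else:
            # Array step: elements may be any variant value.
            i = int(index)
            if not isinstance(current, list) or i >= len(current):
                return None
            current = current[i]
    return current


# Made-up rows: the shape differs from row to row, which a fixed struct
# or list type could not express.
rows = [
    json.loads('{"dealer": {"ship": "Seattle"}, "price": 21000}'),
    json.loads('{"dealer": {"ship": "Austin"}, "tags": ["used", 2019]}'),
    json.loads('[1, "two", {"three": 3.0}]'),
]

ships = [variant_get(row, "$.dealer.ship") for row in rows]
```

In this sketch a missing path yields None; the issue does not say how Hive's VARIANT_GET behaves in that case, so that choice is only a placeholder.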