[
https://issues.apache.org/jira/browse/HIVE-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019139#comment-18019139
]
Ayush Saxena commented on HIVE-29183:
-------------------------------------
[~dkuzmenko] and I have been discussing the implementation. I have attached a
quick doc describing how I initially thought it could look, along with a link
to the POC for it ([GitHub Pull Request
#6069|https://github.com/apache/hive/pull/6069]).
My initial hack was to treat Variant internally as a Struct with two fixed
binary fields, and that is how I handled it in the POC. While exploring, Denys
found that other engines use a dedicated class implementation for the Variant
type: Flink, Trino & Spark, I believe.
So it would be better for us to take that approach as well: rather than a
Struct with two fields internally, we would have a BinaryVariant/Variant class
that stores the metadata and value binaries, and we would use that class.
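A minimal sketch of what such a class could look like, purely for illustration.
The class name BinaryVariant, its accessors, and the defensive copying are
assumptions; this is not taken from the POC or from any other engine's
implementation:

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical sketch: name and methods are illustrative, not Hive's actual API.
final class BinaryVariant {
    // Encoded variant metadata (e.g. the dictionary of field names).
    private final byte[] metadata;
    // Encoded variant value.
    private final byte[] value;

    BinaryVariant(byte[] metadata, byte[] value) {
        // Copy the inputs so later changes by the caller cannot affect this object.
        this.metadata = Objects.requireNonNull(metadata, "metadata").clone();
        this.value = Objects.requireNonNull(value, "value").clone();
    }

    byte[] getMetadata() {
        return metadata.clone();
    }

    byte[] getValue() {
        return value.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof BinaryVariant)) {
            return false;
        }
        BinaryVariant other = (BinaryVariant) o;
        return Arrays.equals(metadata, other.metadata)
            && Arrays.equals(value, other.value);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(metadata) + Arrays.hashCode(value);
    }
}
```

Keeping the two buffers behind one type, instead of exposing a two-field
Struct, would let readers, writers and UDFs pass a Variant around as a single
value.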
The POC PR doesn't have that yet; we plan to extend it. The rest of the PR
demonstrates the basic functionality we expect in Phase 1.
A sample output can be seen in the PR's q.out file:
[https://github.com/apache/hive/pull/6069/files#diff-4b0500edafc601566ef37aa94c5bb704ed30a00f0461d26373c59501c4972b40]
Open to more suggestions/recommendations from other folks :)
> Integrating Variant Type into Hive
> ----------------------------------
>
> Key: HIVE-29183
> URL: https://issues.apache.org/jira/browse/HIVE-29183
> Project: Hive
> Issue Type: New Feature
> Components: Hive, Iceberg integration, SQL
> Reporter: Denys Kuzmenko
> Priority: Major
> Labels: pull-request-available
> Attachments: Iceberg Variant DataType in Hive.pdf
>
>
> A variant is a value that stores semi-structured data. The structure and data
> types in a variant are not necessarily consistent across rows in a table or
> data file. The variant type and binary encoding are defined in the Parquet
> project, with support currently available for V1. Support for Variant is
> added in Iceberg v3.
> Variants are similar to JSON with a wider set of primitive values including
> date, timestamp, timestamptz, binary, and decimals.
> Variant values may contain nested types:
> * An array is an ordered collection of variant values.
> * An object is a collection of fields, each consisting of a string key and a
> variant value.
> As a semi-structured type, there are important differences between variant
> and Iceberg's other types:
> * Variant arrays are similar to lists, but may contain any variant value
> rather than a fixed element type.
> * Variant objects are similar to structs, but may contain variable fields
> identified by name and field values may be any variant value rather than a
> fixed field type.
> Variant data types allow for the efficient binary encoding of dynamic
> semi-structured data such as JSON, Avro, Parquet, etc. By encoding
> semi-structured data as a variant column, we retain the flexibility of the
> source data, while allowing query engines to more efficiently operate on the
> data.
> With support for the Variant type, such data can be encoded internally in an
> efficient binary representation for better performance. Without it, the data
> has to be parsed in its original format, which is inefficient.
> This will allow the following use cases:
> * Create an Iceberg table with a Variant column
> CREATE TABLE IF NOT EXISTS car_sales(record Variant);
> * Insert semi-structured data into the Variant column
> INSERT INTO car_sales SELECT PARSE_JSON(<json_string>)
> * Query against the semi-structured data
> SELECT VARIANT_GET(record, '$.dealer.ship', 'string') FROM car_sales
> Variant Binary Encoding:
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> Iceberg's Variant type proposal:
> https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8
--
This message was sent by Atlassian Jira
(v8.20.10#820010)