[ 
https://issues.apache.org/jira/browse/HIVE-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019139#comment-18019139
 ] 

Ayush Saxena commented on HIVE-29183:
-------------------------------------

[~dkuzmenko] and I have been discussing the implementation. I have attached a 
quick doc on how I initially thought it could look, along with a link to the 
POC based on it ([GitHub Pull Request 
#6069|https://github.com/apache/hive/pull/6069]).

My initial hack was to treat Variant internally as a Struct with two fixed 
binary fields, and I handled it that way in the POC. While exploring, Denys 
found that other engines use a dedicated class implementation for the Variant 
type: Flink, Trino & Spark, I believe. 

So it would be better if we also go with that approach: rather than a Struct 
with two fields internally, we would have a BinaryVariant/Variant class that 
stores the metadata and value binaries, and we use that class.
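
As a rough sketch of that idea, a dedicated holder class could look something 
like the following. The class name, constructor, and accessors are hypothetical 
and are not taken from the POC or from Flink, Trino, or Spark; the bytes in the 
usage note are placeholders, not a valid Variant encoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Objects;

// Hypothetical holder for a Variant value. Names and shape are illustrative
// assumptions only, not the API of the POC or of any other engine.
final class BinaryVariant {
    private final byte[] metadata; // encoded Variant metadata
    private final byte[] value;    // encoded Variant value

    BinaryVariant(byte[] metadata, byte[] value) {
        // Defensive copies so later changes to the caller's arrays
        // cannot alter this instance.
        this.metadata = Objects.requireNonNull(metadata, "metadata").clone();
        this.value = Objects.requireNonNull(value, "value").clone();
    }

    // Read-only views, so callers cannot modify the stored bytes.
    ByteBuffer metadata() {
        return ByteBuffer.wrap(metadata).asReadOnlyBuffer();
    }

    ByteBuffer value() {
        return ByteBuffer.wrap(value).asReadOnlyBuffer();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof BinaryVariant)) {
            return false;
        }
        BinaryVariant other = (BinaryVariant) o;
        return Arrays.equals(metadata, other.metadata)
            && Arrays.equals(value, other.value);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(metadata) + Arrays.hashCode(value);
    }
}
```

For example, new BinaryVariant(new byte[] {1}, new byte[] {12, 42}) wraps two 
placeholder byte arrays, and two instances built from the same bytes compare 
equal. Compared with a Struct of two binary fields, a class like this gives the 
type a single place for validation and accessors.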

The POC PR doesn't have that yet; we plan to extend it. The rest demonstrates 
the basic functionality we expect in Phase 1.

A sample output can be seen in the PR's q.out file:

[https://github.com/apache/hive/pull/6069/files#diff-4b0500edafc601566ef37aa94c5bb704ed30a00f0461d26373c59501c4972b40]

 

Open to more suggestions/recommendations from other folks :) 

> Integrating Variant Type into Hive
> ----------------------------------
>
>                 Key: HIVE-29183
>                 URL: https://issues.apache.org/jira/browse/HIVE-29183
>             Project: Hive
>          Issue Type: New Feature
>          Components: Hive, Iceberg integration, SQL
>            Reporter: Denys Kuzmenko
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Iceberg Variant DataType in Hive.pdf
>
>
> A variant is a value that stores semi-structured data. The structure and data 
> types in a variant are not necessarily consistent across rows in a table or 
> data file. The variant type and binary encoding are defined in the Parquet 
> project, with support currently available for V1. Support for Variant is 
> added in Iceberg v3.
> Variants are similar to JSON with a wider set of primitive values including 
> date, timestamp, timestamptz, binary, and decimals.
> Variant values may contain nested types:
> * An array is an ordered collection of variant values.
> * An object is a collection of fields that are a string key and a variant 
> value.
> As a semi-structured type, there are important differences between variant 
> and Iceberg's other types:
> * Variant arrays are similar to lists, but may contain any variant value 
> rather than a fixed element type.
> * Variant objects are similar to structs, but may contain variable fields 
> identified by name and field values may be any variant value rather than a 
> fixed field type.
> Variant data types allow for the efficient binary encoding of dynamic 
> semi-structured data such as JSON, Avro, Parquet, etc. By encoding 
> semi-structured data as a variant column, we retain the flexibility of the 
> source data, while allowing query engines to more efficiently operate on the 
> data.
> With support for the Variant type, such data can be encoded in an efficient 
> binary representation internally for better performance. Without it, we 
> have to parse the data from its original format, which is inefficient.
> This will allow the following use cases:
> * Create an Iceberg table with a Variant column
> CREATE TABLE IF NOT EXISTS car_sales(record Variant);
> * Insert semi-structured data into the Variant column
> INSERT INTO car_sales SELECT PARSE_JSON(<json_string>)
> * Query against the semi-structured data
> SELECT VARIANT_GET(record, '$.dealer.ship', 'string') FROM car_sales
> Variant Binary Encoding
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> Iceberg's Variant type proposal:
> https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
