Hi all,

Thanks for raising this thread! Overall I agree that some form of variant
type would be useful in Iceberg, and we've seen interest in supporting
that as well. Specifically, there has been discussion on adding an
optimized JSON data type as part of the V3 spec
<https://github.com/apache/iceberg/issues/9066>.
I have some questions on some of the points that were brought up:

> One specific point to make here is that, since an Apache OSS version of
variant encoding already exists in Spark, it likely makes sense to simply
adopt the Spark encoding as the Iceberg standard as well. The encoding we
use internally today in Snowflake is slightly different, but essentially
equivalent, and we see no particular value in trying to clutter the space
with another equivalent-but-incompatible encoding.

1.) Isn't the variant encoding in Spark essentially a binary data type
(i.e., an array of arbitrary bytes)? Iceberg already has a binary data type
<https://iceberg.apache.org/spec/#primitive-types> defined as well. I would
think that we would want to spec out something along the lines of a JSONB
data type (how keys are organized in the structure, how null values are
handled, etc.). It's worth looking at what systems like Postgres do.
In general, I think we would want to choose an encoding scheme with good
compression characteristics, balanced against decoding/reading
performance. We also want to make engine integration as easy as possible,
so that the common JSON functions engines expose can be implemented
efficiently on top of the type.
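To make the "how are keys organized" question concrete, here is a toy
sketch of the kind of decision a JSONB-style spec has to pin down. This is
a hypothetical encoding of my own for illustration, not the Spark or
Snowflake format: keys are deduplicated into a sorted dictionary so a
reader can binary-search for a field without parsing the whole document.

```python
import json

def _walk_keys(node):
    """Yield every object key anywhere in a nested JSON value."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield k
            yield from _walk_keys(v)
    elif isinstance(node, list):
        for v in node:
            yield from _walk_keys(v)

def encode_variant(obj):
    """Split a JSON value into (metadata, value) byte strings.

    metadata: sorted, deduplicated key dictionary
    value:    the document with keys replaced by dictionary indexes
    (Hypothetical layout; a real spec would use a compact binary form.)
    """
    keys = sorted(set(_walk_keys(obj)))
    index = {k: i for i, k in enumerate(keys)}

    def swap(node):
        if isinstance(node, dict):
            return {index[k]: swap(v) for k, v in node.items()}
        if isinstance(node, list):
            return [swap(v) for v in node]
        return node

    metadata = json.dumps(keys).encode()
    value = json.dumps(swap(obj), sort_keys=True).encode()
    return metadata, value
```

The point of the sketch is that choices like "sorted key dictionary" are
exactly what determine both compression (repeated keys stored once) and
lookup cost, which is the trade-off mentioned above.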

2.) I'm less familiar with subcolumnarization, but it sounds like a
technique for pruning and efficient materialization of nested fields,
which is very interesting. That said, I'd probably try to get consensus on
the specific data types we want to add and spec those out first, along
with integration considerations (Spark, Trino, Python, etc.). I think
there's enough to unpack there that it's worth a discussion before jumping
into more complex optimizations. Certain cases may be inefficient without
this technique, but I think it's quite useful to at least start with the
data type definitions and engine integrations; we could then look at
support for this technique afterwards. I wonder what others think, though.
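For anyone else less familiar with the idea, here is a minimal sketch of
the pruning half of subcolumnarization as I understand it from the
description above. The file names, stats layout, and `$.event.ts` path are
all invented for illustration; the actual Iceberg metadata shape is what
the proposal would need to spec out.

```python
# Hypothetical per-file stats for nested paths inside a variant column.
# A file with no stats for a path cannot be proven irrelevant, so it
# must still be read.
files = [
    {"path": "f1.parquet",
     "variant_stats": {"$.event.ts": {"min": 100, "max": 199}}},
    {"path": "f2.parquet",
     "variant_stats": {"$.event.ts": {"min": 200, "max": 299}}},
    {"path": "f3.parquet",
     "variant_stats": {}},
]

def prune(files, subcolumn, lower, upper):
    """Keep files whose [min, max] range for `subcolumn` overlaps
    [lower, upper]; keep files with no stats for that path."""
    kept = []
    for f in files:
        stats = f["variant_stats"].get(subcolumn)
        if stats is None or (stats["min"] <= upper and stats["max"] >= lower):
            kept.append(f)
    return kept
```

For a predicate like `150 <= $.event.ts <= 180`, this skips `f2.parquet`
outright, which is the "read less data" benefit described in the original
mail; without tracked stats, every file containing non-null variants would
have to be read, parsed, and filtered.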

> so our plan is to write something up in that vein that covers the
proposed spec changes, backwards compatibility, implementor burdens, etc.
3.) Minor comment: I'm assuming this falls under "implementor burdens,"
but I think a proposal should also ideally cover any relevant details for
integrations with Spark, Trino, PyIceberg, etc. At the least, that would
make sure the community is aware of any integration challenges (if there
are any), so we can make progress in those areas and drive adoption of
these new data types!

Overall this is very exciting!

Thanks,

Amogh Jahagirdar

On Fri, May 10, 2024 at 11:28 PM Gang Wu <ust...@gmail.com> wrote:

> Hi,
>
> This sounds very interesting!
>
> IIUC, the current variant type in the Apache Spark stores data in the
> BINARY type. When it comes to subcolumnarization, does it require the file
> format (e.g. Apache Parquet/ORC/Avro) to support variant type natively?
>
> Best,
> Gang
>
> On Sat, May 11, 2024 at 1:07 PM Tyler Akidau
> <tyler.aki...@snowflake.com.invalid> wrote:
>
>> Hello,
>>
>> We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which
>> we’d like to get early feedback from the community. As you may know,
>> Snowflake has embraced Iceberg as its open Data Lake format. Having made
>> good progress on our own adoption of the Iceberg standard, we’re now in a
>> position where there are features not yet supported in Iceberg which we
>> think would be valuable for our users, and that we would like to discuss
>> with and help contribute to the Iceberg community.
>>
>> The first two such features we’d like to discuss are in support of
>> efficient querying of dynamically typed, semi-structured data: variant data
>> types, and subcolumnarization of variant columns. In more detail, for
>> anyone who may not already be familiar:
>>
>> 1. Variant data types
>> Variant types allow for the efficient binary encoding of dynamic
>> semi-structured data such as JSON, Avro, etc. By encoding semi-structured
>> data as a variant column, we retain the flexibility of the source data,
>> while allowing query engines to more efficiently operate on the data.
>> Snowflake has supported the variant data type on Snowflake tables for many
>> years [1]. As more and more users utilize Iceberg tables in Snowflake,
>> we’re hearing an increasing chorus of requests for variant support.
>> Additionally, other query engines such as Apache Spark have begun adding
>> variant support [2]. As such, we believe it would be beneficial to the
>> Iceberg community as a whole to standardize on the variant data type
>> encoding used across Iceberg tables.
>>
>> One specific point to make here is that, since an Apache OSS version of
>> variant encoding already exists in Spark, it likely makes sense to simply
>> adopt the Spark encoding as the Iceberg standard as well. The encoding we
>> use internally today in Snowflake is slightly different, but essentially
>> equivalent, and we see no particular value in trying to clutter the space
>> with another equivalent-but-incompatible encoding.
>>
>>
>> 2. Subcolumnarization
>> Subcolumnarization of variant columns allows query engines to efficiently
>> prune datasets when subcolumns (i.e., nested fields) within a variant
>> column are queried, and also allows optionally materializing some of the
>> nested fields as a column on their own, affording queries on these
>> subcolumns the ability to read less data and spend less CPU on extraction.
>> When subcolumnarizing, the system managing table metadata and data tracks
>> individual pruning statistics (min, max, null, etc.) for some subset of the
>> nested fields within a variant, and also manages any optional
>> materialization. Without subcolumnarization, any query which touches a
>> variant column must read, parse, extract, and filter every row for which
>> that column is non-null. Thus, by providing a standardized way of tracking
>> subcolumn metadata and data for variant columns, Iceberg can make
>> subcolumnar optimizations accessible across various catalogs and query
>> engines.
>>
>> Subcolumnarization is a non-trivial topic, so we expect any concrete
>> proposal to include not only the set of changes to Iceberg metadata that
>> allow compatible query engines to interoperate on subcolumnarization data for
>> variant columns, but also reference documentation explaining
>> subcolumnarization principles and recommended best practices.
>>
>>
>> It sounds like the recent Geo proposal [3] may be a good starting point
>> for how to approach this, so our plan is to write something up in that vein
>> that covers the proposed spec changes, backwards compatibility, implementor
>> burdens, etc. But we wanted to first reach out to the community to
>> introduce ourselves and the idea, and see if there’s any early feedback we
>> should incorporate before we spend too much time on a concrete proposal.
>>
>> Thank you!
>>
>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>> [3]
>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>
>> -Tyler, Nileema, Selcuk, Aihua
>>
>>
