Feels like the encoding should eventually land in Parquet proper, right? I'm fine with us just copying it into Iceberg for the time being, though.
On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:

> Oops, it looks like I missed where Aihua brought this up in his last email:
>
> > do we have an issue to directly use Spark implementation in Iceberg?
>
> Yes, I think that we do have an issue using the Spark library. What do you
> think about a Java implementation in Iceberg?
>
> Ryan
>
> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>
>> I raised the same point from Peter's email in a comment on the doc as
>> well. There is a spark-variant_2.13 artifact that would be a much smaller
>> scope than relying on large portions of Spark, but even then I doubt that
>> it is a good idea for Iceberg to depend on it, because it is a Scala
>> artifact and we would need to bring in a ton of Scala libs. I think what
>> makes the most sense is to have an independent implementation of the spec
>> in Iceberg.
>>
>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Aihua,
>>> Long time no see :)
>>> Would this mean that every engine which plans to support the Variant
>>> data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>> Thanks, Peter
>>>
>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> Yeah. That's another reason we want to pursue the Spark encoding: to
>>>> keep compatibility for the open source engines.
>>>>
>>>> One more question regarding the encoding implementation: do we have an
>>>> issue to directly use the Spark implementation in Iceberg? Russell
>>>> pointed out that Trino doesn't have a Spark dependency and that could
>>>> be a problem?
>>>>
>>>> Thanks,
>>>> Aihua
>>>>
>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>> > Thanks, Aihua!
>>>> >
>>>> > I think that the encoding choice in the current doc is a good one.
>>>> > I went through the Spark encoding in detail and it looks like a
>>>> > better choice than the other candidate encodings for quickly
>>>> > accessing nested fields.
>>>> >
>>>> > Another reason to use the Spark type is that this is what Delta's
>>>> > variant type is based on, so Parquet files in tables written by Delta
>>>> > could be converted or used in Iceberg tables without needing to
>>>> > rewrite variant data. (Also, note that I work at Databricks and have
>>>> > an interest in increasing format compatibility.)
>>>> >
>>>> > Ryan
>>>> >
>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu
>>>> > <aihua...@snowflake.com.invalid> wrote:
>>>> >
>>>> > > [Discuss] Consensus for Variant Encoding
>>>> > >
>>>> > > It was great to be able to present the Variant type proposal at the
>>>> > > community sync yesterday, and I'm looking to host a meeting next
>>>> > > week (targeting 9am, July 17th) to go over any further concerns
>>>> > > about the encoding of the Variant type and any other questions on
>>>> > > the first phase of the proposal
>>>> > > <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>.
>>>> > > We are hoping that anyone who is interested in the proposal can
>>>> > > either join or reply with their comments so we can discuss them. A
>>>> > > summary of the discussion and notes will be sent to the mailing
>>>> > > list for further comment there.
>>>> > >
>>>> > > - What should be the underlying binary representation?
>>>> > >
>>>> > > We have evaluated a few encodings in the doc, including ION, JSONB,
>>>> > > and the Spark encoding. Choosing the underlying encoding is an
>>>> > > important first step here, and we believe we have general support
>>>> > > for Spark's Variant encoding. We would like to hear if anyone else
>>>> > > has strong opinions in this space.
>>>> > >
>>>> > > - Should we support multiple logical types or just Variant?
>>>> > > Variant vs. Variant + JSON.
>>>> > >
>>>> > > This is to discuss which logical data type(s) should be supported
>>>> > > in Iceberg: Variant only vs. Variant + JSON. Both types would share
>>>> > > the same underlying encoding but would imply different limitations
>>>> > > on engines working with those types.
>>>> > >
>>>> > > From the sync meeting, we lean toward supporting Variant only, and
>>>> > > we want to reach consensus on the supported type(s).
>>>> > >
>>>> > > - How should we move forward with subcolumnarization?
>>>> > >
>>>> > > Subcolumnarization is an optimization for the Variant type that
>>>> > > separates out subcolumns with their own metadata. This is not
>>>> > > critical for choosing the initial encoding of the Variant type, so
>>>> > > we were hoping to gain consensus on leaving that for a follow-up
>>>> > > spec.
>>>> > >
>>>> > > Thanks,
>>>> > > Aihua
>>>> > >
>>>> > > Meeting invite:
>>>> > >
>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>> > > Time zone: America/Los_Angeles
>>>> > > Google Meet joining info
>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> > >
>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hello,
>>>> > >>
>>>> > >> We have drafted the proposal
>>>> > >> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
>>>> > >> for the Variant data type. Please help review and comment.
>>>> > >>
>>>> > >> Thanks,
>>>> > >> Aihua
>>>> > >>
>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com>
>>>> > >> wrote:
>>>> > >>
>>>> > >>> +10000 for a JSON/BSON type.
>>>> > >>> We also had the same discussion internally, and a JSON type
>>>> > >>> would really play well with, for example, the SUPER type in
>>>> > >>> Redshift:
>>>> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
>>>> > >>> and can also provide better integration with the Trino JSON type.
>>>> > >>>
>>>> > >>> Looking forward to the proposal!
>>>> > >>>
>>>> > >>> Best,
>>>> > >>> Jack Ye
>>>> > >>>
>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>
>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com>
>>>> > >>>> wrote:
>>>> > >>>>
>>>> > >>>>> > We may need some guidance on just how many we need to look at;
>>>> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
>>>> > >>>>> > further down the rabbit hole we needed to go.
>>>> > >>>>>
>>>> > >>>>> There are some engines living outside the Java world. It would
>>>> > >>>>> be good if the proposal could cover the effort it takes to
>>>> > >>>>> integrate the variant type into them (e.g. Velox, DataFusion,
>>>> > >>>>> etc.). This is something that some proprietary Iceberg vendors
>>>> > >>>>> also care about.
>>>> > >>>>>
>>>> > >>>> Ack, makes sense. We can make sure to share some perspective on
>>>> > >>>> this.
>>>> > >>>>
>>>> > >>>>> > Not necessarily, no. As long as there's a binary type and
>>>> > >>>>> > Iceberg and the query engines are aware that the binary column
>>>> > >>>>> > needs to be interpreted as a variant, that should be
>>>> > >>>>> > sufficient.
>>>> > >>>>>
>>>> > >>>>> From the perspective of interoperability, it would be good to
>>>> > >>>>> support a native type in the file specs. Life will be easier
>>>> > >>>>> for projects like Apache XTable. The file format could also
>>>> > >>>>> provide finer-grained statistics for the variant type, which
>>>> > >>>>> facilitates data skipping.
>>>> > >>>>>
>>>> > >>>> Agreed, there can definitely be additional value in native file
>>>> > >>>> format integration. Just wanted to highlight that it's not a
>>>> > >>>> strict requirement.
>>>> > >>>>
>>>> > >>>> -Tyler
>>>> > >>>>
>>>> > >>>>> Gang
>>>> > >>>>>
>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>>>
>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>> > >>>>>>
>>>> > >>>>>> -Tyler
>>>> > >>>>>>
>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré
>>>> > >>>>>> <j...@nanthrax.net> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> Hi Tyler,
>>>> > >>>>>>>
>>>> > >>>>>>> Super happy to see you there :) It reminds me of our
>>>> > >>>>>>> discussions back at the start of Apache Beam :)
>>>> > >>>>>>>
>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some
>>>> > >>>>>>> discussions about a JSON data type for spec v3. The binary
>>>> > >>>>>>> data type is already supported in spec v2.
>>>> > >>>>>>>
>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>> > >>>>>>>
>>>> > >>>>>>> Regards,
>>>> > >>>>>>> JB
>>>> > >>>>>>>
>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>>>>> >
>>>> > >>>>>>> > Hello,
>>>> > >>>>>>> >
>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal
>>>> > >>>>>>> > for which we'd like to get early feedback from the
>>>> > >>>>>>> > community. As you may know, Snowflake has embraced Iceberg
>>>> > >>>>>>> > as its open Data Lake format.
>>>> > >>>>>>> > Having made good progress on our own adoption of the
>>>> > >>>>>>> > Iceberg standard, we're now in a position where there are
>>>> > >>>>>>> > features not yet supported in Iceberg which we think would
>>>> > >>>>>>> > be valuable for our users, and that we would like to discuss
>>>> > >>>>>>> > with and help contribute to the Iceberg community.
>>>> > >>>>>>> >
>>>> > >>>>>>> > The first two such features we'd like to discuss are in
>>>> > >>>>>>> > support of efficient querying of dynamically typed,
>>>> > >>>>>>> > semi-structured data: variant data types, and
>>>> > >>>>>>> > subcolumnarization of variant columns. In more detail, for
>>>> > >>>>>>> > anyone who may not already be familiar:
>>>> > >>>>>>> >
>>>> > >>>>>>> > 1. Variant data types
>>>> > >>>>>>> > Variant types allow for the efficient binary encoding of
>>>> > >>>>>>> > dynamic semi-structured data such as JSON, Avro, etc. By
>>>> > >>>>>>> > encoding semi-structured data as a variant column, we retain
>>>> > >>>>>>> > the flexibility of the source data while allowing query
>>>> > >>>>>>> > engines to operate on the data more efficiently. Snowflake
>>>> > >>>>>>> > has supported the variant data type on Snowflake tables for
>>>> > >>>>>>> > many years [1]. As more and more users utilize Iceberg
>>>> > >>>>>>> > tables in Snowflake, we're hearing an increasing chorus of
>>>> > >>>>>>> > requests for variant support. Additionally, other query
>>>> > >>>>>>> > engines such as Apache Spark have begun adding variant
>>>> > >>>>>>> > support [2]. As such, we believe it would be beneficial to
>>>> > >>>>>>> > the Iceberg community as a whole to standardize on the
>>>> > >>>>>>> > variant data type encoding used across Iceberg tables.
>>>> > >>>>>>> >
>>>> > >>>>>>> > One specific point to make here is that, since an Apache
>>>> > >>>>>>> > OSS version of variant encoding already exists in Spark, it
>>>> > >>>>>>> > likely makes sense to simply adopt the Spark encoding as the
>>>> > >>>>>>> > Iceberg standard as well. The encoding we use internally
>>>> > >>>>>>> > today in Snowflake is slightly different, but essentially
>>>> > >>>>>>> > equivalent, and we see no particular value in trying to
>>>> > >>>>>>> > clutter the space with another equivalent-but-incompatible
>>>> > >>>>>>> > encoding.
>>>> > >>>>>>> >
>>>> > >>>>>>> > 2. Subcolumnarization
>>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines
>>>> > >>>>>>> > to efficiently prune datasets when subcolumns (i.e., nested
>>>> > >>>>>>> > fields) within a variant column are queried, and also
>>>> > >>>>>>> > allows optionally materializing some of the nested fields
>>>> > >>>>>>> > as columns of their own, affording queries on these
>>>> > >>>>>>> > subcolumns the ability to read less data and spend less CPU
>>>> > >>>>>>> > on extraction. When subcolumnarizing, the system managing
>>>> > >>>>>>> > table metadata and data tracks individual pruning
>>>> > >>>>>>> > statistics (min, max, null, etc.) for some subset of the
>>>> > >>>>>>> > nested fields within a variant, and also manages any
>>>> > >>>>>>> > optional materialization. Without subcolumnarization, any
>>>> > >>>>>>> > query which touches a variant column must read, parse,
>>>> > >>>>>>> > extract, and filter every row for which that column is
>>>> > >>>>>>> > non-null. Thus, by providing a standardized way of tracking
>>>> > >>>>>>> > subcolumn metadata and data for variant columns, Iceberg
>>>> > >>>>>>> > can make subcolumnar optimizations accessible across
>>>> > >>>>>>> > various catalogs and query engines.
>>>> > >>>>>>> >
>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
>>>> > >>>>>>> > concrete proposal to include not only the set of changes to
>>>> > >>>>>>> > Iceberg metadata that allow compatible query engines to
>>>> > >>>>>>> > interoperate on subcolumnarization data for variant
>>>> > >>>>>>> > columns, but also reference documentation explaining
>>>> > >>>>>>> > subcolumnarization principles and recommended best
>>>> > >>>>>>> > practices.
>>>> > >>>>>>> >
>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good
>>>> > >>>>>>> > starting point for how to approach this, so our plan is to
>>>> > >>>>>>> > write something up in that vein that covers the proposed
>>>> > >>>>>>> > spec changes, backwards compatibility, implementor burdens,
>>>> > >>>>>>> > etc. But we wanted to first reach out to the community to
>>>> > >>>>>>> > introduce ourselves and the idea, and see if there's any
>>>> > >>>>>>> > early feedback we should incorporate before we spend too
>>>> > >>>>>>> > much time on a concrete proposal.
>>>> > >>>>>>> >
>>>> > >>>>>>> > Thank you!
>>>> > >>>>>>> >
>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> > >>>>>>> >
>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Databricks
>>
>> --
>> Ryan Blue
>> Databricks
>
> --
> Ryan Blue
> Databricks
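[Editor's note, not part of the thread] For readers who haven't opened the Spark variant README linked as [2] above: the "quickly accessing nested fields" property discussed in the thread comes from the value's self-describing binary header. The sketch below is an illustrative reading of that encoding, not the official implementation; the function name and labels are our own, and only the first few primitive type ids are shown.

```python
# Illustrative sketch of the Spark Variant value header, based on the
# encoding described in the Spark common/variant README ([2] above).
# Not the official implementation; names and labels here are our own.

def classify_variant_value(value: bytes) -> str:
    """Label a Variant value by the basic type encoded in its first byte."""
    header = value[0]
    basic_type = header & 0x03  # low 2 bits: 0=primitive, 1=short string, 2=object, 3=array
    type_info = header >> 2     # remaining 6 bits: meaning depends on basic_type

    if basic_type == 0:
        # Primitive: type_info selects the concrete type; only the first
        # few ids are shown here (see the README for the full table).
        return {0: "null", 1: "true", 2: "false"}.get(
            type_info, f"primitive(id={type_info})")
    if basic_type == 1:
        # Short string: type_info is the byte length; UTF-8 bytes follow.
        return f"short string ({type_info} bytes): " \
               f"{value[1:1 + type_info].decode('utf-8')!r}"
    return "object" if basic_type == 2 else "array"
```

For example, a value whose first byte is 0x00 decodes as a primitive null, and 0x09 followed by two UTF-8 bytes is a 2-byte short string. Because objects and arrays carry field offsets rather than text, an engine can jump to a nested field without parsing the whole value, which is why the thread favors this encoding over text-based alternatives.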