Similarly, I'm aligned with point 1, and I'd choose to support only Variant for point 3.
We'll need to work with the Spark community to find a good place for the library and spec, since it touches many different projects. I'd also prefer Iceberg as the home. I also think it's a good idea to get subcolumnarization into our spec when we update. Without that I think the feature will be fairly limited. On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote: > I'm aligned with point 1. > > For point 2 I think we should choose quickly. I honestly do think this > would be fine as part of the Iceberg Spec directly but understand it may be > better for the broader community if it were a sub-project. As a > sub-project I would still prefer it being an Iceberg Subproject since we > are engine/file-format agnostic. > > 3. I support adding just Variant. > > On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote: > >> Hello community, >> >> It’s great to sync up with some of you on Variant and Subcolumnarization >> support in Iceberg again. Apologies that I didn’t record the meeting, but >> here are some key items that we want to follow up on with the community. >> >> 1. Adopt Spark Variant encoding >> Those present were in favor of adopting the Spark variant encoding for >> Iceberg Variant, with extensions to support other Iceberg types. We would >> like to know if anyone has an objection to reusing this open source >> encoding. >> >> 2. Movement of the Spark Variant Spec to another project >> To avoid introducing Apache Spark as a dependency for the engines and >> file formats, we discussed separating the Spark Variant encoding spec and >> implementation from the Spark Project into a neutral location. We thought up >> several solutions but didn’t have consensus on any of them. We are looking >> for more feedback on this topic from the community, either in terms of >> support for one of these options or another idea on how to support the spec. 
>> >> Options Proposed: >> * Leave the Spec in Spark (Difficult for versioning and other engines) >> * Copy the Spec into the Iceberg Project directly (Difficult for other >> Table Formats) >> * Create a Sub-Project of Apache Iceberg and move the spec and >> reference implementation there (Logistically complicated) >> * Create a Sub-Project of Apache Spark and move the spec and >> reference implementation there (Logistically complicated) >> >> 3. Add Variant type vs. Variant and JSON types >> Those who were present were in favor of adding only the Variant type to >> Iceberg. We are looking for anyone who has an objection to going forward >> with just the Variant Type and no Iceberg JSON Type. We favored >> adding the Variant type only because: >> * Introducing a JSON type would require engines that only support VARIANT >> to do write-time validation of their input to a JSON column. An engine >> without a JSON type wouldn’t be able to support this. >> * Engines which don’t support Variant will work most of the time, and can >> use fallback strings defined in the spec for reading unsupported types. >> Writing a JSON into a Variant will always work. >> >> 4. Support for Subcolumnarization spec (shredding in Spark) >> We have no action items on this but would like to follow up on >> discussions of Subcolumnarization in the future. >> * We had general agreement that this should be included in Iceberg V3, or >> else adding Variant may not be very useful. >> * We are also interested in adopting the shredding spec from Spark and >> would like to move it to wherever we decide the Variant spec will live. >> >> Let us know if we missed anything and if you have any additional thoughts or >> suggestions. >> >> Thanks >> Aihua >> >> >> On 2024/07/15 18:32:22 Aihua Xu wrote: >> > Thanks for the discussion. >> > >> > I will move forward to work on the spec PR. 
>> > >> > Regarding the implementation, we will have a module for Variant support >> in Iceberg so we will not have to bring in Spark libraries. >> > >> > I'm reposting the meeting invite in case it's not clear in my original >> email since I included it at the end. Looks like we don't have major >> objections/divergences, but let's sync up and reach consensus. >> > >> > Meeting invite: >> > >> > Wednesday, July 17 · 9:00 – 10:00am >> > Time zone: America/Los_Angeles >> > Google Meet joining info >> > Video call link: https://meet.google.com/pbm-ovzn-aoq >> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >> > >> > Thanks, >> > Aihua >> > >> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >> > > I don't think this needs to hold up the PR, but I think coming to a >> > > consensus on the exact set of types supported is worthwhile (and whether >> the >> > > goal is to maintain the same set as specified by the Spark Variant >> type or >> > > if divergence is expected/allowed). From a fragmentation perspective >> it >> > > would be a shame if they diverge, so maybe a next step is also >> suggesting >> > > to the Spark community that they add support for the missing existing Iceberg types? >> > > >> > > Thanks, >> > > Micah >> > > >> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >> russell.spit...@gmail.com> >> > > wrote: >> > > >> > > > Just talked with Aihua and he's working on the Spec PR now. We can >> get >> > > > feedback there from everyone. >> > > > >> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >> <b...@databricks.com.invalid> >> > > > wrote: >> > > > >> > > >> Good idea, but I'm hoping that we can continue to get their >> feedback in >> > > >> parallel to getting the spec changes started. Piotr didn't seem to >> object >> > > >> to the encoding from what I read of his comments. Hopefully he >> (and others) >> > > >> chime in here. 
>> > > >> >> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >> > > >> russell.spit...@gmail.com> wrote: >> > > >> >> > > >>> I just want to make sure we get Piotr and Peter on board as >> > > >>> representatives of Flink and Trino engines. Also make sure we >> have anyone >> > > >>> else chime in who has experience with Ray if possible. >> > > >>> >> > > >>> Spec changes feel like the right next step. >> > > >>> >> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >> <b...@databricks.com.invalid> >> > > >>> wrote: >> > > >>> >> > > >>>> Okay, what are the next steps here? This proposal has been out >> for >> > > >>>> quite a while and I don't see any major objections to using the >> Spark >> > > >>>> encoding. It's quite well designed and fits the need well. It >> can also be >> > > >>>> extended to support additional types that are missing if that's >> a priority. >> > > >>>> >> > > >>>> Should we move forward by starting a draft of the changes to the >> table >> > > >>>> spec? Then we can vote on committing those changes and get >> moving on an >> > > >>>> implementation (or possibly do the implementation in parallel). >> > > >>>> >> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >> > > >>>> russell.spit...@gmail.com> wrote: >> > > >>>> >> > > >>>>> That's fair, I'm sold on an Iceberg Module. >> > > >>>>> >> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >> <b...@databricks.com.invalid> >> > > >>>>> wrote: >> > > >>>>> >> > > >>>>>> > Feels like eventually the encoding should land in parquet >> proper >> > > >>>>>> right? >> > > >>>>>> >> > > >>>>>> What about using it in ORC? I don't know where it should end >> up. >> > > >>>>>> Maybe Iceberg should make a standalone module from it? >> > > >>>>>> >> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >> > > >>>>>> russell.spit...@gmail.com> wrote: >> > > >>>>>> >> > > >>>>>>> Feels like eventually the encoding should land in parquet >> proper >> > > >>>>>>> right? 
I'm fine with us just copying it into Iceberg though for >> the time >> > > >>>>>>> being. >> > > >>>>>>> >> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >> > > >>>>>>> <b...@databricks.com.invalid> wrote: >> > > >>>>>>> >> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this up in >> his >> > > >>>>>>>> last email: >> > > >>>>>>>> >> > > >>>>>>>> > do we have an issue to directly use Spark implementation in >> > > >>>>>>>> Iceberg? >> > > >>>>>>>> >> > > >>>>>>>> Yes, I think that we do have an issue using the Spark >> library. What >> > > >>>>>>>> do you think about a Java implementation in Iceberg? >> > > >>>>>>>> >> > > >>>>>>>> Ryan >> > > >>>>>>>> >> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >> b...@databricks.com> >> > > >>>>>>>> wrote: >> > > >>>>>>>> >> > > >>>>>>>>> I raised the same point from Peter's email in a comment on >> the doc >> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact that would >> be a much >> > > >>>>>>>>> smaller scope than relying on large portions of Spark, but >> even then I >> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend on that >> because it is a >> > > >>>>>>>>> Scala artifact and we would need to bring in a ton of Scala >> libs. I think >> > > >>>>>>>>> what makes the most sense is to have an independent >> implementation of the >> > > >>>>>>>>> spec in Iceberg. >> > > >>>>>>>>> >> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >> > > >>>>>>>>> >> > > >>>>>>>>>> Hi Aihua, >> > > >>>>>>>>>> Long time no see :) >> > > >>>>>>>>>> Would this mean that every engine which plans to support >> the Variant >> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like >> Flink/Trino/Hive etc? >> > > >>>>>>>>>> Thanks, Peter >> > > >>>>>>>>>> >> > > >>>>>>>>>> >> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> >> wrote: >> > > >>>>>>>>>> >> > > >>>>>>>>>>> Thanks Ryan. 
>> > > >>>>>>>>>>> >> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue Spark >> encoding to >> > > >>>>>>>>>>> keep compatibility for the open source engines. >> > > >>>>>>>>>>> >> > > >>>>>>>>>>> One more question regarding the encoding implementation: >> do we >> > > >>>>>>>>>>> have an issue to directly use Spark implementation in >> Iceberg? Russell >> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark dependency and >> that could be a >> > > >>>>>>>>>>> problem? >> > > >>>>>>>>>>> >> > > >>>>>>>>>>> Thanks, >> > > >>>>>>>>>>> Aihua >> > > >>>>>>>>>>> >> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >> > > >>>>>>>>>>> > Thanks, Aihua! >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > I think that the encoding choice in the current doc is >> a good >> > > >>>>>>>>>>> one. I went >> > > >>>>>>>>>>> > through the Spark encoding in detail and it looks like a >> > > >>>>>>>>>>> better choice than >> > > >>>>>>>>>>> > the other candidate encodings for quickly accessing >> nested >> > > >>>>>>>>>>> fields. >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > Another reason to use the Spark type is that this is >> what >> > > >>>>>>>>>>> Delta's variant >> > > >>>>>>>>>>> > type is based on, so Parquet files in tables written by >> Delta >> > > >>>>>>>>>>> could be >> > > >>>>>>>>>>> > converted or used in Iceberg tables without needing to >> rewrite >> > > >>>>>>>>>>> variant >> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and have an >> > > >>>>>>>>>>> interest in >> > > >>>>>>>>>>> > increasing format compatibility.) 
>> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > Ryan >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >> > > >>>>>>>>>>> > wrote: >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > It was great to be able to present the Variant type >> proposal >> > > >>>>>>>>>>> in the >> > > >>>>>>>>>>> > > community sync yesterday, and I’m looking to host a >> meeting >> > > >>>>>>>>>>> next week >> > > >>>>>>>>>>> > > (targeting 9am, July 17th) to go over any further >> > > >>>>>>>>>>> concerns about the >> > > >>>>>>>>>>> > > encoding of the Variant type and any other questions >> on the >> > > >>>>>>>>>>> first phase of >> > > >>>>>>>>>>> > > the proposal >> > > >>>>>>>>>>> > > < >> > > >>>>>>>>>>> >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> > > >>>>>>>>>>> >. >> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in the >> proposal >> > > >>>>>>>>>>> can either join >> > > >>>>>>>>>>> > > or reply with their comments so we can discuss them. >> A summary >> > > >>>>>>>>>>> of the >> > > >>>>>>>>>>> > > discussion and notes will be sent to the mailing list >> for >> > > >>>>>>>>>>> further comment >> > > >>>>>>>>>>> > > there. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > - >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > What should be the underlying binary representation >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc, >> including ION, >> > > >>>>>>>>>>> JSONB, and >> > > >>>>>>>>>>> > > the Spark encoding. Choosing the underlying encoding is an >> > > >>>>>>>>>>> important first step >> > > >>>>>>>>>>> > > here, and we believe we have general support for >> Spark’s >> > > >>>>>>>>>>> Variant encoding. 
>> > > >>>>>>>>>>> > > We would like to hear if anyone else has strong >> opinions in >> > > >>>>>>>>>>> this space. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > - >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Should we support multiple logical types or just >> Variant? >> > > >>>>>>>>>>> Variant vs. >> > > >>>>>>>>>>> > > Variant + JSON. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > This is to discuss which logical data type(s) should be >> supported >> > > >>>>>>>>>>> in Iceberg - >> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types would >> share the >> > > >>>>>>>>>>> same underlying >> > > >>>>>>>>>>> > > encoding but would imply different limitations on >> engines >> > > >>>>>>>>>>> working with >> > > >>>>>>>>>>> > > those types. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > From the sync-up meeting, we favor >> > > >>>>>>>>>>> supporting Variant >> > > >>>>>>>>>>> > > only, and we want to reach consensus on the supported >> > > >>>>>>>>>>> type(s). >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > - >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > How should we move forward with Subcolumnarization? >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Subcolumnarization is an optimization for the Variant type >> that >> > > >>>>>>>>>>> separates out >> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is not >> critical for >> > > >>>>>>>>>>> choosing the >> > > >>>>>>>>>>> > > initial encoding of the Variant type, so we were >> hoping to >> > > >>>>>>>>>>> gain consensus on >> > > >>>>>>>>>>> > > leaving that for a follow-up spec. 
>> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Thanks >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Aihua >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Meeting invite: >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >> > > >>>>>>>>>>> > > Google Meet joining info >> > > >>>>>>>>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq >> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >> > > >>>>>>>>>>> > > More phone numbers: >> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > >> Hello, >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >> We have drafted the proposal >> > > >>>>>>>>>>> > >> < >> > > >>>>>>>>>>> >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > >> for Variant data type. Please help review and >> comment. >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >> Thanks, >> > > >>>>>>>>>>> > >> Aihua >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same >> > > >>>>>>>>>>> discussion internally >> > > >>>>>>>>>>> > >>> and a JSON type would really play well with for >> example >> > > >>>>>>>>>>> the SUPER type in >> > > >>>>>>>>>>> > >>> Redshift: >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> >> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >> > > >>>>>>>>>>> and >> > > >>>>>>>>>>> > >>> can also provide better integration with the Trino >> JSON >> > > >>>>>>>>>>> type. >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> Looking forward to the proposal! 
>> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> Best, >> > > >>>>>>>>>>> > >>> Jack Ye >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu < ust...@gmail.com> >> > > >>>>>>>>>>> wrote: >> > > >>>>>>>>>>> > >>>> >> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many we >> need to >> > > >>>>>>>>>>> look at; >> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but >> weren't sure >> > > >>>>>>>>>>> how much >> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed to go. >> > > >>>>>>>>>>> > >>>>> >> > > >>>>>>>>>>> > >>>>> There are some engines living outside the Java >> world. It >> > > >>>>>>>>>>> would be >> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the effort it >> takes to >> > > >>>>>>>>>>> integrate >> > > >>>>>>>>>>> > >>>>> the variant type into them (e.g. velox, datafusion, >> etc.). >> > > >>>>>>>>>>> This is something >> > > >>>>>>>>>>> > >>>>> that >> > > >>>>>>>>>>> > >>>>> some proprietary Iceberg vendors also care about. >> > > >>>>>>>>>>> > >>>>> >> > > >>>>>>>>>>> > >>>> >> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some >> > > >>>>>>>>>>> perspective on this. >> > > >>>>>>>>>>> > >>>> >> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a binary >> type >> > > >>>>>>>>>>> and Iceberg and >> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the binary >> column >> > > >>>>>>>>>>> needs to be >> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be >> sufficient. >> > > >>>>>>>>>>> > >>>>> >> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it >> would be >> > > >>>>>>>>>>> good to support >> > > >>>>>>>>>>> a native >> > > >>>>>>>>>>> > >>>>> type in the file specs. 
Life will be easier for >> projects >> > > >>>>> like Apache >> > > >>>>> XTable. >> > > >>>>> File formats could also provide finer-grained >> statistics >> > > >>>>> for the variant >> > > >>>>> type, which >> > > >>>>> facilitates data skipping. >> > > >>>>> >> > > >>>> >> > > >>>> Agreed, there can definitely be additional value in >> > > >>>> native file format >> > > >>>> integration. Just wanted to highlight that it's >> not a >> > > >>>> strict requirement. >> > > >>>> >> > > >>>> -Tyler >> > > >>>> >> > > >>>> >> > > >>>>> >> > > >>>>> Gang >> > > >>>>> >> > > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau >> > > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > > >>>>> >> > > >>>>>> Good to see you again as well, JB! Thanks! >> > > >>>>>> >> > > >>>>>> -Tyler >> > > >>>>>> >> > > >>>>>> >> > > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste >> Onofré < >> > > >>>>>> j...@nanthrax.net> >> > > >>>>>> wrote: >> > > >>>>>> >> > > >>>>>>> Hi Tyler, >> > > >>>>>>> >> > > >>>>>>> Super happy to see you there :) It reminds me of our >> > > >>>>>>> discussions back at >> > > >>>>>>> the start of Apache Beam :) >> > > >>>>>>> >> > > >>>>>>> Anyway, the thread is pretty interesting. I >> remember >> > > >>>>>>> some discussions >> > > >>>>>>> about a JSON datatype for spec v3. The binary >> data type >> > > >>>>>>> is already >> > > >>>>>>> supported in spec v2. 
>> > > >>>>>>>>>>> > >>>>>>> >> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and happy >> to help >> > > >>>>>>>>>>> on this! >> > > >>>>>>>>>>> > >>>>>>> >> > > >>>>>>>>>>> > >>>>>>> Regards >> > > >>>>>>>>>>> > >>>>>>> JB >> > > >>>>>>>>>>> > >>>>>>> >> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau >> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > Hello, >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are >> working on a >> > > >>>>>>>>>>> proposal for >> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback from the >> > > >>>>>>>>>>> community. As you may know, >> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its open Data >> Lake >> > > >>>>>>>>>>> format. Having made >> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the Iceberg >> > > >>>>>>>>>>> standard, we’re now in a >> > > >>>>>>>>>>> > >>>>>>> position where there are features not yet >> supported in >> > > >>>>>>>>>>> Iceberg which we >> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users, and that >> we >> > > >>>>>>>>>>> would like to discuss >> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg >> community. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like to >> discuss are >> > > >>>>>>>>>>> in support of >> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed, >> > > >>>>>>>>>>> semi-structured data: variant data >> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant >> columns. In >> > > >>>>>>>>>>> more detail, for >> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar: >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > 1. 
Variant data types >> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient binary >> > > >>>>>>>>>>> encoding of dynamic >> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON, Avro, etc. By >> > > >>>>>>>>>>> encoding semi-structured >> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the >> flexibility of >> > > >>>>>>>>>>> the source data, >> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to more efficiently >> > > >>>>>>>>>>> operate on the data. >> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant data type on >> > > >>>>>>>>>>> Snowflake tables for many >> > > >>>>>>>>>>> > >>>>>>> years [1]. As more and more users utilize >> Iceberg >> > > >>>>>>>>>>> tables in Snowflake, >> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of requests >> for >> > > >>>>>>>>>>> variant support. >> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such as >> Apache Spark >> > > >>>>>>>>>>> have begun adding >> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we believe it >> would be >> > > >>>>>>>>>>> beneficial to the >> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to standardize on >> the >> > > >>>>>>>>>>> variant data type >> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, >> since an >> > > >>>>>>>>>>> Apache OSS >> > > >>>>>>>>>>> > >>>>>>> version of variant encoding already exists in >> Spark, >> > > >>>>>>>>>>> it likely makes sense >> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as the >> Iceberg >> > > >>>>>>>>>>> standard as well. 
The >> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today in Snowflake is >> > > >>>>>>>>>>> slightly different, but >> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no >> particular value >> > > >>>>>>>>>>> in trying to clutter >> > > >>>>>>>>>>> > >>>>>>> the space with another >> equivalent-but-incompatible >> > > >>>>>>>>>>> encoding. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization >> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns allows >> query >> > > >>>>>>>>>>> engines to >> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when subcolumns >> (i.e., >> > > >>>>>>>>>>> nested fields) within a >> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and also allows >> optionally >> > > >>>>>>>>>>> materializing some >> > > >>>>>>>>>>> > >>>>>>> of the nested fields as a column on their own, >> > > >>>>>>>>>>> affording queries on these >> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less data and >> spend >> > > >>>>>>>>>>> less CPU on extraction. >> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system managing table >> > > >>>>>>>>>>> metadata and data tracks >> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min, max, null, >> etc.) >> > > >>>>>>>>>>> for some subset of the >> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and also >> manages any >> > > >>>>>>>>>>> optional >> > > >>>>>>>>>>> > >>>>>>> materialization. Without subcolumnarization, >> any query >> > > >>>>>>>>>>> which touches a >> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse, extract, and >> filter >> > > >>>>>>>>>>> every row for which >> > > >>>>>>>>>>> > >>>>>>> that column is non-null. 
Thus, by providing a >> > > >>>>>>>>>>> standardized way of tracking >> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant columns, >> > > >>>>>>>>>>> Iceberg can make >> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible across >> various >> > > >>>>>>>>>>> catalogs and query >> > > >>>>>>>>>>> > >>>>>>> engines. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so >> we >> > > >>>>>>>>>>> expect any >> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only the set of >> > > >>>>>>>>>>> changes to Iceberg >> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query engines to >> > > >>>>>>>>>>> interoperate on >> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant columns, >> but also >> > > >>>>>>>>>>> reference >> > > >>>>>>>>>>> > >>>>>>> documentation explaining subcolumnarization >> principles >> > > >>>>>>>>>>> and recommended best >> > > >>>>>>>>>>> > >>>>>>> practices. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] >> may be a >> > > >>>>>>>>>>> good starting >> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our plan is >> to >> > > >>>>>>>>>>> write something up in >> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec changes, >> > > >>>>>>>>>>> backwards compatibility, >> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted to >> first reach >> > > >>>>>>>>>>> out to the community >> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea, and see if >> > > >>>>>>>>>>> there’s any early feedback >> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend too much >> time on >> > > >>>>>>>>>>> a concrete proposal. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > Thank you! 
>> > > >>>>>>>>>>> > >>>>>>> >
>> > > >>>>>>>>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>> > > >>>>>>>>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>> > > >>>>>>>>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>> > > >>>>>>>>>>> > >>>>>>> >
>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>
>> -- Ryan Blue Databricks
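[Editor's illustration] The Spark Variant encoding discussed throughout this thread (see [2] above) stores each value as a metadata buffer plus a value buffer, where the first byte of the value buffer selects the basic type. As a rough, hedged sketch only (the field layout here is taken from my reading of the Spark variant README linked above, not from an official implementation), decoding that header byte looks roughly like:

```python
# Hedged sketch of splitting the first header byte of a Spark Variant
# value. Assumption drawn from the Spark variant README: the low 2 bits
# of the header are the basic type, and the high 6 bits carry
# type-specific info (e.g. the length of a short string).
BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}

def read_value_header(value: bytes) -> tuple:
    """Return (basic_type, type_info) for the first byte of a Variant value."""
    header = value[0]
    basic_type = header & 0b11   # low 2 bits select the basic type
    type_info = header >> 2      # high 6 bits are type-specific info
    return BASIC_TYPES[basic_type], type_info

# Under the assumed layout, a header byte of 0b0001_0101 reads as a
# short string whose length (type info) is 5.
print(read_value_header(bytes([0b00010101])))
```

The point of the sketch is why the encoding is engine- and file-format-agnostic: any implementation (Java in Iceberg, C++ in velox, Rust in datafusion) only needs this byte-level layout, not any Spark library.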
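[Editor's illustration] The subcolumnarization discussion above boils down to tracking per-file pruning statistics (min, max, null counts, etc.) for selected nested fields of a variant column so that engines can skip files entirely. A minimal sketch of that pruning idea follows; the field paths and stats layout are hypothetical, purely to illustrate the mechanism, and are not the proposed Iceberg metadata:

```python
# Minimal sketch of data skipping with subcolumn statistics. The
# "subcolumn_stats" layout and the "$.event.ts" path are invented for
# illustration; they do not reflect any actual spec.
def prune_files(files, path, lo, hi):
    """Keep only files whose [min, max] range for `path` can overlap [lo, hi]."""
    survivors = []
    for f in files:
        stats = f["subcolumn_stats"].get(path)
        if stats is None:
            survivors.append(f)  # no stats tracked: must read the file
        elif stats["max"] >= lo and stats["min"] <= hi:
            survivors.append(f)  # ranges overlap: file may contain matches
    return survivors

files = [
    {"name": "a.parquet", "subcolumn_stats": {"$.event.ts": {"min": 10, "max": 20}}},
    {"name": "b.parquet", "subcolumn_stats": {"$.event.ts": {"min": 30, "max": 40}}},
    {"name": "c.parquet", "subcolumn_stats": {}},  # field never subcolumnarized
]
# Query for 15 <= $.event.ts <= 25: b.parquet can be skipped.
print([f["name"] for f in prune_files(files, "$.event.ts", 15, 25)])
```

This is why the thread notes that without subcolumnarization every query touching a variant column must read, parse, extract, and filter every non-null row: with no per-field statistics, no file can ever be skipped.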