Re: [Early Feedback] Variant and Subcolumnarization Support

Aihua Xu Wed, 17 Jul 2024 22:54:39 -0700

Hello community,

It was great to sync up with some of you on Variant and Subcolumnarization 
support in Iceberg again. Apologies that I didn’t record the meeting, but here 
are some key items that we want to follow up on with the community.


1. Adopt Spark Variant encoding 
Those present were in favor of adopting the Spark Variant encoding for Iceberg 
Variant, with extensions to support other Iceberg types. We would like to know 
if anyone objects to reusing this open source encoding.

2. Movement of the Spark Variant Spec to another project
To avoid introducing Apache Spark as a dependency for the engines and file 
formats, we discussed separating Spark Variant encoding spec and implementation 
from the Spark Project to a neutral location. We thought up several solutions 
but didn’t have consensus on any of them. We are looking for more feedback on 
this topic from the community either in terms of support for one of these 
options or another idea on how to support the spec.

Options Proposed:
* Leave the spec in Spark (difficult for versioning and for other engines)
* Copy the spec into the Iceberg project directly (difficult for other table 
formats)
* Create a sub-project of Apache Iceberg and move the spec and reference 
implementation there (logistically complicated)
* Create a sub-project of Apache Spark and move the spec and reference 
implementation there (logistically complicated)

3. Add Variant type vs. Variant and JSON types
Those who were present were in favor of adding only the Variant type to 
Iceberg. We are looking for anyone who has an objection to going forward with 
just the Variant Type and no Iceberg JSON Type. We were favoring adding Variant 
type only because:
* Introducing a JSON type would require engines that only support VARIANT to do 
write-time validation of their input to a JSON column. An engine without a 
JSON type wouldn’t support this.
* Engines which don’t support Variant will work most of the time but can have 
fallback strings defined in the spec for reading unsupported types. Writing a 
JSON into a Variant will always work.

4. Support for Subcolumnarization spec (shredding in Spark) 
We have no action items on this but would like to follow up on discussions of 
Subcolumnarization in the future. 
* We had general agreement that this should be included in Iceberg V3, or else 
adding Variant may not be useful.
* We are also interested in adopting the shredding spec from Spark and would 
like to move it to wherever we decide the Variant spec is going to live.
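
For anyone new to the topic, here is a minimal sketch of the pruning idea 
behind subcolumnarization as described in the original proposal quoted further 
down (per-field min/max statistics for nested fields of a variant column). All 
names here (SubcolumnStats, DataFile, may_contain, prune) are hypothetical and 
for illustration only; they are not part of the Iceberg or Spark specs or 
APIs.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SubcolumnStats:
    """Hypothetical per-file statistics for one nested field of a variant column."""
    min_value: Any
    max_value: Any
    null_count: int


@dataclass
class DataFile:
    path: str
    # Maps a nested field path (e.g. "event.user_id") to its statistics.
    # Fields that were not subcolumnarized in this file are simply absent.
    subcolumn_stats: Dict[str, SubcolumnStats]


def may_contain(file: DataFile, field: str, value: Any) -> bool:
    """Return False only when the stats prove that no row has field == value."""
    stats = file.subcolumn_stats.get(field)
    if stats is None:
        # No stats tracked: the file must be read and the variant parsed.
        return True
    return stats.min_value <= value <= stats.max_value


def prune(files: List[DataFile], field: str, value: Any) -> List[str]:
    """Keep only the paths of files that might contain a matching row."""
    return [f.path for f in files if may_contain(f, field, value)]


files = [
    DataFile("a.parquet", {"event.user_id": SubcolumnStats(1, 100, 0)}),
    DataFile("b.parquet", {"event.user_id": SubcolumnStats(101, 200, 0)}),
    DataFile("c.parquet", {}),  # field not subcolumnarized in this file
]
print(prune(files, "event.user_id", 150))
```

In this sketch a.parquet is skipped because 150 falls outside its min/max 
range, while c.parquet has no statistics for the field and so still has to be 
read in full. That second case is the cost of having no subcolumnarization.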

Let us know if we missed anything and if you have any additional thoughts or 
suggestions.

Thanks
Aihua


On 2024/07/15 18:32:22 Aihua Xu wrote:
> Thanks for the discussion.
> 
> I will move forward with work on the spec PR. 
> 
> Regarding the implementation, we will have a module for Variant support in 
> Iceberg so we will not have to bring in Spark libraries. 
> 
> I'm reposting the meeting invite in case it wasn't clear in my original email, 
> since I included it at the end. Looks like we don't have major 
> objections/divergences, but let's sync up and reach consensus. 
> 
> Meeting invite:
> 
> Wednesday, July 17 · 9:00 – 10:00am
> Time zone: America/Los_Angeles
> Google Meet joining info
> Video call link: https://meet.google.com/pbm-ovzn-aoq
> Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
> More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
> 
> Thanks,
> Aihua
> 
> On 2024/07/12 20:55:01 Micah Kornfield wrote:
> > I don't think this needs to hold up the PR but I think coming to a
> > consensus on the exact set of types supported is worthwhile (and if the
> > goal is to maintain the same set as specified by the Spark Variant type or
> > if divergence is expected/allowed).  From a fragmentation perspective it
> > would be a shame if they diverge, so maybe a next step is also suggesting
> > support to the Spark community on the missing existing Iceberg types?
> > 
> > Thanks,
> > Micah
> > 
> > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <[email protected]>
> > wrote:
> > 
> > > Just talked with Aihua and he's working on the Spec PR now. We can get
> > > feedback there from everyone.
> > >
> > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue <[email protected]>
> > > wrote:
> > >
> > >> Good idea, but I'm hoping that we can continue to get their feedback in
> > >> parallel to getting the spec changes started. Piotr didn't seem to object
> > >> to the encoding from what I read of his comments. Hopefully he (and 
> > >> others)
> > >> chime in here.
> > >>
> > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <
> > >> [email protected]> wrote:
> > >>
> > >>> I just want to make sure we get Piotr and Peter on board as
> > >>> representatives of Flink and Trino engines. Also make sure we have 
> > >>> anyone
> > >>> else chime in who has experience with Ray if possible.
> > >>>
> > >>> Spec changes feel like the right next step.
> > >>>
> > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Okay, what are the next steps here? This proposal has been out for
> > >>>> quite a while and I don't see any major objections to using the Spark
> > >>>> encoding. It's quite well designed and fits the need well. It can also 
> > >>>> be
> > >>>> extended to support additional types that are missing if that's a 
> > >>>> priority.
> > >>>>
> > >>>> Should we move forward by starting a draft of the changes to the table
> > >>>> spec? Then we can vote on committing those changes and get moving on an
> > >>>> implementation (or possibly do the implementation in parallel).
> > >>>>
> > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <
> > >>>> [email protected]> wrote:
> > >>>>
> > >>>>> That's fair, I'm sold on an Iceberg Module.
> > >>>>>
> > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue 
> > >>>>> <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> > Feels like eventually the encoding should land in parquet proper
> > >>>>>> right?
> > >>>>>>
> > >>>>>> What about using it in ORC? I don't know where it should end up.
> > >>>>>> Maybe Iceberg should make a standalone module from it?
> > >>>>>>
> > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
> > >>>>>> [email protected]> wrote:
> > >>>>>>
> > >>>>>>> Feels like eventually the encoding should land in parquet proper
> > >>>>>>> right? I'm fine with us just copying into Iceberg though for the 
> > >>>>>>> time
> > >>>>>>> being.
> > >>>>>>>
> > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue
> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Oops, it looks like I missed where Aihua brought this up in his
> > >>>>>>>> last email:
> > >>>>>>>>
> > >>>>>>>> > do we have an issue to directly use Spark implementation in
> > >>>>>>>> Iceberg?
> > >>>>>>>>
> > >>>>>>>> Yes, I think that we do have an issue using the Spark library. What
> > >>>>>>>> do you think about a Java implementation in Iceberg?
> > >>>>>>>>
> > >>>>>>>> Ryan
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <[email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> I raised the same point from Peter's email in a comment on the doc
> > >>>>>>>>> as well. There is a spark-variant_2.13 artifact that would be a 
> > >>>>>>>>> much
> > >>>>>>>>> smaller scope than relying on large portions of Spark, but even 
> > >>>>>>>>> then I
> > >>>>>>>>> doubt that it is a good idea for Iceberg to depend on that 
> > >>>>>>>>> because it is a
> > >>>>>>>>> Scala artifact and we would need to bring in a ton of Scala libs. 
> > >>>>>>>>> I think
> > >>>>>>>>> what makes the most sense is to have an independent 
> > >>>>>>>>> implementation of the
> > >>>>>>>>> spec in Iceberg.
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
> > >>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi Aihua,
> > >>>>>>>>>> Long time no see :)
> > >>>>>>>>>> Would this mean, that every engine which plans to support Variant
> > >>>>>>>>>> data type needs to add Spark as a dependency? Like 
> > >>>>>>>>>> Flink/Trino/Hive etc?
> > >>>>>>>>>> Thanks, Peter
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks Ryan.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Yeah. That's another reason we want to pursue Spark encoding to
> > >>>>>>>>>>> keep compatibility for the open source engines.
> > >>>>>>>>>>>
> > >>>>>>>>>>> One more question regarding the encoding implementation: do we
> > >>>>>>>>>>> have an issue to directly use Spark implementation in Iceberg? 
> > >>>>>>>>>>> Russell
> > >>>>>>>>>>> pointed out that Trino doesn't have Spark dependency and that 
> > >>>>>>>>>>> could be a
> > >>>>>>>>>>> problem?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Aihua
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
> > >>>>>>>>>>> > Thanks, Aihua!
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > I think that the encoding choice in the current doc is a good
> > >>>>>>>>>>> one. I went
> > >>>>>>>>>>> > through the Spark encoding in detail and it looks like a
> > >>>>>>>>>>> better choice than
> > >>>>>>>>>>> > the other candidate encodings for quickly accessing nested
> > >>>>>>>>>>> fields.
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > Another reason to use the Spark type is that this is what
> > >>>>>>>>>>> Delta's variant
> > >>>>>>>>>>> > type is based on, so Parquet files in tables written by Delta
> > >>>>>>>>>>> could be
> > >>>>>>>>>>> > converted or used in Iceberg tables without needing to rewrite
> > >>>>>>>>>>> variant
> > >>>>>>>>>>> > data. (Also, note that I work at Databricks and have an
> > >>>>>>>>>>> interest in
> > >>>>>>>>>>> > increasing format compatibility.)
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > Ryan
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <
> > >>>>>>>>>>> [email protected]>
> > >>>>>>>>>>> > wrote:
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > It’s great to be able to present the Variant type proposal
> > >>>>>>>>>>> in the
> > >>>>>>>>>>> > > community sync yesterday and I’m looking to host a meeting
> > >>>>>>>>>>> next week
> > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over any further
> > >>>>>>>>>>> concerns about the
> > >>>>>>>>>>> > > encoding of the Variant type and any other questions on the
> > >>>>>>>>>>> first phase of
> > >>>>>>>>>>> > > the proposal
> > >>>>>>>>>>> > > <
> > >>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
> > >>>>>>>>>>> >.
> > >>>>>>>>>>> > > We are hoping that anyone who is interested in the proposal
> > >>>>>>>>>>> can either join
> > >>>>>>>>>>> > > or reply with their comments so we can discuss them. Summary
> > >>>>>>>>>>> of the
> > >>>>>>>>>>> > > discussion and notes will be sent to the mailing list for
> > >>>>>>>>>>> further comment
> > >>>>>>>>>>> > > there.
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >    -
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >    What should be the underlying binary representation
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > We have evaluated a few encodings in the doc including ION,
> > >>>>>>>>>>> JSONB, and
> > >>>>>>>>>>> > > Spark encoding. Choosing the underlying encoding is an
> > >>>>>>>>>>> important first step
> > >>>>>>>>>>> > > here and we believe we have general support for Spark’s
> > >>>>>>>>>>> Variant encoding.
> > >>>>>>>>>>> > > We would like to hear if anyone else has strong opinions in
> > >>>>>>>>>>> this space.
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >    -
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >    Should we support multiple logical types or just Variant?
> > >>>>>>>>>>> Variant vs.
> > >>>>>>>>>>> > >    Variant + JSON.
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > This is to discuss what logical data type(s) to be supported
> > >>>>>>>>>>> in Iceberg -
> > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types would share the
> > >>>>>>>>>>> same underlying
> > >>>>>>>>>>> > > encoding but would imply different limitations on engines
> > >>>>>>>>>>> working with
> > >>>>>>>>>>> > > those types.
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > From the sync up meeting, we are leaning toward
> > >>>>>>>>>>> supporting Variant
> > >>>>>>>>>>> > > only and we want to have a consensus on the supported
> > >>>>>>>>>>> type(s).
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >    -
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >    How should we move forward with Subcolumnization?
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > Subcolumnization is an optimization for Variant type by
> > >>>>>>>>>>> separating out
> > >>>>>>>>>>> > > subcolumns with their own metadata. This is not critical for
> > >>>>>>>>>>> choosing the
> > >>>>>>>>>>> > > initial encoding of the Variant type so we were hoping to
> > >>>>>>>>>>> gain consensus on
> > >>>>>>>>>>> > > leaving that for a follow up spec.
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > Thanks
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > Aihua
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > Meeting invite:
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
> > >>>>>>>>>>> > > Time zone: America/Los_Angeles
> > >>>>>>>>>>> > > Google Meet joining info
> > >>>>>>>>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
> > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
> > >>>>>>>>>>> > > More phone numbers:
> > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <
> > >>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>> > >
> > >>>>>>>>>>> > >> Hello,
> > >>>>>>>>>>> > >>
> > >>>>>>>>>>> > >> We have drafted the proposal
> > >>>>>>>>>>> > >> <
> > >>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > >> for Variant data type. Please help review and comment.
> > >>>>>>>>>>> > >>
> > >>>>>>>>>>> > >> Thanks,
> > >>>>>>>>>>> > >> Aihua
> > >>>>>>>>>>> > >>
> > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <
> > >>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>> > >>
> > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same
> > >>>>>>>>>>> discussion internally
> > >>>>>>>>>>> > >>> and a JSON type would really play well with for example
> > >>>>>>>>>>> the SUPER type in
> > >>>>>>>>>>> > >>> Redshift:
> > >>>>>>>>>>> > >>>
> > >>>>>>>>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
> > >>>>>>>>>>> and
> > >>>>>>>>>>> > >>> can also provide better integration with the Trino JSON
> > >>>>>>>>>>> type.
> > >>>>>>>>>>> > >>>
> > >>>>>>>>>>> > >>> Looking forward to the proposal!
> > >>>>>>>>>>> > >>>
> > >>>>>>>>>>> > >>> Best,
> > >>>>>>>>>>> > >>> Jack Ye
> > >>>>>>>>>>> > >>>
> > >>>>>>>>>>> > >>>
> > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
> > >>>>>>>>>>> > >>> <[email protected]> wrote:
> > >>>>>>>>>>> > >>>
> > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu 
> > >>>>>>>>>>> > >>>> <[email protected]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many we need to
> > >>>>>>>>>>> look at;
> > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but weren't sure
> > >>>>>>>>>>> how much
> > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed to go.
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>> There are some engines living outside the Java world. It
> > >>>>>>>>>>> would be
> > >>>>>>>>>>> > >>>>> good if the proposal could cover the effort it takes to
> > >>>>>>>>>>> integrate
> > >>>>>>>>>>> > >>>>> variant type to them (e.g. velox, datafusion, etc.).
> > >>>>>>>>>>> This is something
> > >>>>>>>>>>> > >>>>> that
> > >>>>>>>>>>> > >>>>> some proprietary iceberg vendors also care about.
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some
> > >>>>>>>>>>> perspective on this.
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a binary type
> > >>>>>>>>>>> and Iceberg and
> > >>>>>>>>>>> > >>>>> > the query engines are aware that the binary column
> > >>>>>>>>>>> needs to be
> > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be sufficient.
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it would be
> > >>>>>>>>>>> good to support
> > >>>>>>>>>>> > >>>>> native
> > >>>>>>>>>>> > >>>>> type from file specs. Life will be easier for projects
> > >>>>>>>>>>> like Apache
> > >>>>>>>>>>> > >>>>> XTable.
> > >>>>>>>>>>> > >>>>> File format could also provide finer-grained statistics
> > >>>>>>>>>>> for variant
> > >>>>>>>>>>> > >>>>> type which
> > >>>>>>>>>>> > >>>>> facilitates data skipping.
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional value in
> > >>>>>>>>>>> native file format
> > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that it's not a
> > >>>>>>>>>>> strict requirement.
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>> -Tyler
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>>
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>> Gang
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
> > >>>>>>>>>>> > >>>>> <[email protected]> wrote:
> > >>>>>>>>>>> > >>>>>
> > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks!
> > >>>>>>>>>>> > >>>>>>
> > >>>>>>>>>>> > >>>>>> -Tyler
> > >>>>>>>>>>> > >>>>>>
> > >>>>>>>>>>> > >>>>>>
> > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <
> > >>>>>>>>>>> [email protected]>
> > >>>>>>>>>>> > >>>>>> wrote:
> > >>>>>>>>>>> > >>>>>>
> > >>>>>>>>>>> > >>>>>>> Hi Tyler,
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It reminds me of our
> > >>>>>>>>>>> discussions back in
> > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :)
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember
> > >>>>>>>>>>> some discussions
> > >>>>>>>>>>> > >>>>>>> about JSON datatype for spec v3. The binary data type
> > >>>>>>>>>>> is already
> > >>>>>>>>>>> > >>>>>>> supported in the spec v2.
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and happy to help
> > >>>>>>>>>>> on this !
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> > >>>>>>> Regards
> > >>>>>>>>>>> > >>>>>>> JB
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
> > >>>>>>>>>>> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > Hello,
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a
> > >>>>>>>>>>> proposal for
> > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback from the
> > >>>>>>>>>>> community. As you may know,
> > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its open Data Lake
> > >>>>>>>>>>> format. Having made
> > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the Iceberg
> > >>>>>>>>>>> standard, we’re now in a
> > >>>>>>>>>>> > >>>>>>> position where there are features not yet supported in
> > >>>>>>>>>>> Iceberg which we
> > >>>>>>>>>>> > >>>>>>> think would be valuable for our users, and that we
> > >>>>>>>>>>> would like to discuss
> > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg community.
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like to discuss are
> > >>>>>>>>>>> in support of
> > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed,
> > >>>>>>>>>>> semi-structured data: variant data
> > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant columns. In
> > >>>>>>>>>>> more detail, for
> > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar:
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > 1. Variant data types
> > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient binary
> > >>>>>>>>>>> encoding of dynamic
> > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON, Avro, etc. By
> > >>>>>>>>>>> encoding semi-structured
> > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the flexibility of
> > >>>>>>>>>>> the source data,
> > >>>>>>>>>>> > >>>>>>> while allowing query engines to more efficiently
> > >>>>>>>>>>> operate on the data.
> > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant data type on
> > >>>>>>>>>>> Snowflake tables for many
> > >>>>>>>>>>> > >>>>>>> years [1]. As more and more users utilize Iceberg
> > >>>>>>>>>>> tables in Snowflake,
> > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of requests for
> > >>>>>>>>>>> variant support.
> > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such as Apache Spark
> > >>>>>>>>>>> have begun adding
> > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we believe it would be
> > >>>>>>>>>>> beneficial to the
> > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to standardize on the
> > >>>>>>>>>>> variant data type
> > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables.
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, since an
> > >>>>>>>>>>> Apache OSS
> > >>>>>>>>>>> > >>>>>>> version of variant encoding already exists in Spark,
> > >>>>>>>>>>> it likely makes sense
> > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as the Iceberg
> > >>>>>>>>>>> standard as well. The
> > >>>>>>>>>>> > >>>>>>> encoding we use internally today in Snowflake is
> > >>>>>>>>>>> slightly different, but
> > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no particular value
> > >>>>>>>>>>> in trying to clutter
> > >>>>>>>>>>> > >>>>>>> the space with another equivalent-but-incompatible
> > >>>>>>>>>>> encoding.
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization
> > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns allows query
> > >>>>>>>>>>> engines to
> > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when subcolumns (i.e.,
> > >>>>>>>>>>> nested fields) within a
> > >>>>>>>>>>> > >>>>>>> variant column are queried, and also allows optionally
> > >>>>>>>>>>> materializing some
> > >>>>>>>>>>> > >>>>>>> of the nested fields as a column on their own,
> > >>>>>>>>>>> affording queries on these
> > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less data and spend
> > >>>>>>>>>>> less CPU on extraction.
> > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system managing table
> > >>>>>>>>>>> metadata and data tracks
> > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min, max, null, etc.)
> > >>>>>>>>>>> for some subset of the
> > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and also manages any
> > >>>>>>>>>>> optional
> > >>>>>>>>>>> > >>>>>>> materialization. Without subcolumnarization, any query
> > >>>>>>>>>>> which touches a
> > >>>>>>>>>>> > >>>>>>> variant column must read, parse, extract, and filter
> > >>>>>>>>>>> every row for which
> > >>>>>>>>>>> > >>>>>>> that column is non-null. Thus, by providing a
> > >>>>>>>>>>> standardized way of tracking
> > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant columns,
> > >>>>>>>>>>> Iceberg can make
> > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible across various
> > >>>>>>>>>>> catalogs and query
> > >>>>>>>>>>> > >>>>>>> engines.
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we
> > >>>>>>>>>>> expect any
> > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only the set of
> > >>>>>>>>>>> changes to Iceberg
> > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query engines to
> > >>>>>>>>>>> interoperate on
> > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant columns, but also
> > >>>>>>>>>>> reference
> > >>>>>>>>>>> > >>>>>>> documentation explaining subcolumnarization principles
> > >>>>>>>>>>> and recommended best
> > >>>>>>>>>>> > >>>>>>> practices.
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a
> > >>>>>>>>>>> good starting
> > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our plan is to
> > >>>>>>>>>>> write something up in
> > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec changes,
> > >>>>>>>>>>> backwards compatibility,
> > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted to first reach
> > >>>>>>>>>>> out to the community
> > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea, and see if
> > >>>>>>>>>>> there’s any early feedback
> > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend too much time on
> > >>>>>>>>>>> a concrete proposal.
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > Thank you!
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > [1]
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> > >>>>>>>>>>> > >>>>>>> > [2]
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
> > >>>>>>>>>>> > >>>>>>> > [3]
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
> > >>>>>>>>>>> > >>>>>>> >
> > >>>>>>>>>>> > >>>>>>>
> > >>>>>>>>>>> > >>>>>>
> > >>>>>>>>>>> >
> > >>>>>>>>>>> > --
> > >>>>>>>>>>> > Ryan Blue
> > >>>>>>>>>>> > Databricks
> > >>>>>>>>>>> >
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Ryan Blue
> > >>>>>>>>> Databricks
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Ryan Blue
> > >>>>>>>> Databricks
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Ryan Blue
> > >>>>>> Databricks
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Ryan Blue
> > >>>> Databricks
> > >>>>
> > >>>
> > >>
> > >> --
> > >> Ryan Blue
> > >> Databricks
> > >>
> > >
> > 
>
