Let me know if I understand correctly: basically the spec will not include any type promotion. E.g., if the chosen type for the subcolumn is int64, then only int64 values will be encoded in `typed_value`, while values of other types, including int32, will be encoded in `untyped_value`; similarly, if the chosen type is decimal(10, 5), decimal(10, 2) values will be encoded in `untyped_value`.
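If that reading is right, the dispatch rule is just an exact-match check with no widening. Here is a minimal sketch of that rule; the type tags and function name are hypothetical illustrations for this thread, not names from the spec:

```python
def shred_target(value_type: str, chosen_type: str) -> str:
    """Pick the physical column for a variant value under the
    no-promotion reading: a value is shredded into typed_value only
    when its type matches the chosen subcolumn type exactly; anything
    else, including a narrower integer or a decimal with a different
    scale, falls back to untyped_value."""
    return "typed_value" if value_type == chosen_type else "untyped_value"

# int64 subcolumn: int64 values are shredded; int32 is NOT promoted.
assert shred_target("int64", "int64") == "typed_value"
assert shred_target("int32", "int64") == "untyped_value"

# decimal(10, 5) subcolumn: decimal(10, 2) keeps its original scale
# by going to untyped_value instead of being rescaled to (10, 5).
assert shred_target("decimal(10, 2)", "decimal(10, 5)") == "untyped_value"
```

Under this reading, any widening (int32 to int64, rescaling decimals) would be an engine-side decision made before writing, not something the storage spec performs.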
This looks clean to me. On Fri, Jul 26, 2024 at 3:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > If we aren't optimizing for strict decimal behavior, then I think the >> cleanest option is to use `untyped_value` when decimal scales need to be >> preserved. I would also remove language from the shredding spec about >> storage modifying values so that this is an engine concern. The storage >> spec should state how you can encode values, without making recommendations >> about modifying those values. If an engine's semantics for variant allow it >> to modify value types, then that's up to the engine. > > > If I understand this correctly this also applies to int32/int64 > conversions? This sounds like a good solution to me as well (IIUC all we > are saying here is it is an engine decision on when it wants to widen types > for storage purposes and that might involve some loss in data fidelity?). > > Thanks, > Micah > > On Fri, Jul 26, 2024 at 9:55 AM Ryan Blue <b...@databricks.com.invalid> > wrote: > >> As a follow up, I talked with Russell quite a bit about losing types >> after the meeting and he convinced me that while there are valid use cases >> for strict decimal behavior, the majority of cases are either engines using >> decimal to keep track of the original number of digits in a number or >> people that simply want to limit the number of digits. In that case, I >> think the natural conclusion is that it should be _possible_ to have strict >> behavior but we should not increase complexity too much to optimize for it. >> >> If we aren't optimizing for strict decimal behavior, then I think the >> cleanest option is to use `untyped_value` when decimal scales need to be >> preserved. I would also remove language from the shredding spec about >> storage modifying values so that this is an engine concern. The storage >> spec should state how you can encode values, without making recommendations >> about modifying those values. 
If an engine's semantics for variant allow it >> to modify value types, then that's up to the engine. >> >> In the discussion, I wasn't the only person in favor of not modifying >> decimal scales, but I'm curious if this distinction satisfies everyone. If >> we remove the wording from the proposal that recommends modifying decimals >> and leave this to the engine, do we have agreement? >> >> On Thu, Jul 25, 2024 at 6:46 PM Aihua Xu <aihu...@gmail.com> wrote: >> >>> Hi community, >>> >>> Thanks for joining the meeting to discuss variant shredding. For those >>> who were unable to attend the meeting, please check out the recorded >>> meeting >>> <https://drive.google.com/file/d/1kiwv29nxxOqMCbxXn-NRoz-x2E9yIMlJ/view?usp=drive_link> >>> if >>> you are interested. Also, to follow up on the meeting and converge on the >>> lossiness discussion from shredding offline, I have converted the spark >>> shredding proposal by David into a google doc >>> <https://docs.google.com/document/d/1JeBt4NIju08jQ2AbludiK-U0M9ISIgysP7fUDWtv7rg/edit> >>> so >>> please comment. >>> >>> Thanks, >>> Aihua >>> >>> >>> On Thu, Jul 25, 2024 at 10:14 AM Aihua Xu <aihu...@gmail.com> wrote: >>> >>>> Yes. This time I was able to record it and I will share it when it’s >>>> processed. >>>> >>>> >>>> On Jul 25, 2024, at 10:01 AM, Amogh Jahagirdar <2am...@gmail.com> >>>> wrote: >>>> >>>> >>>> Any chance this meeting was recorded? I couldn't make it but would be >>>> interested in catching up on the discussion. >>>> >>>> Thanks, >>>> >>>> Amogh Jahagirdar >>>> >>>> On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote: >>>> >>>>> Thanks folks for the additional discussion. >>>>> >>>>> There are some questions related to subcolumnization (spark shredding >>>>> - see the discussion <https://github.com/apache/spark/pull/46831>) >>>>> and we would like to host another meeting to mainly discuss that since we >>>>> plan to adopt it. 
We can also follow up on the Spark variant topics (I can >>>>> see >>>>> that mostly we are aligned with the exception of finding a place for the spec >>>>> and implementation). Look forward to meeting with you. BTW: should I >>>>> include dev@iceberg.apache.org in the email invite? >>>>> >>>>> Sync up on Variant subcolumnization (shredding) >>>>> Thursday, July 25 · 8:00 – 9:00am >>>>> Time zone: America/Los_Angeles >>>>> Google Meet joining info >>>>> Video call link: https://meet.google.com/mug-dvnv-hnq >>>>> Or dial: (US) +1 904-900-0730 PIN: 671 997 419# >>>>> More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422 >>>>> >>>>> Thanks, >>>>> Aihua >>>>> >>>>> On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com> >>>>> wrote: >>>>> >>>>>> I'm late replying to this but I'm also in agreement with 1 (adopting >>>>>> the spark variant encoding), 3 (specifically only having a variant type), >>>>>> and 4 (ensuring we are thinking through subcolumnarization upfront since >>>>>> without it the variant type may not be that useful). >>>>>> >>>>>> I'd also support having the spec and reference implementation in >>>>>> Iceberg; as others have said, it centralizes improvements in a single, >>>>>> agnostic dependency for engines, rather than engines having to take >>>>>> dependencies on other engine modules. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Amogh Jahagirdar >>>>>> >>>>>> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry < >>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>> >>>>>>> I have been looking around at how we can map the Variant type in Flink. I >>>>>>> have not found any existing type which we could use, but Flink already >>>>>>> has >>>>>>> some JSON parsing capabilities [1] for string fields. >>>>>>> >>>>>>> So until we have native support in Flink for something similar to the >>>>>>> Variant type, I expect that we need to map it to JSON strings in >>>>>>> RowData. >>>>>>> >>>>>>> Based on that, here are my preferences: >>>>>>> 1. 
I'm ok with adapting Spark Variant type, if we build our own >>>>>>> Iceberg serializer/deserializer module for it >>>>>>> 2. I prefer to move the spec to Iceberg, so we own it, and extend >>>>>>> it, if needed. This could be important in the first phase. Later when >>>>>>> it is >>>>>>> more stable we might donate it to some other project, like Parquet >>>>>>> 3. I would prefer to support only a single type, and Variant is more >>>>>>> expressive, but having a standard way to convert between JSON and >>>>>>> Variant >>>>>>> would be useful for Flink users. >>>>>>> 4. On subcolumnarization: I think Flink will only use this feature >>>>>>> as much as the Iceberg readers implement this, so I would like to see as >>>>>>> much as possible of it in the common Iceberg code >>>>>>> >>>>>>> Thanks, >>>>>>> Peter >>>>>>> >>>>>>> [1] - >>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions >>>>>>> >>>>>>> >>>>>>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Sorry for the late reply. I agree with the sentiments on 1 and 3 >>>>>>>> that have already been posted (adopt the Spark encoding, and only have >>>>>>>> the >>>>>>>> Variant type). As mentioned on the doc for 3, I think it would be >>>>>>>> good to >>>>>>>> specify how to map scalar types to a JSON representation so there can >>>>>>>> be >>>>>>>> consistency between engines that don't support variant. >>>>>>>> >>>>>>>> >>>>>>>>> Regarding point 2, I also feel Iceberg is more natural to host >>>>>>>>> such a subproject for variant spec and implementation. But let me >>>>>>>>> reach out >>>>>>>>> to the Spark community to discuss. >>>>>>>> >>>>>>>> >>>>>>>> The only other place I can think of that might be a good home for >>>>>>>> Variant spec could be in Apache Arrow as a canonical extension type. >>>>>>>> There >>>>>>>> is an issue for this [1]. 
I think the main thing on where this is >>>>>>>> housed >>>>>>>> is which types are intended to be supported. I believe Arrow is >>>>>>>> currently >>>>>>>> a superset of the Iceberg type system (UUID is supported as a canonical >>>>>>>> extension type [2]). >>>>>>>> >>>>>>>> For point 4 subcolumnarization, I think ideally this belongs in >>>>>>>> Iceberg (and if Iceberg and Delta Lake can agree on how to do it that >>>>>>>> would >>>>>>>> be great) with potential consultation with Parquet/ORC communities to >>>>>>>> potentially add better native support. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Micah >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> [1] https://github.com/apache/arrow/issues/42069 >>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html >>>>>>>> >>>>>>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Thanks for the discussion and feedback. >>>>>>>>> >>>>>>>>> Do we have the consensus on point 1 and point 3 to move forward >>>>>>>>> with Spark variant encoding and support Variant type only? Or let me >>>>>>>>> know >>>>>>>>> how to proceed from here. >>>>>>>>> >>>>>>>>> Regarding point 2, I also feel Iceberg is more natural to host >>>>>>>>> such a subproject for variant spec and implementation. But let me >>>>>>>>> reach out >>>>>>>>> to the Spark community to discuss. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Aihua >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Agreed with point 1. >>>>>>>>>> >>>>>>>>>> For point 2, I also prefer to hold the spec and reference >>>>>>>>>> implementation under Iceberg. Here are the reasons: >>>>>>>>>> 1. It is unconventional and impractical for one engine to depend >>>>>>>>>> on another for data types. For instance, it is not ideal for Trino >>>>>>>>>> to rely >>>>>>>>>> on data types defined by the Spark engine. >>>>>>>>>> 2. Iceberg serves as a bridge between engines and file formats. 
>>>>>>>>>> By centralizing the specification in Iceberg, any future >>>>>>>>>> optimizations or >>>>>>>>>> updates to file formats can be referred to within Iceberg, ensuring >>>>>>>>>> consistency and reducing dependencies. >>>>>>>>>> >>>>>>>>>> For point 3, I'd prefer to support the variant type only at this >>>>>>>>>> moment. >>>>>>>>>> >>>>>>>>>> Yufei >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue >>>>>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>>> >>>>>>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support >>>>>>>>>>> only variant for point 3. >>>>>>>>>>> >>>>>>>>>>> We'll need to work with the Spark community to find a good place >>>>>>>>>>> for the library and spec, since it touches many different projects. >>>>>>>>>>> I'd >>>>>>>>>>> also prefer Iceberg as the home. >>>>>>>>>>> >>>>>>>>>>> I also think it's a good idea to get subcolumnarization into our >>>>>>>>>>> spec when we update. Without that I think the feature will be fairly >>>>>>>>>>> limited. >>>>>>>>>>> >>>>>>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer < >>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I'm aligned with point 1. >>>>>>>>>>>> >>>>>>>>>>>> For point 2 I think we should choose quickly; I honestly do >>>>>>>>>>>> think this would be fine as part of the Iceberg Spec directly but >>>>>>>>>>>> understand it may be better for the broader community if it were >>>>>>>>>>>> a sub >>>>>>>>>>>> project. As a sub-project I would still prefer it being an Iceberg >>>>>>>>>>>> Subproject since we are engine/file-format agnostic. >>>>>>>>>>>> >>>>>>>>>>>> 3. I support adding just Variant. >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hello community, >>>>>>>>>>>>> >>>>>>>>>>>>> It’s great to sync up with some of you on Variant and >>>>>>>>>>>>> SubColumnarization support in Iceberg again. 
Apologies that I >>>>>>>>>>>>> didn’t record >>>>>>>>>>>>> the meeting, but here are some key items that we want to follow up >>>>>>>>>>>>> with the >>>>>>>>>>>>> community. >>>>>>>>>>>>> >>>>>>>>>>>>> 1. Adopt Spark Variant encoding >>>>>>>>>>>>> Those present were in favor of adopting the Spark variant >>>>>>>>>>>>> encoding for Iceberg Variant with extensions to support other >>>>>>>>>>>>> Iceberg >>>>>>>>>>>>> types. We would like to know if anyone has an objection to this >>>>>>>>>>>>> reuse of an >>>>>>>>>>>>> open source encoding. >>>>>>>>>>>>> >>>>>>>>>>>>> 2. Movement of the Spark Variant Spec to another project >>>>>>>>>>>>> To avoid introducing Apache Spark as a dependency for the >>>>>>>>>>>>> engines and file formats, we discussed separating the Spark Variant >>>>>>>>>>>>> encoding >>>>>>>>>>>>> spec and implementation from the Spark Project to a neutral >>>>>>>>>>>>> location. We >>>>>>>>>>>>> thought up several solutions but didn’t have consensus on any of >>>>>>>>>>>>> them. We >>>>>>>>>>>>> are looking for more feedback on this topic from the community, >>>>>>>>>>>>> either in >>>>>>>>>>>>> terms of support for one of these options or another idea on how >>>>>>>>>>>>> to support >>>>>>>>>>>>> the spec. >>>>>>>>>>>>> >>>>>>>>>>>>> Options Proposed: >>>>>>>>>>>>> * Leave the Spec in Spark (Difficult for versioning and other >>>>>>>>>>>>> engines) >>>>>>>>>>>>> * Copying the Spec into Iceberg Project Directly (Difficult >>>>>>>>>>>>> for other Table Formats) >>>>>>>>>>>>> * Creating a Sub-Project of Apache Iceberg and moving the spec >>>>>>>>>>>>> and reference implementation there (Logistically complicated) >>>>>>>>>>>>> * Creating a Sub-Project of Apache Spark and moving the spec >>>>>>>>>>>>> and reference implementation there (Logistically complicated) >>>>>>>>>>>>> >>>>>>>>>>>>> 3. Add Variant type vs. Variant and JSON types >>>>>>>>>>>>> Those who were present were in favor of adding only the >>>>>>>>>>>>> Variant type to Iceberg. 
We are looking for anyone who has an >>>>>>>>>>>>> objection to >>>>>>>>>>>>> going forward with just the Variant Type and no Iceberg JSON >>>>>>>>>>>>> Type. We were >>>>>>>>>>>>> favoring adding the Variant type only because: >>>>>>>>>>>>> * Introducing a JSON type would require engines that only >>>>>>>>>>>>> support VARIANT to do write-time validation of their input to a >>>>>>>>>>>>> JSON >>>>>>>>>>>>> column. If they don’t have a JSON type an engine wouldn’t support >>>>>>>>>>>>> this. >>>>>>>>>>>>> * Engines which don’t support Variant will work most of the >>>>>>>>>>>>> time but can have fallback strings defined in the spec for reading >>>>>>>>>>>>> unsupported types. Writing a JSON into a Variant will always work. >>>>>>>>>>>>> >>>>>>>>>>>>> 4. Support for Subcolumnization spec (shredding in Spark) >>>>>>>>>>>>> We have no action items on this but would like to follow up on >>>>>>>>>>>>> discussions on Subcolumnization in the future. >>>>>>>>>>>>> * We had general agreement that this should be included in >>>>>>>>>>>>> Iceberg V3 or else adding variant may not be useful. >>>>>>>>>>>>> * We are interested in also adopting the shredding spec from >>>>>>>>>>>>> Spark and would like to move it to whatever place we decide the >>>>>>>>>>>>> Variant >>>>>>>>>>>>> spec is going to live. >>>>>>>>>>>>> >>>>>>>>>>>>> Let us know if we missed anything and if you have any additional >>>>>>>>>>>>> thoughts or suggestions. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Aihua >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote: >>>>>>>>>>>>> > Thanks for the discussion. >>>>>>>>>>>>> > >>>>>>>>>>>>> > I will move forward to work on the spec PR. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Regarding the implementation, we will have a module for >>>>>>>>>>>>> Variant support in Iceberg so we will not have to bring in Spark >>>>>>>>>>>>> libraries. 
>>>>>>>>>>>>> > >>>>>>>>>>>>> > I'm reposting the meeting invite in case it's not clear in >>>>>>>>>>>>> my original email since I included it at the end. Looks like we >>>>>>>>>>>>> don't have >>>>>>>>>>>>> major objections/divergences but let's sync up and have consensus. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Meeting invite: >>>>>>>>>>>>> > >>>>>>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>>>>>> > Time zone: America/Los_Angeles >>>>>>>>>>>>> > Google Meet joining info >>>>>>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq >>>>>>>>>>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>>>>>>>>>> > More phone numbers: >>>>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>>>>>> > >>>>>>>>>>>>> > Thanks, >>>>>>>>>>>>> > Aihua >>>>>>>>>>>>> > >>>>>>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >>>>>>>>>>>>> > > I don't think this needs to hold up the PR but I think >>>>>>>>>>>>> coming to a >>>>>>>>>>>>> > > consensus on the exact set of types supported is >>>>>>>>>>>>> worthwhile (and if the >>>>>>>>>>>>> > > goal is to maintain the same set as specified by the Spark >>>>>>>>>>>>> Variant type or >>>>>>>>>>>>> > > if divergence is expected/allowed). From a fragmentation >>>>>>>>>>>>> perspective it >>>>>>>>>>>>> > > would be a shame if they diverge, so maybe a next step is >>>>>>>>>>>>> also suggesting >>>>>>>>>>>>> > > support to the Spark community on the missing existing >>>>>>>>>>>>> Iceberg types? >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > Thanks, >>>>>>>>>>>>> > > Micah >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >>>>>>>>>>>>> russell.spit...@gmail.com> >>>>>>>>>>>>> > > wrote: >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR >>>>>>>>>>>>> now. We can get >>>>>>>>>>>>> > > > feedback there from everyone. 
>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >>>>>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>>>>> > > > wrote: >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get >>>>>>>>>>>>> their feedback in >>>>>>>>>>>>> > > >> parallel to getting the spec changes started. Piotr >>>>>>>>>>>>> didn't seem to object >>>>>>>>>>>>> > > >> to the encoding from what I read of his comments. >>>>>>>>>>>>> Hopefully he (and others) >>>>>>>>>>>>> > > >> chime in here. >>>>>>>>>>>>> > > >> >>>>>>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >>>>>>>>>>>>> > > >> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > > >> >>>>>>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on >>>>>>>>>>>>> board as >>>>>>>>>>>>> > > >>> representatives of Flink and Trino engines. Also make >>>>>>>>>>>>> sure we have anyone >>>>>>>>>>>>> > > >>> else chime in who has experience with Ray if possible. >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> > > >>> Spec changes feel like the right next step. >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >>>>>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>>>>> > > >>> wrote: >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal has >>>>>>>>>>>>> been out for >>>>>>>>>>>>> > > >>>> quite a while and I don't see any major objections to >>>>>>>>>>>>> using the Spark >>>>>>>>>>>>> > > >>>> encoding. It's quite well designed and fits the need >>>>>>>>>>>>> well. It can also be >>>>>>>>>>>>> > > >>>> extended to support additional types that are missing >>>>>>>>>>>>> if that's a priority. >>>>>>>>>>>>> > > >>>> >>>>>>>>>>>>> > > >>>> Should we move forward by starting a draft of the >>>>>>>>>>>>> changes to the table >>>>>>>>>>>>> > > >>>> spec? 
Then we can vote on committing those changes >>>>>>>>>>>>> and get moving on an >>>>>>>>>>>>> > > >>>> implementation (or possibly do the implementation in >>>>>>>>>>>>> parallel). >>>>>>>>>>>>> > > >>>> >>>>>>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >>>>>>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>> >>>>>>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module. >>>>>>>>>>>>> > > >>>>> >>>>>>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >>>>>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>>>>> > > >>>>> wrote: >>>>>>>>>>>>> > > >>>>> >>>>>>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land in >>>>>>>>>>>>> parquet proper >>>>>>>>>>>>> > > >>>>>> right? >>>>>>>>>>>>> > > >>>>>> >>>>>>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it >>>>>>>>>>>>> should end up. >>>>>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module from >>>>>>>>>>>>> it? >>>>>>>>>>>>> > > >>>>>> >>>>>>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >>>>>>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>>>> >>>>>>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land in >>>>>>>>>>>>> parquet proper >>>>>>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg >>>>>>>>>>>>> though for the time >>>>>>>>>>>>> > > >>>>>>> being. >>>>>>>>>>>>> > > >>>>>>> >>>>>>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >>>>>>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>>>>>> > > >>>>>>> >>>>>>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought >>>>>>>>>>>>> this up in his >>>>>>>>>>>>> > > >>>>>>>> last email: >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> > do we have an issue to directly use Spark >>>>>>>>>>>>> implementation in >>>>>>>>>>>>> > > >>>>>>>> Iceberg? 
>>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the >>>>>>>>>>>>> Spark library. What >>>>>>>>>>>>> > > >>>>>>>> do you think about a Java implementation in >>>>>>>>>>>>> Iceberg? >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> Ryan >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >>>>>>>>>>>>> b...@databricks.com> >>>>>>>>>>>>> > > >>>>>>>> wrote: >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in a >>>>>>>>>>>>> comment on the doc >>>>>>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact >>>>>>>>>>>>> that would be a much >>>>>>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions of >>>>>>>>>>>>> Spark, but even then I >>>>>>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to >>>>>>>>>>>>> depend on that because it is a >>>>>>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a >>>>>>>>>>>>> ton of Scala libs. I think >>>>>>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an >>>>>>>>>>>>> independent implementation of the >>>>>>>>>>>>> > > >>>>>>>>> spec in Iceberg. >>>>>>>>>>>>> > > >>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >>>>>>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>> Hi Aihua, >>>>>>>>>>>>> > > >>>>>>>>>> Long time no see :) >>>>>>>>>>>>> > > >>>>>>>>>> Would this mean that every engine which plans >>>>>>>>>>>>> to support Variant >>>>>>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? >>>>>>>>>>>>> Like Flink/Trino/Hive etc? 
>>>>>>>>>>>>> > > >>>>>>>>>> Thanks, Peter >>>>>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu < >>>>>>>>>>>>> aihu...@apache.org> wrote: >>>>>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> Thanks Ryan. >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue >>>>>>>>>>>>> Spark encoding to >>>>>>>>>>>>> > > >>>>>>>>>>> keep compatibility for the open source engines. >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding >>>>>>>>>>>>> implementation: do we >>>>>>>>>>>>> > > >>>>>>>>>>> have an issue to directly use Spark >>>>>>>>>>>>> implementation in Iceberg? Russell >>>>>>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark >>>>>>>>>>>>> dependency and that could be a >>>>>>>>>>>>> > > >>>>>>>>>>> problem? >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> Thanks, >>>>>>>>>>>>> > > >>>>>>>>>>> Aihua >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua! >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the >>>>>>>>>>>>> current doc is a good >>>>>>>>>>>>> > > >>>>>>>>>>> one. I went >>>>>>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it >>>>>>>>>>>>> looks like a >>>>>>>>>>>>> > > >>>>>>>>>>> better choice than >>>>>>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly >>>>>>>>>>>>> accessing nested >>>>>>>>>>>>> > > >>>>>>>>>>> fields. 
>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that >>>>>>>>>>>>> this is what >>>>>>>>>>>>> > > >>>>>>>>>>> Delta's variant >>>>>>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables >>>>>>>>>>>>> written by Delta >>>>>>>>>>>>> > > >>>>>>>>>>> could be >>>>>>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without >>>>>>>>>>>>> needing to rewrite >>>>>>>>>>>>> > > >>>>>>>>>>> variant >>>>>>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks >>>>>>>>>>>>> and have an >>>>>>>>>>>>> > > >>>>>>>>>>> interest in >>>>>>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.) >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > Ryan >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >>>>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >>>>>>>>>>>>> > > >>>>>>>>>>> > wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > It’s great to be able to present the >>>>>>>>>>>>> Variant type proposal >>>>>>>>>>>>> > > >>>>>>>>>>> in the >>>>>>>>>>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking >>>>>>>>>>>>> to host a meeting >>>>>>>>>>>>> > > >>>>>>>>>>> next week >>>>>>>>>>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over >>>>>>>>>>>>> any further >>>>>>>>>>>>> > > >>>>>>>>>>> concerns about the >>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other >>>>>>>>>>>>> questions on the >>>>>>>>>>>>> > > >>>>>>>>>>> first phase of >>>>>>>>>>>>> > > >>>>>>>>>>> > > the proposal >>>>>>>>>>>>> > > >>>>>>>>>>> > > < >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>>>>>> > > >>>>>>>>>>> >. 
>>>>>>>>>>> >. >>>>>>>>>>> > > We are hoping that anyone who is >>>>>>>>>>> interested in the proposal >>>>>>>>>>> can either join >>>>>>>>>>> > > or reply with their comments so we can >>>>>>>>>>> discuss them. Summary >>>>>>>>>>> of the >>>>>>>>>>> > > discussion and notes will be sent to the >>>>>>>>>>> mailing list for >>>>>>>>>>> further comment >>>>>>>>>>> > > there. >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>>> > > >>>>>>>>>>> > > What should be the underlying binary >>>>>>>>>>> representation >>>>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the >>>>>>>>>>> doc including ION, >>>>>>>>>>> JSONB, and >>>>>>>>>>> > > Spark encoding. Choosing the underlying >>>>>>>>>>> encoding is an >>>>>>>>>>> important first step >>>>>>>>>>> > > here and we believe we have general >>>>>>>>>>> support for Spark’s >>>>>>>>>>> Variant encoding. >>>>>>>>>>> > > We would like to hear if anyone else has >>>>>>>>>>> strong opinions in >>>>>>>>>>> this space. >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>>> > > >>>>>>>>>>> > > Should we support multiple logical >>>>>>>>>>> types or just Variant? >>>>>>>>>>> Variant vs. >>>>>>>>>>> > > Variant + JSON. >>>>>>>>>>> > > >>>>>>>>>>> > > This is to discuss what logical data >>>>>>>>>>> type(s) to be supported >>>>>>>>>>> in Iceberg - >>>>>>>>>>> > > Variant only vs. Variant + JSON. 
Both >>>>>>>>>>>>> types would share the >>>>>>>>>>>>> > > >>>>>>>>>>> same underlying >>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding but would imply different >>>>>>>>>>>>> limitations on engines >>>>>>>>>>>>> > > >>>>>>>>>>> working with >>>>>>>>>>>>> > > >>>>>>>>>>> > > those types. >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > From the sync up meeting, we are more >>>>>>>>>>>>> favoring toward >>>>>>>>>>>>> > > >>>>>>>>>>> supporting Variant >>>>>>>>>>>>> > > >>>>>>>>>>> > > only and we want to have a consensus on >>>>>>>>>>>>> the supported >>>>>>>>>>>>> > > >>>>>>>>>>> type(s). >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > How should we move forward with >>>>>>>>>>>>> Subcolumnization? >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for >>>>>>>>>>>>> Variant type by >>>>>>>>>>>>> > > >>>>>>>>>>> separating out >>>>>>>>>>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This >>>>>>>>>>>>> is not critical for >>>>>>>>>>>>> > > >>>>>>>>>>> choosing the >>>>>>>>>>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we >>>>>>>>>>>>> were hoping to >>>>>>>>>>>>> > > >>>>>>>>>>> gain consensus on >>>>>>>>>>>>> > > >>>>>>>>>>> > > leaving that for a follow up spec. 
>>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Thanks >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Aihua >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Meeting invite: >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >>>>>>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info >>>>>>>>>>>>> > > >>>>>>>>>>> > > Video call link: >>>>>>>>>>>>> https://meet.google.com/pbm-ovzn-aoq >>>>>>>>>>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 >>>>>>>>>>>>> 576 525# >>>>>>>>>>>>> > > >>>>>>>>>>> > > More phone numbers: >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >>>>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > >> Hello, >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal >>>>>>>>>>>>> > > >>>>>>>>>>> > >> < >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help review >>>>>>>>>>>>> and comment. >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >> Thanks, >>>>>>>>>>>>> > > >>>>>>>>>>> > >> Aihua >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >>>>>>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. 
We also had the same discussion internally, and a JSON type would really
play well with, for example, the SUPER type in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
and can also provide better integration with the Trino JSON type.

Looking forward to the proposal!

Best,
Jack Ye

On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
<tyler.aki...@snowflake.com.invalid> wrote:

On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:

>> We may need some guidance on just how many we need to look at;
>> we were planning on Spark and Trino, but weren't sure how much
>> further down the rabbit hole we needed to go.
>
> There are some engines living outside the Java world.
> It would be good if the proposal could cover the effort it takes to
> integrate the variant type into them (e.g. velox, datafusion, etc.).
> This is something that some proprietary Iceberg vendors also care about.

Ack, makes sense. We can make sure to share some perspective on this.

>> Not necessarily, no. As long as there's a binary type and Iceberg and
>> the query engines are aware that the binary column needs to be
>> interpreted as a variant, that should be sufficient.
>
> From the perspective of interoperability, it would be good to support a
> native type in the file specs. Life will be easier for projects like
> Apache XTable. The file format could also provide finer-grained
> statistics for the variant type, which facilitates data skipping.
Agreed, there can definitely be additional value in native file format
integration. Just wanted to highlight that it's not a strict requirement.

-Tyler

On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
<tyler.aki...@snowflake.com.invalid> wrote:

Good to see you again as well, JB! Thanks!

-Tyler

On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi Tyler,

Super happy to see you there :) It reminds me of our discussions back at
the start of Apache Beam :)

Anyway, the thread is pretty interesting. I remember some discussions
about a JSON datatype for spec v3.
The binary data type is already supported in the spec v2.

I'm looking forward to the proposal and happy to help on this!

Regards
JB

On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
<tyler.aki...@snowflake.com.invalid> wrote:

Hello,

We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which
we'd like to get early feedback from the community. As you may know,
Snowflake has embraced Iceberg as its open Data Lake format. Having made
good progress on our own adoption of the Iceberg standard, we're now in a
position where there are features not yet supported in Iceberg which we
think would be valuable for our users, and that we would like to discuss
with and help contribute to the Iceberg community.
The first two such features we'd like to discuss are in support of
efficient querying of dynamically typed, semi-structured data: variant
data types, and subcolumnarization of variant columns. In more detail,
for anyone who may not already be familiar:

1. Variant data types

Variant types allow for the efficient binary encoding of dynamic
semi-structured data such as JSON, Avro, etc. By encoding semi-structured
data as a variant column, we retain the flexibility of the source data
while allowing query engines to operate on the data more efficiently.
Snowflake has supported the variant data type on Snowflake tables for
many years [1]. As more and more users utilize Iceberg tables in
Snowflake, we're hearing an increasing chorus of requests for variant
support.
Additionally, other query engines such as Apache Spark have begun adding
variant support [2]. As such, we believe it would be beneficial to the
Iceberg community as a whole to standardize on the variant data type
encoding used across Iceberg tables.

One specific point to make here is that, since an Apache OSS version of
variant encoding already exists in Spark, it likely makes sense to simply
adopt the Spark encoding as the Iceberg standard as well. The encoding we
use internally today in Snowflake is slightly different, but essentially
equivalent, and we see no particular value in trying to clutter the space
with another equivalent-but-incompatible encoding.
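For readers unfamiliar with the Spark encoding referenced above [2], the basic idea is a split between a metadata buffer (a dictionary of field names) and a value buffer that refers to field names by dictionary id. A toy sketch of that split follows; this is not the actual byte layout, only an illustration of the concept, and the function name is invented for this sketch.

```python
# Toy illustration of the metadata/value split used by variant encodings:
# field names are collected once into a shared dictionary ("metadata"),
# and the value tree refers to them by id. This is NOT the real Spark
# byte format, only a sketch of the concept.
def encode_variant(obj):
    names = []  # metadata: field-name dictionary, shared across the value

    def enc(v):
        if isinstance(v, dict):
            out = {}
            for key, val in v.items():
                if key not in names:
                    names.append(key)
                out[names.index(key)] = enc(val)  # key replaced by dict id
            return ("object", out)
        if isinstance(v, list):
            return ("array", [enc(x) for x in v])
        return ("primitive", v)  # real encodings tag a physical type here

    return names, enc(obj)

metadata, value = encode_variant({"user": {"id": 1, "tags": ["a", "b"]}})
# metadata is the field-name dictionary: ["user", "id", "tags"]
```

Keeping field names out of the value buffer is what lets repeated keys across many rows stay cheap, and it is also why equivalent-but-incompatible encodings are worth avoiding: readers must agree on how dictionary ids resolve.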
2. Subcolumnarization

Subcolumnarization of variant columns allows query engines to efficiently
prune datasets when subcolumns (i.e., nested fields) within a variant
column are queried, and also allows optionally materializing some of the
nested fields as columns of their own, affording queries on these
subcolumns the ability to read less data and spend less CPU on
extraction. When subcolumnarizing, the system managing table metadata and
data tracks individual pruning statistics (min, max, null, etc.) for some
subset of the nested fields within a variant, and also manages any
optional materialization. Without subcolumnarization, any query that
touches a variant column must read, parse, extract, and filter every row
for which that column is non-null.
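As a rough illustration of the pruning-statistics idea described above: for each tracked nested field, a writer collects min/max/null-count per file, and a scan can then skip files whose stats rule out a predicate. The helper names and stats layout here are invented for this sketch; the real metadata shape is exactly what a concrete proposal would need to define.

```python
# Illustrative per-file pruning stats for variant subcolumns: track
# min/max/null_count for chosen nested paths so a scan can skip files.
# Names and layout are hypothetical, not from any Iceberg spec.
def collect_stats(rows, paths):
    stats = {p: {"min": None, "max": None, "null_count": 0} for p in paths}
    for row in rows:
        for p in paths:
            v = row
            for part in p.split("."):  # walk the nested path, e.g. "a.b"
                v = v.get(part) if isinstance(v, dict) else None
                if v is None:
                    break
            s = stats[p]
            if v is None:
                s["null_count"] += 1
            else:
                s["min"] = v if s["min"] is None else min(s["min"], v)
                s["max"] = v if s["max"] is None else max(s["max"], v)
    return stats

def can_skip(stats, path, op, literal):
    """True if the file provably has no row matching `path OP literal`."""
    s = stats[path]
    if s["min"] is None:  # field is null in every row: nothing can match
        return True
    if op == ">":
        return s["max"] <= literal
    if op == "<":
        return s["min"] >= literal
    if op == "==":
        return literal < s["min"] or literal > s["max"]
    return False  # unknown operator: never skip
```

Note that without stats like these for nested fields, the engine falls back to the read-parse-extract-filter path described above for every non-null row.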
Thus, by providing a standardized way of tracking subcolumn metadata and
data for variant columns, Iceberg can make subcolumnar optimizations
accessible across various catalogs and query engines.

Subcolumnarization is a non-trivial topic, so we expect any concrete
proposal to include not only the set of changes to Iceberg metadata that
allow compatible query engines to interoperate on subcolumnarization data
for variant columns, but also reference documentation explaining
subcolumnarization principles and recommended best practices.

It sounds like the recent Geo proposal [3] may be a good starting point
for how to approach this, so our plan is to write something up in that
vein that covers the proposed spec changes, backwards compatibility,
implementor burdens, etc.
But we wanted to first reach out to the community to introduce ourselves
and the idea, and see if there's any early feedback we should incorporate
before we spend too much time on a concrete proposal.

Thank you!

[1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
[2] https://github.com/apache/spark/blob/master/common/variant/README.md
[3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit

-Tyler, Nileema, Selcuk, Aihua