IIUC all we are saying here is it is an engine decision on when it wants to
widen types for storage purposes and that might involve some loss in data
fidelity?

Yes, I think this is the cleanest path forward. Whether to widen int32 to
int64 or rewrite decimals at a larger scale is up to the engine. The role
of storage is to faithfully return what the engine writes: engines, not
storage, can modify the data.

The shredding spec allows widening in the engine or storing the original
values, but if you store the original values you may not get the
optimization from shredding: preserving original types is an uncommon use
case that we are choosing not to design for, to keep the spec simple.

I think that would mean the following when reading (a sketch in code
follows the list):

   1. If there are any non-null values in the untyped_value column, stats
   for typed_value should not be used to skip a file or row group
   2. Values from typed_value are not modified. If an engine chooses to
   store an int32 in an int64, the value is an int64 and no “original” type is
   kept or used.
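
For illustration, a minimal sketch of those two rules in Java, assuming a
shredded field is stored as sibling typed_value/untyped_value columns
(helper and type names here are hypothetical, not spec text):

    class ShreddedReadRules {
      // Minimal stand-in for per-column Parquet stats.
      record ColumnStats(long valueCount, long nullCount) {}

      // Rule 1: if untyped_value holds any non-null values, typed_value is
      // incomplete, so its min/max stats must not be used to skip a file or
      // row group.
      static boolean canUseTypedValueStats(ColumnStats untypedValueStats) {
        return untypedValueStats.nullCount() == untypedValueStats.valueCount();
      }

      // Rule 2: typed_value is returned as-is; no "original" type is kept.
      // If an engine stored an int32 as int64, the value simply is an int64.
      static Object read(Object typedValue, Object untypedValue) {
        return typedValue != null ? typedValue : untypedValue;
      }
    }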

I think the next question is whether we want to allow shredding to multiple
types (e.g., shred an ID to both string and int64). I think it would be
fairly clean to allow this in the spec.

Ryan

On Sat, Jul 27, 2024 at 5:50 AM Nick Riasanovsky <n...@bodo.ai> wrote:

> This seems reasonable to me in general, and I agree we should avoid
> significantly complicating the design for an uncommon use case. I would
> like to understand the implications for operations like file compaction. Is
> it now up to the engine's discretion how to combine files and whether
> decimal scale needs to be preserved?
>
> Thanks,
> Nick Riasanovsky
>
> On Sat, Jul 27, 2024 at 2:41 AM Aihua Xu <aihu...@gmail.com> wrote:
>
>> Let me know if I understand correctly: basically the spec will not
>> include any type promotion. E.g., if the chosen type for the subcolumn is
>> int64, then only int64 values will be encoded in `typed_value`; values of
>> other types, including int32, will be encoded in `untyped_value`.
>> Similarly, if the chosen type is decimal(10, 5), values of decimal(10, 2)
>> will be encoded in `untyped_value`.
>>
>> This looks clean to me.
>>
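For illustration, the routing rule described above could look roughly like
this, assuming the chosen shred type for the field is decimal(10, 5)
(names are hypothetical, not spec text):

    import java.math.BigDecimal;
    import java.util.List;

    class ShreddedWriteRule {
      // Only an exact match of the chosen type goes into typed_value.
      static boolean matchesDecimal10x5(BigDecimal v) {
        return v.scale() == 5 && v.precision() <= 10;
      }

      static void route(BigDecimal v, List<BigDecimal> typedValue,
                        List<BigDecimal> untypedValue) {
        if (matchesDecimal10x5(v)) {
          typedValue.add(v);    // stored losslessly as decimal(10, 5)
        } else {
          untypedValue.add(v);  // e.g. decimal(10, 2) keeps its original scale
        }
      }
    }

The same check for an int64 shred would accept only long values, sending
int32 values to untyped_value.
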
>> On Fri, Jul 26, 2024 at 3:11 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> If we aren't optimizing for strict decimal behavior, then I think the
>>>> cleanest option is to use `untyped_value` when decimal scales need to be
>>>> preserved. I would also remove language from the shredding spec about
>>>> storage modifying values so that this is an engine concern. The storage
>>>> spec should state how you can encode values, without making recommendations
>>>> about modifying those values. If an engine's semantics for variant allow it
>>>> to modify value types, then that's up to the engine.
>>>
>>>
>>> If I understand this correctly, this also applies to int32/int64
>>> conversions?  This sounds like a good solution to me as well (IIUC all we
>>> are saying here is it is an engine decision on when it wants to widen types
>>> for storage purposes and that might involve some loss in data fidelity?).
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Fri, Jul 26, 2024 at 9:55 AM Ryan Blue <b...@databricks.com.invalid>
>>> wrote:
>>>
>>>> As a follow up, I talked with Russell quite a bit about losing types
>>>> after the meeting and he convinced me that while there are valid use cases
>>>> for strict decimal behavior, the majority of cases are either engines using
>>>> decimal to keep track of the original number of digits in a number or
>>>> people that simply want to limit the number of digits. In that case, I
>>>> think the natural conclusion is that it should be _possible_ to have strict
>>>> behavior but we should not increase complexity too much to optimize for it.
>>>>
>>>> If we aren't optimizing for strict decimal behavior, then I think the
>>>> cleanest option is to use `untyped_value` when decimal scales need to be
>>>> preserved. I would also remove language from the shredding spec about
>>>> storage modifying values so that this is an engine concern. The storage
>>>> spec should state how you can encode values, without making recommendations
>>>> about modifying those values. If an engine's semantics for variant allow it
>>>> to modify value types, then that's up to the engine.
>>>>
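A small illustration of why rescaling is lossy for strict decimal
semantics (illustrative Java, not from the spec):

    import java.math.BigDecimal;

    class DecimalScaleLoss {
      public static void main(String[] args) {
        BigDecimal original = new BigDecimal("1.25"); // a decimal(10, 2) value
        BigDecimal widened = original.setScale(5);    // 1.25000 as decimal(10, 5)
        System.out.println(widened.compareTo(original) == 0);    // true: same number
        System.out.println(widened.scale() == original.scale()); // false: scale 2 is gone
      }
    }
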
>>>> In the discussion, I wasn't the only person in favor of not modifying
>>>> decimal scales, but I'm curious if this distinction satisfies everyone. If
>>>> we remove the wording from the proposal that recommends modifying decimals
>>>> and leave this to the engine, do we have agreement?
>>>>
>>>> On Thu, Jul 25, 2024 at 6:46 PM Aihua Xu <aihu...@gmail.com> wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> Thanks for joining the meeting to discuss variant shredding. For those
>>>>> who were unable to attend the meeting, please check out the recorded
>>>>> meeting
>>>>> <https://drive.google.com/file/d/1kiwv29nxxOqMCbxXn-NRoz-x2E9yIMlJ/view?usp=drive_link>
>>>>>  if
>>>>> you are interested. Also, to follow up on the meeting and converge on
>>>>> the shredding lossiness discussion offline, I have converted the Spark
>>>>> shredding proposal by David into a Google doc
>>>>> <https://docs.google.com/document/d/1JeBt4NIju08jQ2AbludiK-U0M9ISIgysP7fUDWtv7rg/edit>;
>>>>> please comment.
>>>>>
>>>>> Thanks,
>>>>> Aihua
>>>>>
>>>>>
>>>>> On Thu, Jul 25, 2024 at 10:14 AM Aihua Xu <aihu...@gmail.com> wrote:
>>>>>
>>>>>> Yes. This time I was able to record it and I will share it when it’s
>>>>>> processed.
>>>>>>
>>>>>>
>>>>>> On Jul 25, 2024, at 10:01 AM, Amogh Jahagirdar <2am...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> 
>>>>>> Any chance this meeting was recorded? I couldn't make it but would be
>>>>>> interested in catching up on the discussion.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Amogh Jahagirdar
>>>>>>
>>>>>> On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks folks for additional discussion.
>>>>>>>
>>>>>>> There are some questions related to subcolumnization (Spark
>>>>>>> shredding - see the discussion
>>>>>>> <https://github.com/apache/spark/pull/46831>) and we would like to
>>>>>>> host another meeting mainly to discuss that, since we plan to adopt it.
>>>>>>> We can also follow up on the Spark variant topics (I can see that we
>>>>>>> are mostly aligned, with the exception of finding a place for the spec
>>>>>>> and implementation). Looking forward to meeting with you. BTW: should I
>>>>>>> include dev@iceberg.apache.org in the email invite?
>>>>>>>
>>>>>>> Sync up on Variant subcolumnization (shredding)
>>>>>>> Thursday, July 25 · 8:00 – 9:00am
>>>>>>> Time zone: America/Los_Angeles
>>>>>>> Google Meet joining info
>>>>>>> Video call link: https://meet.google.com/mug-dvnv-hnq
>>>>>>> Or dial: ‪(US) +1 904-900-0730‬ PIN: ‪671 997 419‬#
>>>>>>> More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aihua
>>>>>>>
>>>>>>> On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm late replying to this but I'm also in agreement with 1
>>>>>>>> (adopting the Spark variant encoding), 3 (specifically only having a
>>>>>>>> variant type), and 4 (ensuring we are thinking through 
>>>>>>>> subcolumnarization
>>>>>>>> upfront since without it the variant type may not be that useful).
>>>>>>>>
>>>>>>>> I'd also support having the spec, and reference implementation in
>>>>>>>> Iceberg; as others have said, it centralizes improvements in a single,
>>>>>>>> agnostic dependency for engines, rather than engines having to take
>>>>>>>> dependencies on other engine modules.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have been looking around to see how we can map the Variant type in
>>>>>>>>> Flink. I have not found any existing type which we could use, but
>>>>>>>>> Flink already has some JSON parsing capabilities [1] for string fields.
>>>>>>>>>
>>>>>>>>> So until we have native support in Flink for something similar to the
>>>>>>>>> Variant type, I expect that we need to map it to JSON strings in
>>>>>>>>> RowData.
>>>>>>>>>
>>>>>>>>> Based on that, here are my preferences:
>>>>>>>>> 1. I'm ok with adopting the Spark Variant type, if we build our own
>>>>>>>>> Iceberg serializer/deserializer module for it
>>>>>>>>> 2. I prefer to move the spec to Iceberg, so we own it and can extend
>>>>>>>>> it if needed. This could be important in the first phase. Later, when
>>>>>>>>> it is more stable, we might donate it to some other project, like
>>>>>>>>> Parquet
>>>>>>>>> 3. I would prefer to support only a single type, and Variant is the
>>>>>>>>> more expressive one, but having a standard way to convert between
>>>>>>>>> JSON and Variant would be useful for Flink users.
>>>>>>>>> 4. On subcolumnarization: I think Flink will only use this feature as
>>>>>>>>> much as the Iceberg readers implement it, so I would like to see as
>>>>>>>>> much of it as possible in the common Iceberg code
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> [1] -
>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for the late reply.  I agree with the sentiments on 1 and 3
>>>>>>>>>> that have already been posted (adopt the Spark encoding, and only 
>>>>>>>>>> have the
>>>>>>>>>> Variant type).  As mentioned on the doc for 3, I think it would be 
>>>>>>>>>> good to
>>>>>>>>>> specify how to map scalar types to a JSON representation so there 
>>>>>>>>>> can be
>>>>>>>>>> consistency between engines that don't support variant.
>>>>>>>>>>
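As a sketch of what such a mapping might look like (purely illustrative;
the actual mappings would be whatever the spec ends up defining):

    import java.time.Instant;
    import java.util.Base64;

    class VariantJsonFallback {
      static String toJsonLiteral(Object v) {
        if (v == null) return "null";
        if (v instanceof Long || v instanceof Integer
            || v instanceof Double || v instanceof Boolean) {
          return v.toString();                // types JSON represents natively
        }
        if (v instanceof byte[] bytes) {      // JSON has no binary: base64 string
          return "\"" + Base64.getEncoder().encodeToString(bytes) + "\"";
        }
        if (v instanceof Instant ts) {        // timestamps as ISO-8601 strings
          return "\"" + ts + "\"";
        }
        return "\"" + v + "\"";               // anything else: quoted (escaping omitted)
      }
    }
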
>>>>>>>>>>
>>>>>>>>>>> Regarding point 2, I also feel Iceberg is more natural to host
>>>>>>>>>>> such a subproject for variant spec and implementation. But let me 
>>>>>>>>>>> reach out
>>>>>>>>>>> to the Spark community to discuss.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The only other place I can think of that might be a good home for
>>>>>>>>>> the Variant spec could be in Apache Arrow, as a canonical extension
>>>>>>>>>> type. There is an issue for this [1].  I think the main consideration
>>>>>>>>>> for where this is housed is which types are intended to be supported.
>>>>>>>>>> I believe Arrow is currently a superset of the Iceberg type system
>>>>>>>>>> (UUID is supported as a canonical extension type [2]).
>>>>>>>>>>
>>>>>>>>>> For point 4, subcolumnarization, I think ideally this belongs in
>>>>>>>>>> Iceberg (and if Iceberg and Delta Lake can agree on how to do it,
>>>>>>>>>> that would be great), with consultation with the Parquet/ORC
>>>>>>>>>> communities to potentially add better native support.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/arrow/issues/42069
>>>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>>>>>>>>>
>>>>>>>>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the discussion and feedback.
>>>>>>>>>>>
>>>>>>>>>>> Do we have consensus on points 1 and 3, to move forward with the
>>>>>>>>>>> Spark variant encoding and support the Variant type only? Or let me
>>>>>>>>>>> know how to proceed from here.
>>>>>>>>>>>
>>>>>>>>>>> Regarding point 2, I also feel Iceberg is more natural to host
>>>>>>>>>>> such a subproject for variant spec and implementation. But let me 
>>>>>>>>>>> reach out
>>>>>>>>>>> to the Spark community to discuss.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Aihua
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Agreed with point 1.
>>>>>>>>>>>>
>>>>>>>>>>>> For point 2, I also prefer to hold the spec and reference
>>>>>>>>>>>> implementation under Iceberg. Here are the reasons:
>>>>>>>>>>>> 1. It is unconventional and impractical for one engine to
>>>>>>>>>>>> depend on another for data types. For instance, it is not ideal 
>>>>>>>>>>>> for Trino
>>>>>>>>>>>> to rely on data types defined by the Spark engine.
>>>>>>>>>>>> 2. Iceberg serves as a bridge between engines and file formats.
>>>>>>>>>>>> By centralizing the specification in Iceberg, any future 
>>>>>>>>>>>> optimizations or
>>>>>>>>>>>> updates to file formats can be referred to within Iceberg, ensuring
>>>>>>>>>>>> consistency and reducing dependencies.
>>>>>>>>>>>>
>>>>>>>>>>>> For point 3, I'd prefer to support the variant type only at
>>>>>>>>>>>> this moment.
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue
>>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support
>>>>>>>>>>>>> only variant for point 3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We'll need to work with the Spark community to find a good
>>>>>>>>>>>>> place for the library and spec, since it touches many different 
>>>>>>>>>>>>> projects.
>>>>>>>>>>>>> I'd also prefer Iceberg as the home.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also think it's a good idea to get subcolumnarization into
>>>>>>>>>>>>> our spec when we update. Without that I think the feature will be 
>>>>>>>>>>>>> fairly
>>>>>>>>>>>>> limited.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <
>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm aligned with point 1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For point 2, I think we should choose quickly. I honestly do
>>>>>>>>>>>>>> think this would be fine as part of the Iceberg spec directly,
>>>>>>>>>>>>>> but understand it may be better for the broader community if it
>>>>>>>>>>>>>> were a sub-project. As a sub-project, I would still prefer it
>>>>>>>>>>>>>> being an Iceberg sub-project since we are engine/file-format
>>>>>>>>>>>>>> agnostic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. I support adding just Variant.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello community,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It’s great to sync up with some of you on Variant and
>>>>>>>>>>>>>>> Subcolumnarization support in Iceberg again. Apologies that I
>>>>>>>>>>>>>>> didn’t record the meeting, but here are some key items that we
>>>>>>>>>>>>>>> want to follow up on with the community.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Adopt Spark Variant encoding
>>>>>>>>>>>>>>> Those present were in favor of adopting the Spark variant
>>>>>>>>>>>>>>> encoding for Iceberg Variant, with extensions to support other
>>>>>>>>>>>>>>> Iceberg types. We would like to know if anyone has an objection
>>>>>>>>>>>>>>> to reusing this open source encoding.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Movement of the Spark Variant Spec to another project
>>>>>>>>>>>>>>> To avoid introducing Apache Spark as a dependency for the
>>>>>>>>>>>>>>> engines and file formats, we discussed moving the Spark Variant
>>>>>>>>>>>>>>> encoding spec and implementation from the Spark project to a
>>>>>>>>>>>>>>> neutral location. We thought up several solutions but didn’t
>>>>>>>>>>>>>>> have consensus on any of them. We
>>>>>>>>>>>>>>> are looking for more feedback on this topic from the community 
>>>>>>>>>>>>>>> either in
>>>>>>>>>>>>>>> terms of support for one of these options or another idea on 
>>>>>>>>>>>>>>> how to support
>>>>>>>>>>>>>>> the spec.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Options Proposed:
>>>>>>>>>>>>>>> * Leave the Spec in Spark (Difficult for versioning and
>>>>>>>>>>>>>>> other engines)
>>>>>>>>>>>>>>> * Copying the Spec into Iceberg Project Directly (Difficult
>>>>>>>>>>>>>>> for other Table Formats)
>>>>>>>>>>>>>>> * Creating a Sub-Project of Apache Iceberg and moving the
>>>>>>>>>>>>>>> spec and reference implementation there (Logistically 
>>>>>>>>>>>>>>> complicated)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Creating a Sub-Project of Apache Spark and moving the spec
>>>>>>>>>>>>>>> and reference implementation there (Logistically complicated)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. Add Variant type vs. Variant and JSON types
>>>>>>>>>>>>>>> Those who were present were in favor of adding only the
>>>>>>>>>>>>>>> Variant type to Iceberg. We are looking for anyone who has an 
>>>>>>>>>>>>>>> objection to
>>>>>>>>>>>>>>> going forward with just the Variant Type and no Iceberg JSON 
>>>>>>>>>>>>>>> Type. We were
>>>>>>>>>>>>>>> favoring adding Variant type only because:
>>>>>>>>>>>>>>> * Introducing a JSON type would require engines that only
>>>>>>>>>>>>>>> support VARIANT to do write-time validation of their input to a
>>>>>>>>>>>>>>> JSON column. An engine that doesn’t have a JSON type wouldn’t be
>>>>>>>>>>>>>>> able to support this.
>>>>>>>>>>>>>>> * Engines which don’t support Variant will work most of the
>>>>>>>>>>>>>>> time but can have fallback strings defined in the spec for 
>>>>>>>>>>>>>>> reading
>>>>>>>>>>>>>>> unsupported types. Writing a JSON into a Variant will always 
>>>>>>>>>>>>>>> work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4. Support for Subcolumnization spec (shredding in Spark)
>>>>>>>>>>>>>>> We have no action items on this but would like to follow up
>>>>>>>>>>>>>>> on discussions on Subcolumnization in the future.
>>>>>>>>>>>>>>> * We had general agreement that this should be included in
>>>>>>>>>>>>>>> Iceberg V3 or else adding variant may not be useful.
>>>>>>>>>>>>>>> * We are interested in also adopting the shredding spec from
>>>>>>>>>>>>>>> Spark and would like to move it to whatever place we decided 
>>>>>>>>>>>>>>> the Variant
>>>>>>>>>>>>>>> spec is going to live.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let us know if we missed anything and if you have any
>>>>>>>>>>>>>>> additional thoughts or suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote:
>>>>>>>>>>>>>>> > Thanks for the discussion.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I will move forward to work on spec PR.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Regarding the implementation, we will have a module for
>>>>>>>>>>>>>>> > Variant support in Iceberg, so we will not have to bring in
>>>>>>>>>>>>>>> > Spark libraries.
>>>>>>>>>>>>>>> Spark libraries.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I'm reposting the meeting invite in case it's not clear in
>>>>>>>>>>>>>>> > my original email, since I included it at the end. Looks like
>>>>>>>>>>>>>>> > we don't have major objections/divergences, but let's sync up
>>>>>>>>>>>>>>> > and reach consensus.
>>>>>>>>>>>>>>> major objections/diverges but let's sync up and have consensus.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Meeting invite:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>>>>>>>> > Time zone: America/Los_Angeles
>>>>>>>>>>>>>>> > Google Meet joining info
>>>>>>>>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>>>>>>>> > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576 525‬#
>>>>>>>>>>>>>>> > More phone numbers:
>>>>>>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>> > Aihua
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote:
>>>>>>>>>>>>>>> > > I don't think this needs to hold up the PR but I think
>>>>>>>>>>>>>>> coming to a
>>>>>>>>>>>>>>> > > consensus on the exact set of types supported is
>>>>>>>>>>>>>>> worthwhile (and if the
>>>>>>>>>>>>>>> > > goal is to maintain the same set as specified by the
>>>>>>>>>>>>>>> Spark Variant type or
>>>>>>>>>>>>>>> > > if divergence is expected/allowed).  From a
>>>>>>>>>>>>>>> fragmentation perspective it
>>>>>>>>>>>>>>> > > would be a shame if they diverge, so maybe a next step is
>>>>>>>>>>>>>>> > > also suggesting to the Spark community that they support the
>>>>>>>>>>>>>>> > > missing existing Iceberg types?
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > Thanks,
>>>>>>>>>>>>>>> > > Micah
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <
>>>>>>>>>>>>>>> russell.spit...@gmail.com>
>>>>>>>>>>>>>>> > > wrote:
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR
>>>>>>>>>>>>>>> now. We can get
>>>>>>>>>>>>>>> > > > feedback there from everyone.
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue
>>>>>>>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get
>>>>>>>>>>>>>>> their feedback in
>>>>>>>>>>>>>>> > > >> parallel to getting the spec changes started. Piotr
>>>>>>>>>>>>>>> didn't seem to object
>>>>>>>>>>>>>>> > > >> to the encoding from what I read of his comments.
>>>>>>>>>>>>>>> Hopefully he (and others)
>>>>>>>>>>>>>>> > > >> chime in here.
>>>>>>>>>>>>>>> > > >>
>>>>>>>>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <
>>>>>>>>>>>>>>> > > >> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>> > > >>
>>>>>>>>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on
>>>>>>>>>>>>>>> board as
>>>>>>>>>>>>>>> > > >>> representatives of Flink and Trino engines. Also
>>>>>>>>>>>>>>> make sure we have anyone
>>>>>>>>>>>>>>> > > >>> else chime in who has experience with Ray if
>>>>>>>>>>>>>>> possible.
>>>>>>>>>>>>>>> > > >>>
>>>>>>>>>>>>>>> > > >>> Spec changes feel like the right next step.
>>>>>>>>>>>>>>> > > >>>
>>>>>>>>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue
>>>>>>>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>>>>>>>> > > >>> wrote:
>>>>>>>>>>>>>>> > > >>>
>>>>>>>>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal
>>>>>>>>>>>>>>> has been out for
>>>>>>>>>>>>>>> > > >>>> quite a while and I don't see any major objections
>>>>>>>>>>>>>>> to using the Spark
>>>>>>>>>>>>>>> > > >>>> encoding. It's quite well designed and fits the
>>>>>>>>>>>>>>> need well. It can also be
>>>>>>>>>>>>>>> > > >>>> extended to support additional types that are
>>>>>>>>>>>>>>> missing if that's a priority.
>>>>>>>>>>>>>>> > > >>>>
>>>>>>>>>>>>>>> > > >>>> Should we move forward by starting a draft of the
>>>>>>>>>>>>>>> changes to the table
>>>>>>>>>>>>>>> > > >>>> spec? Then we can vote on committing those changes
>>>>>>>>>>>>>>> and get moving on an
>>>>>>>>>>>>>>> > > >>>> implementation (or possibly do the implementation
>>>>>>>>>>>>>>> in parallel).
>>>>>>>>>>>>>>> > > >>>>
>>>>>>>>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <
>>>>>>>>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>> > > >>>>
>>>>>>>>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module.
>>>>>>>>>>>>>>> > > >>>>>
>>>>>>>>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue
>>>>>>>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>>>>>>>> > > >>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>
>>>>>>>>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land
>>>>>>>>>>>>>>> in parquet proper
>>>>>>>>>>>>>>> > > >>>>>> right?
>>>>>>>>>>>>>>> > > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it
>>>>>>>>>>>>>>> should end up.
>>>>>>>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module
>>>>>>>>>>>>>>> from it?
>>>>>>>>>>>>>>> > > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
>>>>>>>>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>> > > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land
>>>>>>>>>>>>>>> in parquet proper
>>>>>>>>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into
>>>>>>>>>>>>>>> Iceberg though for the time
>>>>>>>>>>>>>>> > > >>>>>>> being.
>>>>>>>>>>>>>>> > > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue
>>>>>>>>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua
>>>>>>>>>>>>>>> brought this up in his
>>>>>>>>>>>>>>> > > >>>>>>>> last email:
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>> > do we have an issue to directly use Spark
>>>>>>>>>>>>>>> implementation in
>>>>>>>>>>>>>>> > > >>>>>>>> Iceberg?
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the
>>>>>>>>>>>>>>> Spark library. What
>>>>>>>>>>>>>>> > > >>>>>>>> do you think about a Java implementation in
>>>>>>>>>>>>>>> Iceberg?
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>> Ryan
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <
>>>>>>>>>>>>>>> b...@databricks.com>
>>>>>>>>>>>>>>> > > >>>>>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in
>>>>>>>>>>>>>>> a comment on the doc
>>>>>>>>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13
>>>>>>>>>>>>>>> artifact that would be a much
>>>>>>>>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions
>>>>>>>>>>>>>>> of Spark, but I even then I
>>>>>>>>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to
>>>>>>>>>>>>>>> depend on that because it is a
>>>>>>>>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a
>>>>>>>>>>>>>>> ton of Scala libs. I think
>>>>>>>>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an
>>>>>>>>>>>>>>> independent implementation of the
>>>>>>>>>>>>>>> > > >>>>>>>>> spec in Iceberg.
>>>>>>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
>>>>>>>>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>> Hi Aihua,
>>>>>>>>>>>>>>> > > >>>>>>>>>> Long time no see :)
>>>>>>>>>>>>>>> > > >>>>>>>>>> Would this mean, that every engine which
>>>>>>>>>>>>>>> plans to support Variant
>>>>>>>>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency?
>>>>>>>>>>>>>>> Like Flink/Trino/Hive etc?
>>>>>>>>>>>>>>> > > >>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <
>>>>>>>>>>>>>>> aihu...@apache.org> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Thanks Ryan.
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to
>>>>>>>>>>>>>>> pursue Spark encoding to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> keep compatibility for the open source
>>>>>>>>>>>>>>> engines.
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding
>>>>>>>>>>>>>>> implementation: do we
>>>>>>>>>>>>>>> > > >>>>>>>>>>> have an issue to directly use Spark
>>>>>>>>>>>>>>> implementation in Iceberg? Russell
>>>>>>>>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark
>>>>>>>>>>>>>>> dependency and that could be a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> problem?
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Aihua
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua!
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the
>>>>>>>>>>>>>>> current doc is a good
>>>>>>>>>>>>>>> > > >>>>>>>>>>> one. I went
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and
>>>>>>>>>>>>>>> it looks like a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> better choice than
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly
>>>>>>>>>>>>>>> accessing nested
>>>>>>>>>>>>>>> > > >>>>>>>>>>> fields.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is
>>>>>>>>>>>>>>> that this is what
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Delta's variant
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in
>>>>>>>>>>>>>>> tables written by Delta
>>>>>>>>>>>>>>> > > >>>>>>>>>>> could be
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables
>>>>>>>>>>>>>>> without needing to rewrite
>>>>>>>>>>>>>>> > > >>>>>>>>>>> variant
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at
>>>>>>>>>>>>>>> Databricks and have an
>>>>>>>>>>>>>>> > > >>>>>>>>>>> interest in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.)
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > Ryan
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <
>>>>>>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > It’s great to be able to present the
>>>>>>>>>>>>>>> Variant type proposal
>>>>>>>>>>>>>>> > > >>>>>>>>>>> in the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking
>>>>>>>>>>>>>>> to host a meeting
>>>>>>>>>>>>>>> > > >>>>>>>>>>> next week
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go
>>>>>>>>>>>>>>> over any further
>>>>>>>>>>>>>>> > > >>>>>>>>>>> concerns about the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any
>>>>>>>>>>>>>>> other questions on the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> first phase of
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > the proposal
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > <
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is
>>>>>>>>>>>>>>> interested in the proposal
>>>>>>>>>>>>>>> > > >>>>>>>>>>> can either join
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > or reply with their comments so we can
>>>>>>>>>>>>>>> discuss them. Summary
>>>>>>>>>>>>>>> > > >>>>>>>>>>> of the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the
>>>>>>>>>>>>>>> mailing list for
>>>>>>>>>>>>>>> > > >>>>>>>>>>> further comment
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > there.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    -
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    What should be the underlying binary
>>>>>>>>>>>>>>> representation
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > doc, including ION, JSONB, and the Spark
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding. Choosing the underlying encoding
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > is an important first step here, and we
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > believe we have general support for Spark’s
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Variant encoding.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has
>>>>>>>>>>>>>>> strong opinions in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> this space.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    -
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    Should we support multiple logical
>>>>>>>>>>>>>>> types or just Variant?
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Variant vs.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    Variant + JSON.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > This is to discuss what logical data
>>>>>>>>>>>>>>> type(s) to be supported
>>>>>>>>>>>>>>> > > >>>>>>>>>>> in Iceberg -
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both
>>>>>>>>>>>>>>> types would share the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> same underlying
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding but would imply different
>>>>>>>>>>>>>>> limitations on engines
>>>>>>>>>>>>>>> > > >>>>>>>>>>> working with
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > those types.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > From the sync up meeting, we are more
>>>>>>>>>>>>>>> favoring toward
>>>>>>>>>>>>>>> > > >>>>>>>>>>> supporting Variant
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > only and we want to have a consensus on
>>>>>>>>>>>>>>> the supported
>>>>>>>>>>>>>>> > > >>>>>>>>>>> type(s).
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    -
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >    How should we move forward with
>>>>>>>>>>>>>>> Subcolumnization?
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for
>>>>>>>>>>>>>>> Variant type by
>>>>>>>>>>>>>>> > > >>>>>>>>>>> separating out
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This
>>>>>>>>>>>>>>> is not critical for
>>>>>>>>>>>>>>> > > >>>>>>>>>>> choosing the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so
>>>>>>>>>>>>>>> we were hoping to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> gain consensus on
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > leaving that for a follow up spec.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Thanks
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Aihua
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Meeting invite:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Video call link:
>>>>>>>>>>>>>>> https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > Or dial: ‪(US) +1 650-449-9343‬ PIN:
>>>>>>>>>>>>>>> ‪170 576 525‬#
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > More phone numbers:
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu
>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> Hello,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> <
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help
>>>>>>>>>>>>>>> review and comment.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> Thanks,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> Aihua
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack
>>>>>>>>>>>>>>> Ye <
>>>>>>>>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also
>>>>>>>>>>>>>>> had the same
>>>>>>>>>>>>>>> > > >>>>>>>>>>> discussion internally
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well
>>>>>>>>>>>>>>> with for example
>>>>>>>>>>>>>>> > > >>>>>>>>>>> the SUPER type in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> Redshift:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html
>>>>>>>>>>>>>>> ,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> and
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> can also provide better integration
>>>>>>>>>>>>>>> with the Trino JSON
>>>>>>>>>>>>>>> > > >>>>>>>>>>> type.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal!
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> Best,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> Jack Ye
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler
>>>>>>>>>>>>>>> Akidau
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang
>>>>>>>>>>>>>>> Wu <ust...@gmail.com>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just
>>>>>>>>>>>>>>> how many we need to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> look at;
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and
>>>>>>>>>>>>>>> Trino, but weren't sure
>>>>>>>>>>>>>>> > > >>>>>>>>>>> how much
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we
>>>>>>>>>>>>>>> > > >>>>>>>>>>> needed to go.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> There are some engines living
>>>>>>>>>>>>>>> outside the Java world. It
>>>>>>>>>>>>>>> > > >>>>>>>>>>> would be
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the
>>>>>>>>>>>>>>> effort it takes to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> integrate
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> variant type to them (e.g. velox,
>>>>>>>>>>>>>>> datafusion, etc.).
>>>>>>>>>>>>>>> > > >>>>>>>>>>> This is something
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> that
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> some proprietary iceberg vendors
>>>>>>>>>>>>>>> also care about.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to
>>>>>>>>>>>>>>> share some
>>>>>>>>>>>>>>> > > >>>>>>>>>>> perspective on this.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as
>>>>>>>>>>>>>>> there's a binary type
>>>>>>>>>>>>>>> > > >>>>>>>>>>> and Iceberg and
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that
>>>>>>>>>>>>>>> the binary column
>>>>>>>>>>>>>>> > > >>>>>>>>>>> needs to be
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that
>>>>>>>>>>>>>>> should be sufficient.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> From the perspective of
>>>>>>>>>>>>>>> interoperability, it would be
>>>>>>>>>>>>>>> > > >>>>>>>>>>> good to support
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> native
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> type from file specs. Life will be
>>>>>>>>>>>>>>> easier for projects
>>>>>>>>>>>>>>> > > >>>>>>>>>>> like Apache
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> XTable.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> File format could also provide
>>>>>>>>>>>>>>> finer-grained statistics
>>>>>>>>>>>>>>> > > >>>>>>>>>>> for variant
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> type which
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> facilitates data skipping.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be
>>>>>>>>>>>>>>> additional value in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> native file format
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight
>>>>>>>>>>>>>>> that it's not a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> strict requirement.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>> -Tyler
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> Gang
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM
>>>>>>>>>>>>>>> Tyler Akidau
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB!
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> -Tyler
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM
>>>>>>>>>>>>>>> Jean-Baptiste Onofré <
>>>>>>>>>>>>>>> > > >>>>>>>>>>> j...@nanthrax.net>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It
>>>>>>>>>>>>>>> reminds me our
>>>>>>>>>>>>>>> > > >>>>>>>>>>> discussions back in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :)
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty
>>>>>>>>>>>>>>> interesting. I remember
>>>>>>>>>>>>>>> > > >>>>>>>>>>> some discussions
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> about JSON datatype for spec v3.
>>>>>>>>>>>>>>> The binary data type
>>>>>>>>>>>>>>> > > >>>>>>>>>>> is already
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the
>>>>>>>>>>>>>>> proposal and happy to help
>>>>>>>>>>>>>>> > > >>>>>>>>>>> on this !
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Regards
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> JB
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM
>>>>>>>>>>>>>>> Tyler Akidau
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Hello,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk,
>>>>>>>>>>>>>>> Aihua) are working on a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> proposal for
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early
>>>>>>>>>>>>>>> feedback from the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> community. As you may know,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as
>>>>>>>>>>>>>>> its open Data Lake
>>>>>>>>>>>>>>> > > >>>>>>>>>>> format. Having made
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption
>>>>>>>>>>>>>>> of the Iceberg
>>>>>>>>>>>>>>> > > >>>>>>>>>>> standard, we’re now in a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> position where there are features
>>>>>>>>>>>>>>> not yet supported in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Iceberg which we
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our
>>>>>>>>>>>>>>> users, and that we
>>>>>>>>>>>>>>> > > >>>>>>>>>>> would like to discuss
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the
>>>>>>>>>>>>>>> Iceberg community.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d
>>>>>>>>>>>>>>> like to discuss are
>>>>>>>>>>>>>>> > > >>>>>>>>>>> in support of
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically
>>>>>>>>>>>>>>> typed,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> semi-structured data: variant data
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of
>>>>>>>>>>>>>>> variant columns. In
>>>>>>>>>>>>>>> > > >>>>>>>>>>> more detail, for
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be
>>>>>>>>>>>>>>> familiar:
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the
>>>>>>>>>>>>>>> efficient binary
>>>>>>>>>>>>>>> > > >>>>>>>>>>> encoding of dynamic
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON,
>>>>>>>>>>>>>>> Avro, etc. By
>>>>>>>>>>>>>>> > > >>>>>>>>>>> encoding semi-structured
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we
>>>>>>>>>>>>>>> retain the flexibility of
>>>>>>>>>>>>>>> > > >>>>>>>>>>> the source data,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to
>>>>>>>>>>>>>>> more efficiently
>>>>>>>>>>>>>>> > > >>>>>>>>>>> operate on the data.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the
>>>>>>>>>>>>>>> variant data type on
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Snowflake tables for many
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> years [1]. As more and more users
>>>>>>>>>>>>>>> utilize Iceberg
>>>>>>>>>>>>>>> > > >>>>>>>>>>> tables in Snowflake,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus
>>>>>>>>>>>>>>> of requests for
>>>>>>>>>>>>>>> > > >>>>>>>>>>> variant support.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines
>>>>>>>>>>>>>>> such as Apache Spark
>>>>>>>>>>>>>>> > > >>>>>>>>>>> have begun adding
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we
>>>>>>>>>>>>>>> believe it would be
>>>>>>>>>>>>>>> > > >>>>>>>>>>> beneficial to the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to
>>>>>>>>>>>>>>> standardize on the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> variant data type
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg
>>>>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
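The gist of that encoding, as a hedged sketch (the actual byte layout is
in the Spark README, linked as [2] below):

    // Two binary buffers: metadata holds a version byte plus a dictionary of
    // distinct object keys; value is a type-tagged tree that references keys
    // by dictionary index (simplified sketch, not the real byte format).
    record Variant(byte[] metadata, byte[] value) {}
    // e.g. {"event": "click", "count": 3} stores "event" and "count" once in
    // metadata, and the value buffer encodes each field with a type tag,
    // referencing its key by index.
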
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here
>>>>>>>>>>>>>>> is that, since an
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Apache OSS
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> version of variant encoding
>>>>>>>>>>>>>>> already exists in Spark,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> it likely makes sense
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding
>>>>>>>>>>>>>>> as the Iceberg
>>>>>>>>>>>>>>> > > >>>>>>>>>>> standard as well. The
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today
>>>>>>>>>>>>>>> in Snowflake is
>>>>>>>>>>>>>>> > > >>>>>>>>>>> slightly different, but
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see
>>>>>>>>>>>>>>> no particular value
>>>>>>>>>>>>>>> > > >>>>>>>>>>> in trying to clutter
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the space with another
>>>>>>>>>>>>>>> equivalent-but-incompatible
>>>>>>>>>>>>>>> > > >>>>>>>>>>> encoding.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant
>>>>>>>>>>>>>>> columns allows query
>>>>>>>>>>>>>>> > > >>>>>>>>>>> engines to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when
>>>>>>>>>>>>>>> subcolumns (i.e.,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> nested fields) within a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and
>>>>>>>>>>>>>>> also allows optionally
>>>>>>>>>>>>>>> > > >>>>>>>>>>> materializing some
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> of the nested fields as a column
>>>>>>>>>>>>>>> on their own,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> affording queries on these
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read
>>>>>>>>>>>>>>> less data and spend
>>>>>>>>>>>>>>> > > >>>>>>>>>>> less CPU on extraction.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system
>>>>>>>>>>>>>>> managing table
>>>>>>>>>>>>>>> > > >>>>>>>>>>> metadata and data tracks
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics
>>>>>>>>>>>>>>> (min, max, null, etc.)
>>>>>>>>>>>>>>> > > >>>>>>>>>>> for some subset of the
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant,
>>>>>>>>>>>>>>> and also manages any
>>>>>>>>>>>>>>> > > >>>>>>>>>>> optional
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> materialization. Without
>>>>>>>>>>>>>>> subcolumnarization, any query
>>>>>>>>>>>>>>> > > >>>>>>>>>>> which touches a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse,
>>>>>>>>>>>>>>> extract, and filter
>>>>>>>>>>>>>>> > > >>>>>>>>>>> every row for which
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> that column is non-null. Thus, by
>>>>>>>>>>>>>>> providing a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> standardized way of tracking
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for
>>>>>>>>>>>>>>> variant columns,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> Iceberg can make
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations
>>>>>>>>>>>>>>> accessible across various
>>>>>>>>>>>>>>> > > >>>>>>>>>>> catalogs and query
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> engines.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
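A minimal sketch of the pruning this enables, assuming per-file min/max
stats are tracked for a shredded subcolumn (hypothetical names only):

    class SubcolumnPruning {
      // Per-file stats for one shredded subcolumn, e.g. $.user_id as int64.
      record Stats(long min, long max) {}

      // For a predicate like variant_col:user_id = <literal>, the whole file
      // can be skipped without parsing any variant bytes when the literal
      // falls outside the subcolumn's range (only safe when every row of the
      // subcolumn was actually shredded).
      static boolean canSkipFile(Stats userId, long literal) {
        return literal < userId.min() || literal > userId.max();
      }
    }
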
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a
>>>>>>>>>>>>>>> non-trivial topic, so we
>>>>>>>>>>>>>>> > > >>>>>>>>>>> expect any
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not
>>>>>>>>>>>>>>> only the set of
>>>>>>>>>>>>>>> > > >>>>>>>>>>> changes to Iceberg
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible
>>>>>>>>>>>>>>> query engines to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> interoperate on
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for
>>>>>>>>>>>>>>> variant columns, but also
>>>>>>>>>>>>>>> > > >>>>>>>>>>> reference
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> documentation explaining
>>>>>>>>>>>>>>> subcolumnarization principles
>>>>>>>>>>>>>>> > > >>>>>>>>>>> and recommended best
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> practices.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo
>>>>>>>>>>>>>>> proposal [3] may be a
>>>>>>>>>>>>>>> > > >>>>>>>>>>> good starting
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so
>>>>>>>>>>>>>>> our plan is to
>>>>>>>>>>>>>>> > > >>>>>>>>>>> write something up in
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed
>>>>>>>>>>>>>>> spec changes,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> backwards compatibility,
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we
>>>>>>>>>>>>>>> wanted to first reach
>>>>>>>>>>>>>>> > > >>>>>>>>>>> out to the community
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the
>>>>>>>>>>>>>>> idea, and see if
>>>>>>>>>>>>>>> > > >>>>>>>>>>> there’s any early feedback
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we
>>>>>>>>>>>>>>> spend too much time on
>>>>>>>>>>>>>>> > > >>>>>>>>>>> a concrete proposal.
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you!
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [1]
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [2]
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [3]
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > --
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > Ryan Blue
>>>>>>>>>>>>>>> > > >>>>>>>>>>> > Databricks
>>>>>>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>> --
>>>>>>>>>>>>>>> > > >>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> > > >>>>>>>>> Databricks
>>>>>>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>> --
>>>>>>>>>>>>>>> > > >>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> > > >>>>>>>> Databricks
>>>>>>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>>
>>>>>>>>>>>>>>> > > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>> --
>>>>>>>>>>>>>>> > > >>>>>> Ryan Blue
>>>>>>>>>>>>>>> > > >>>>>> Databricks
>>>>>>>>>>>>>>> > > >>>>>>
>>>>>>>>>>>>>>> > > >>>>>
>>>>>>>>>>>>>>> > > >>>>
>>>>>>>>>>>>>>> > > >>>> --
>>>>>>>>>>>>>>> > > >>>> Ryan Blue
>>>>>>>>>>>>>>> > > >>>> Databricks
>>>>>>>>>>>>>>> > > >>>>
>>>>>>>>>>>>>>> > > >>>
>>>>>>>>>>>>>>> > > >>
>>>>>>>>>>>>>>> > > >> --
>>>>>>>>>>>>>>> > > >> Ryan Blue
>>>>>>>>>>>>>>> > > >> Databricks
>>>>>>>>>>>>>>> > > >>
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Databricks
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks
>>>>
>>>

-- 
Ryan Blue
Databricks
