Let me know if I understand correctly: basically the spec will not include any type promotion. E.g., if the chosen type for the subcolumn is int64, then only int64 values will be encoded in `typed_value`, while values of other types, including int32, will be encoded in `untyped_value`; similarly, if the chosen type is decimal(10, 5), decimal(10, 2) values will be encoded in `untyped_value`.
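If that reading is right, the dispatch rule is just an exact-match check with no widening. Here is a minimal sketch of that rule; the type tags and function name are hypothetical illustrations for this thread, not names from the spec:

```python
def shred_target(value_type: str, chosen_type: str) -> str:
    """Pick the physical column for a variant value under the
    no-promotion reading: a value is shredded into typed_value only
    when its type matches the chosen subcolumn type exactly; anything
    else, including a narrower integer or a decimal with a different
    scale, falls back to untyped_value."""
    return "typed_value" if value_type == chosen_type else "untyped_value"

# int64 subcolumn: int64 values are shredded; int32 is NOT promoted.
assert shred_target("int64", "int64") == "typed_value"
assert shred_target("int32", "int64") == "untyped_value"

# decimal(10, 5) subcolumn: decimal(10, 2) keeps its original scale
# by going to untyped_value instead of being rescaled to (10, 5).
assert shred_target("decimal(10, 2)", "decimal(10, 5)") == "untyped_value"
```

Under this reading, any widening (int32 to int64, rescaling decimals) would be an engine-side decision made before writing, not something the storage spec performs.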
This looks clean to me. On Fri, Jul 26, 2024 at 3:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > If we aren't optimizing for strict decimal behavior, then I think the >> cleanest option is to use `untyped_value` when decimal scales need to be >> preserved. I would also remove language from the shredding spec about >> storage modifying values so that this is an engine concern. The storage >> spec should state how you can encode values, without making recommendations >> about modifying those values. If an engine's semantics for variant allow it >> to modify value types, then that's up to the engine. > > > If I understand this correctly this also applies to int32/int64 > conversions? This sounds like a good solution to me as well (IIUC all we > are saying here is it is an engine decision on when it wants to widen types > for storage purposes and that might involve some loss in data fidelity?). > > Thanks, > Micah > > On Fri, Jul 26, 2024 at 9:55 AM Ryan Blue <b...@databricks.com.invalid> > wrote: > >> As a follow up, I talked with Russell quite a bit about losing types >> after the meeting and he convinced me that while there are valid use cases >> for strict decimal behavior, the majority of cases are either engines using >> decimal to keep track of the original number of digits in a number or >> people that simply want to limit the number of digits. In that case, I >> think the natural conclusion is that it should be _possible_ to have strict >> behavior but we should not increase complexity too much to optimize for it. >> >> If we aren't optimizing for strict decimal behavior, then I think the >> cleanest option is to use `untyped_value` when decimal scales need to be >> preserved. I would also remove language from the shredding spec about >> storage modifying values so that this is an engine concern. The storage >> spec should state how you can encode values, without making recommendations >> about modifying those values. 
If an engine's semantics for variant allow it >> to modify value types, then that's up to the engine. >> >> In the discussion, I wasn't the only person in favor of not modifying >> decimal scales, but I'm curious if this distinction satisfies everyone. If >> we remove the wording from the proposal that recommends modifying decimals >> and leave this to the engine, do we have agreement? >> >> On Thu, Jul 25, 2024 at 6:46 PM Aihua Xu <aihu...@gmail.com> wrote: >> >>> Hi community, >>> >>> Thanks for joining the meeting to discuss variant shredding. For those >>> who were unable to attend the meeting, please check out the recorded >>> meeting >>> <https://drive.google.com/file/d/1kiwv29nxxOqMCbxXn-NRoz-x2E9yIMlJ/view?usp=drive_link> >>> if >>> you are interested. Also, to follow up on the meeting and converge on the >>> lossiness discussion from shredding offline, I have converted the spark >>> shredding proposal by David into a google doc >>> <https://docs.google.com/document/d/1JeBt4NIju08jQ2AbludiK-U0M9ISIgysP7fUDWtv7rg/edit> >>> so >>> please comment. >>> >>> Thanks, >>> Aihua >>> >>> >>> On Thu, Jul 25, 2024 at 10:14 AM Aihua Xu <aihu...@gmail.com> wrote: >>> >>>> Yes. This time I was able to record it and I will share it when it’s >>>> processed. >>>> >>>> >>>> On Jul 25, 2024, at 10:01 AM, Amogh Jahagirdar <2am...@gmail.com> >>>> wrote: >>>> >>>> >>>> Any chance this meeting was recorded? I couldn't make it but would be >>>> interested in catching up on the discussion. >>>> >>>> Thanks, >>>> >>>> Amogh Jahagirdar >>>> >>>> On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote: >>>> >>>>> Thanks folks for the additional discussion. >>>>> >>>>> There are some questions related to subcolumnization (spark shredding >>>>> - see the discussion <https://github.com/apache/spark/pull/46831>) >>>>> and we would like to host another meeting to mainly discuss that since we >>>>> plan to adopt it. 
We can also follow up on the Spark variant topics (I can >>>>> see >>>>> that mostly we are aligned with the exception of finding a place for the spec >>>>> and implementation). Look forward to meeting with you. BTW: should I >>>>> include dev@iceberg.apache.org in the email invite? >>>>> >>>>> Sync up on Variant subcolumnization (shredding) >>>>> Thursday, July 25 · 8:00 – 9:00am >>>>> Time zone: America/Los_Angeles >>>>> Google Meet joining info >>>>> Video call link: https://meet.google.com/mug-dvnv-hnq >>>>> Or dial: (US) +1 904-900-0730 PIN: 671 997 419# >>>>> More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422 >>>>> >>>>> Thanks, >>>>> Aihua >>>>> >>>>> On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com> >>>>> wrote: >>>>> >>>>>> I'm late replying to this but I'm also in agreement with 1 (adopting >>>>>> the spark variant encoding), 3 (specifically only having a variant type), >>>>>> and 4 (ensuring we are thinking through subcolumnarization upfront since >>>>>> without it the variant type may not be that useful). >>>>>> >>>>>> I'd also support having the spec and reference implementation in >>>>>> Iceberg; as others have said, it centralizes improvements in a single, >>>>>> agnostic dependency for engines, rather than engines having to take >>>>>> dependencies on other engine modules. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Amogh Jahagirdar >>>>>> >>>>>> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry < >>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>> >>>>>>> I have been looking around at how we can map the Variant type in Flink. I >>>>>>> have not found any existing type which we could use, but Flink already >>>>>>> has >>>>>>> some JSON parsing capabilities [1] for string fields. >>>>>>> >>>>>>> So until we have native support in Flink for something similar to the >>>>>>> Variant type, I expect that we need to map it to JSON strings in >>>>>>> RowData. >>>>>>> >>>>>>> Based on that, here are my preferences: >>>>>>> 1. 
I'm ok with adapting Spark Variant type, if we build our own >>>>>>> Iceberg serializer/deserializer module for it >>>>>>> 2. I prefer to move the spec to Iceberg, so we own it, and extend >>>>>>> it, if needed. This could be important in the first phase. Later when >>>>>>> it is >>>>>>> more stable we might donate it to some other project, like Parquet >>>>>>> 3. I would prefer to support only a single type, and Variant is more >>>>>>> expressive, but having a standard way to convert between JSON and >>>>>>> Variant >>>>>>> would be useful for Flink users. >>>>>>> 4. On subcolumnarization: I think Flink will only use this feature >>>>>>> as much as the Iceberg readers implement this, so I would like to see as >>>>>>> much as possible of it in the common Iceberg code >>>>>>> >>>>>>> Thanks, >>>>>>> Peter >>>>>>> >>>>>>> [1] - >>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions >>>>>>> >>>>>>> >>>>>>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Sorry for the late reply. I agree with the sentiments on 1 and 3 >>>>>>>> that have already been posted (adopt the Spark encoding, and only have >>>>>>>> the >>>>>>>> Variant type). As mentioned on the doc for 3, I think it would be >>>>>>>> good to >>>>>>>> specify how to map scalar types to a JSON representation so there can >>>>>>>> be >>>>>>>> consistency between engines that don't support variant. >>>>>>>> >>>>>>>> >>>>>>>>> Regarding point 2, I also feel Iceberg is more natural to host >>>>>>>>> such a subproject for variant spec and implementation. But let me >>>>>>>>> reach out >>>>>>>>> to the Spark community to discuss. >>>>>>>> >>>>>>>> >>>>>>>> The only other place I can think of that might be a good home for >>>>>>>> Variant spec could be in Apache Arrow as a canonical extension type. >>>>>>>> There >>>>>>>> is an issue for this [1]. 
I think the main thing on where this is >>>>>>>> housed >>>>>>>> is which types are intended to be supported. I believe Arrow is >>>>>>>> currently >>>>>>>> a superset of the Iceberg type system (UUID is supported as a canonical >>>>>>>> extension type [2]). >>>>>>>> >>>>>>>> For point 4 subcolumnarization, I think ideally this belongs in >>>>>>>> Iceberg (and if Iceberg and Delta Lake can agree on how to do it that >>>>>>>> would >>>>>>>> be great) with potential consultation with Parquet/ORC communities to >>>>>>>> potentially add better native support. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Micah >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> [1] https://github.com/apache/arrow/issues/42069 >>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html >>>>>>>> >>>>>>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Thanks for the discussion and feedback. >>>>>>>>> >>>>>>>>> Do we have the consensus on point 1 and point 3 to move forward >>>>>>>>> with Spark variant encoding and support Variant type only? Or let me >>>>>>>>> know >>>>>>>>> how to proceed from here. >>>>>>>>> >>>>>>>>> Regarding point 2, I also feel Iceberg is more natural to host >>>>>>>>> such a subproject for variant spec and implementation. But let me >>>>>>>>> reach out >>>>>>>>> to the Spark community to discuss. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Aihua >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Agreed with point 1. >>>>>>>>>> >>>>>>>>>> For point 2, I also prefer to hold the spec and reference >>>>>>>>>> implementation under Iceberg. Here are the reasons: >>>>>>>>>> 1. It is unconventional and impractical for one engine to depend >>>>>>>>>> on another for data types. For instance, it is not ideal for Trino >>>>>>>>>> to rely >>>>>>>>>> on data types defined by the Spark engine. >>>>>>>>>> 2. Iceberg serves as a bridge between engines and file formats. 
>>>>>>>>>> By centralizing the specification in Iceberg, any future >>>>>>>>>> optimizations or >>>>>>>>>> updates to file formats can be referred to within Iceberg, ensuring >>>>>>>>>> consistency and reducing dependencies. >>>>>>>>>> >>>>>>>>>> For point 3, I'd prefer to support the variant type only at this >>>>>>>>>> moment. >>>>>>>>>> >>>>>>>>>> Yufei >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue >>>>>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>>> >>>>>>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support >>>>>>>>>>> only variant for point 3. >>>>>>>>>>> >>>>>>>>>>> We'll need to work with the Spark community to find a good place >>>>>>>>>>> for the library and spec, since it touches many different projects. >>>>>>>>>>> I'd >>>>>>>>>>> also prefer Iceberg as the home. >>>>>>>>>>> >>>>>>>>>>> I also think it's a good idea to get subcolumnarization into our >>>>>>>>>>> spec when we update. Without that I think the feature will be fairly >>>>>>>>>>> limited. >>>>>>>>>>> >>>>>>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer < >>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I'm aligned with point 1. >>>>>>>>>>>> >>>>>>>>>>>> For point 2 I think we should choose quickly; I honestly do >>>>>>>>>>>> think this would be fine as part of the Iceberg Spec directly but >>>>>>>>>>>> understand it may be better for the broader community if it were >>>>>>>>>>>> a sub >>>>>>>>>>>> project. As a sub-project I would still prefer it being an Iceberg >>>>>>>>>>>> Subproject since we are engine/file-format agnostic. >>>>>>>>>>>> >>>>>>>>>>>> 3. I support adding just Variant. >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hello community, >>>>>>>>>>>>> >>>>>>>>>>>>> It’s great to sync up with some of you on Variant and >>>>>>>>>>>>> SubColumnarization support in Iceberg again. 
Apologies that I >>>>>>>>>>>>> didn’t record >>>>>>>>>>>>> the meeting, but here are some key items that we want to follow up >>>>>>>>>>>>> with the >>>>>>>>>>>>> community. >>>>>>>>>>>>> >>>>>>>>>>>>> 1. Adopt Spark Variant encoding >>>>>>>>>>>>> Those present were in favor of adopting the Spark variant >>>>>>>>>>>>> encoding for Iceberg Variant with extensions to support other >>>>>>>>>>>>> Iceberg >>>>>>>>>>>>> types. We would like to know if anyone has an objection to this >>>>>>>>>>>>> reuse of an >>>>>>>>>>>>> open source encoding. >>>>>>>>>>>>> >>>>>>>>>>>>> 2. Movement of the Spark Variant Spec to another project >>>>>>>>>>>>> To avoid introducing Apache Spark as a dependency for the >>>>>>>>>>>>> engines and file formats, we discussed separating the Spark Variant >>>>>>>>>>>>> encoding >>>>>>>>>>>>> spec and implementation from the Spark Project to a neutral >>>>>>>>>>>>> location. We >>>>>>>>>>>>> thought up several solutions but didn’t have consensus on any of >>>>>>>>>>>>> them. We >>>>>>>>>>>>> are looking for more feedback on this topic from the community, >>>>>>>>>>>>> either in >>>>>>>>>>>>> terms of support for one of these options or another idea on how >>>>>>>>>>>>> to support >>>>>>>>>>>>> the spec. >>>>>>>>>>>>> >>>>>>>>>>>>> Options Proposed: >>>>>>>>>>>>> * Leave the Spec in Spark (Difficult for versioning and other >>>>>>>>>>>>> engines) >>>>>>>>>>>>> * Copying the Spec into Iceberg Project Directly (Difficult >>>>>>>>>>>>> for other Table Formats) >>>>>>>>>>>>> * Creating a Sub-Project of Apache Iceberg and moving the spec >>>>>>>>>>>>> and reference implementation there (Logistically complicated) >>>>>>>>>>>>> * Creating a Sub-Project of Apache Spark and moving the spec >>>>>>>>>>>>> and reference implementation there (Logistically complicated) >>>>>>>>>>>>> >>>>>>>>>>>>> 3. Add Variant type vs. Variant and JSON types >>>>>>>>>>>>> Those who were present were in favor of adding only the >>>>>>>>>>>>> Variant type to Iceberg. 
We are looking for anyone who has an >>>>>>>>>>>>> objection to >>>>>>>>>>>>> going forward with just the Variant Type and no Iceberg JSON >>>>>>>>>>>>> Type. We were >>>>>>>>>>>>> favoring adding the Variant type only because: >>>>>>>>>>>>> * Introducing a JSON type would require engines that only >>>>>>>>>>>>> support VARIANT to do write-time validation of their input to a >>>>>>>>>>>>> JSON >>>>>>>>>>>>> column. If they don’t have a JSON type an engine wouldn’t support >>>>>>>>>>>>> this. >>>>>>>>>>>>> * Engines which don’t support Variant will work most of the >>>>>>>>>>>>> time but can have fallback strings defined in the spec for reading >>>>>>>>>>>>> unsupported types. Writing a JSON into a Variant will always work. >>>>>>>>>>>>> >>>>>>>>>>>>> 4. Support for Subcolumnization spec (shredding in Spark) >>>>>>>>>>>>> We have no action items on this but would like to follow up on >>>>>>>>>>>>> discussions on Subcolumnization in the future. >>>>>>>>>>>>> * We had general agreement that this should be included in >>>>>>>>>>>>> Iceberg V3 or else adding variant may not be useful. >>>>>>>>>>>>> * We are interested in also adopting the shredding spec from >>>>>>>>>>>>> Spark and would like to move it to whatever place we decide the >>>>>>>>>>>>> Variant >>>>>>>>>>>>> spec is going to live. >>>>>>>>>>>>> >>>>>>>>>>>>> Let us know if we missed anything and if you have any additional >>>>>>>>>>>>> thoughts or suggestions. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Aihua >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote: >>>>>>>>>>>>> > Thanks for the discussion. >>>>>>>>>>>>> > >>>>>>>>>>>>> > I will move forward to work on the spec PR. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Regarding the implementation, we will have a module for >>>>>>>>>>>>> Variant support in Iceberg so we will not have to bring in Spark >>>>>>>>>>>>> libraries. 
>>>>>>>>>>>>> > >>>>>>>>>>>>> > I'm reposting the meeting invite in case it's not clear in >>>>>>>>>>>>> my original email since I included it at the end. Looks like we >>>>>>>>>>>>> don't have >>>>>>>>>>>>> major objections/divergences but let's sync up and have consensus. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Meeting invite: >>>>>>>>>>>>> > >>>>>>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>>>>>> > Time zone: America/Los_Angeles >>>>>>>>>>>>> > Google Meet joining info >>>>>>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq >>>>>>>>>>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>>>>>>>>>> > More phone numbers: >>>>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>>>>>> > >>>>>>>>>>>>> > Thanks, >>>>>>>>>>>>> > Aihua >>>>>>>>>>>>> > >>>>>>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >>>>>>>>>>>>> > > I don't think this needs to hold up the PR but I think >>>>>>>>>>>>> coming to a >>>>>>>>>>>>> > > consensus on the exact set of types supported is >>>>>>>>>>>>> worthwhile (and if the >>>>>>>>>>>>> > > goal is to maintain the same set as specified by the Spark >>>>>>>>>>>>> Variant type or >>>>>>>>>>>>> > > if divergence is expected/allowed). From a fragmentation >>>>>>>>>>>>> perspective it >>>>>>>>>>>>> > > would be a shame if they diverge, so maybe a next step is >>>>>>>>>>>>> also suggesting >>>>>>>>>>>>> > > support to the Spark community on the missing existing >>>>>>>>>>>>> Iceberg types? >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > Thanks, >>>>>>>>>>>>> > > Micah >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >>>>>>>>>>>>> russell.spit...@gmail.com> >>>>>>>>>>>>> > > wrote: >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR >>>>>>>>>>>>> now. We can get >>>>>>>>>>>>> > > > feedback there from everyone. 
>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >>>>>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>>>>> > > > wrote: >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get >>>>>>>>>>>>> their feedback in >>>>>>>>>>>>> > > >> parallel to getting the spec changes started. Piotr >>>>>>>>>>>>> didn't seem to object >>>>>>>>>>>>> > > >> to the encoding from what I read of his comments. >>>>>>>>>>>>> Hopefully he (and others) >>>>>>>>>>>>> > > >> chime in here. >>>>>>>>>>>>> > > >> >>>>>>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >>>>>>>>>>>>> > > >> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > > >> >>>>>>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on >>>>>>>>>>>>> board as >>>>>>>>>>>>> > > >>> representatives of Flink and Trino engines. Also make >>>>>>>>>>>>> sure we have anyone >>>>>>>>>>>>> > > >>> else chime in who has experience with Ray if possible. >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> > > >>> Spec changes feel like the right next step. >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >>>>>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>>>>> > > >>> wrote: >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal has >>>>>>>>>>>>> been out for >>>>>>>>>>>>> > > >>>> quite a while and I don't see any major objections to >>>>>>>>>>>>> using the Spark >>>>>>>>>>>>> > > >>>> encoding. It's quite well designed and fits the need >>>>>>>>>>>>> well. It can also be >>>>>>>>>>>>> > > >>>> extended to support additional types that are missing >>>>>>>>>>>>> if that's a priority. >>>>>>>>>>>>> > > >>>> >>>>>>>>>>>>> > > >>>> Should we move forward by starting a draft of the >>>>>>>>>>>>> changes to the table >>>>>>>>>>>>> > > >>>> spec? 
Then we can vote on committing those changes >>>>>>>>>>>>> and get moving on an >>>>>>>>>>>>> > > >>>> implementation (or possibly do the implementation in >>>>>>>>>>>>> parallel). >>>>>>>>>>>>> > > >>>> >>>>>>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >>>>>>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>> >>>>>>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module. >>>>>>>>>>>>> > > >>>>> >>>>>>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >>>>>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>>>>> > > >>>>> wrote: >>>>>>>>>>>>> > > >>>>> >>>>>>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land in >>>>>>>>>>>>> parquet proper >>>>>>>>>>>>> > > >>>>>> right? >>>>>>>>>>>>> > > >>>>>> >>>>>>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it >>>>>>>>>>>>> should end up. >>>>>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module from >>>>>>>>>>>>> it? >>>>>>>>>>>>> > > >>>>>> >>>>>>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >>>>>>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>>>> >>>>>>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land in >>>>>>>>>>>>> parquet proper >>>>>>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg >>>>>>>>>>>>> though for the time >>>>>>>>>>>>> > > >>>>>>> being. >>>>>>>>>>>>> > > >>>>>>> >>>>>>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >>>>>>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>>>>>> > > >>>>>>> >>>>>>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought >>>>>>>>>>>>> this up in his >>>>>>>>>>>>> > > >>>>>>>> last email: >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> > do we have an issue to directly use Spark >>>>>>>>>>>>> implementation in >>>>>>>>>>>>> > > >>>>>>>> Iceberg? 
>>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the >>>>>>>>>>>>> Spark library. What >>>>>>>>>>>>> > > >>>>>>>> do you think about a Java implementation in >>>>>>>>>>>>> Iceberg? >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> Ryan >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >>>>>>>>>>>>> b...@databricks.com> >>>>>>>>>>>>> > > >>>>>>>> wrote: >>>>>>>>>>>>> > > >>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in a >>>>>>>>>>>>> comment on the doc >>>>>>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact >>>>>>>>>>>>> that would be a much >>>>>>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions of >>>>>>>>>>>>> Spark, but even then I >>>>>>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to >>>>>>>>>>>>> depend on that because it is a >>>>>>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a >>>>>>>>>>>>> ton of Scala libs. I think >>>>>>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an >>>>>>>>>>>>> independent implementation of the >>>>>>>>>>>>> > > >>>>>>>>> spec in Iceberg. >>>>>>>>>>>>> > > >>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >>>>>>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>> Hi Aihua, >>>>>>>>>>>>> > > >>>>>>>>>> Long time no see :) >>>>>>>>>>>>> > > >>>>>>>>>> Would this mean that every engine which plans >>>>>>>>>>>>> to support Variant >>>>>>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? >>>>>>>>>>>>> Like Flink/Trino/Hive etc? 
>>>>>>>>>>>>> > > >>>>>>>>>> Thanks, Peter >>>>>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu < >>>>>>>>>>>>> aihu...@apache.org> wrote: >>>>>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> Thanks Ryan. >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue >>>>>>>>>>>>> Spark encoding to >>>>>>>>>>>>> > > >>>>>>>>>>> keep compatibility for the open source engines. >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding >>>>>>>>>>>>> implementation: do we >>>>>>>>>>>>> > > >>>>>>>>>>> have an issue to directly use Spark >>>>>>>>>>>>> implementation in Iceberg? Russell >>>>>>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark >>>>>>>>>>>>> dependency and that could be a >>>>>>>>>>>>> > > >>>>>>>>>>> problem? >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> Thanks, >>>>>>>>>>>>> > > >>>>>>>>>>> Aihua >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua! >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the >>>>>>>>>>>>> current doc is a good >>>>>>>>>>>>> > > >>>>>>>>>>> one. I went >>>>>>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it >>>>>>>>>>>>> looks like a >>>>>>>>>>>>> > > >>>>>>>>>>> better choice than >>>>>>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly >>>>>>>>>>>>> accessing nested >>>>>>>>>>>>> > > >>>>>>>>>>> fields. 
>>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that >>>>>>>>>>>>> this is what >>>>>>>>>>>>> > > >>>>>>>>>>> Delta's variant >>>>>>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables >>>>>>>>>>>>> written by Delta >>>>>>>>>>>>> > > >>>>>>>>>>> could be >>>>>>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without >>>>>>>>>>>>> needing to rewrite >>>>>>>>>>>>> > > >>>>>>>>>>> variant >>>>>>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks >>>>>>>>>>>>> and have an >>>>>>>>>>>>> > > >>>>>>>>>>> interest in >>>>>>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.) >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > Ryan >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >>>>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >>>>>>>>>>>>> > > >>>>>>>>>>> > wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > It’s great to be able to present the >>>>>>>>>>>>> Variant type proposal >>>>>>>>>>>>> > > >>>>>>>>>>> in the >>>>>>>>>>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking >>>>>>>>>>>>> to host a meeting >>>>>>>>>>>>> > > >>>>>>>>>>> next week >>>>>>>>>>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over >>>>>>>>>>>>> any further >>>>>>>>>>>>> > > >>>>>>>>>>> concerns about the >>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other >>>>>>>>>>>>> questions on the >>>>>>>>>>>>> > > >>>>>>>>>>> first phase of >>>>>>>>>>>>> > > >>>>>>>>>>> > > the proposal >>>>>>>>>>>>> > > >>>>>>>>>>> > > < >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>>>>>> > > >>>>>>>>>>> >. 
>>>>>>>>>>> >. >>>>>>>>>>> > > We are hoping that anyone who is >>>>>>>>>>> interested in the proposal >>>>>>>>>>> can either join >>>>>>>>>>> > > or reply with their comments so we can >>>>>>>>>>> discuss them. Summary >>>>>>>>>>> of the >>>>>>>>>>> > > discussion and notes will be sent to the >>>>>>>>>>> mailing list for >>>>>>>>>>> further comment >>>>>>>>>>> > > there. >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>>> > > >>>>>>>>>>> > > What should be the underlying binary >>>>>>>>>>> representation >>>>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the >>>>>>>>>>> doc including ION, >>>>>>>>>>> JSONB, and >>>>>>>>>>> > > Spark encoding. Choosing the underlying >>>>>>>>>>> encoding is an >>>>>>>>>>> important first step >>>>>>>>>>> > > here and we believe we have general >>>>>>>>>>> support for Spark’s >>>>>>>>>>> Variant encoding. >>>>>>>>>>> > > We would like to hear if anyone else has >>>>>>>>>>> strong opinions in >>>>>>>>>>> this space. >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>>> > > >>>>>>>>>>> > > Should we support multiple logical >>>>>>>>>>> types or just Variant? >>>>>>>>>>> Variant vs. >>>>>>>>>>> > > Variant + JSON. >>>>>>>>>>> > > >>>>>>>>>>> > > This is to discuss what logical data >>>>>>>>>>> type(s) to be supported >>>>>>>>>>> in Iceberg - >>>>>>>>>>> > > Variant only vs. Variant + JSON. 
Both >>>>>>>>>>>>> types would share the >>>>>>>>>>>>> > > >>>>>>>>>>> same underlying >>>>>>>>>>>>> > > >>>>>>>>>>> > > encoding but would imply different >>>>>>>>>>>>> limitations on engines >>>>>>>>>>>>> > > >>>>>>>>>>> working with >>>>>>>>>>>>> > > >>>>>>>>>>> > > those types. >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > From the sync up meeting, we are more >>>>>>>>>>>>> favoring toward >>>>>>>>>>>>> > > >>>>>>>>>>> supporting Variant >>>>>>>>>>>>> > > >>>>>>>>>>> > > only and we want to have a consensus on >>>>>>>>>>>>> the supported >>>>>>>>>>>>> > > >>>>>>>>>>> type(s). >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > How should we move forward with >>>>>>>>>>>>> Subcolumnization? >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for >>>>>>>>>>>>> Variant type by >>>>>>>>>>>>> > > >>>>>>>>>>> separating out >>>>>>>>>>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This >>>>>>>>>>>>> is not critical for >>>>>>>>>>>>> > > >>>>>>>>>>> choosing the >>>>>>>>>>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we >>>>>>>>>>>>> were hoping to >>>>>>>>>>>>> > > >>>>>>>>>>> gain consensus on >>>>>>>>>>>>> > > >>>>>>>>>>> > > leaving that for a follow up spec. 
>>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Thanks >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Aihua >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Meeting invite: >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >>>>>>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info >>>>>>>>>>>>> > > >>>>>>>>>>> > > Video call link: >>>>>>>>>>>>> https://meet.google.com/pbm-ovzn-aoq >>>>>>>>>>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 >>>>>>>>>>>>> 576 525# >>>>>>>>>>>>> > > >>>>>>>>>>> > > More phone numbers: >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >>>>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > >> Hello, >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal >>>>>>>>>>>>> > > >>>>>>>>>>> > >> < >>>>>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help review >>>>>>>>>>>>> and comment. >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >> Thanks, >>>>>>>>>>>>> > > >>>>>>>>>>> > >> Aihua >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >>>>>>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. 
We also had the same discussion internally, and a JSON type would really
play well with, for example, the SUPER type in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
and can also provide better integration with the Trino JSON type.

Looking forward to the proposal!

Best,
Jack Ye

On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
<tyler.aki...@snowflake.com.invalid> wrote:

On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:

>> We may need some guidance on just how many we need to look at;
>> we were planning on Spark and Trino, but weren't sure how much
>> further down the rabbit hole we needed to go.
>
> There are some engines living outside the Java world.
> It would be good if the proposal could cover the effort it takes to
> integrate the variant type into them (e.g. velox, datafusion, etc.).
> This is something that some proprietary Iceberg vendors also care about.

Ack, makes sense. We can make sure to share some perspective on this.

>> Not necessarily, no. As long as there's a binary type and Iceberg and
>> the query engines are aware that the binary column needs to be
>> interpreted as a variant, that should be sufficient.
>
> From the perspective of interoperability, it would be good to support a
> native type in the file specs. Life will be easier for projects like
> Apache XTable. The file format could also provide finer-grained
> statistics for the variant type, which facilitates data skipping.
Agreed, there can definitely be additional value in native file format
integration. Just wanted to highlight that it's not a strict requirement.

-Tyler

On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
<tyler.aki...@snowflake.com.invalid> wrote:

Good to see you again as well, JB! Thanks!

-Tyler

On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi Tyler,

Super happy to see you there :) It reminds me of our discussions back at
the start of Apache Beam :)

Anyway, the thread is pretty interesting. I remember some discussions
about a JSON datatype for spec v3.
The binary data type is already supported in the spec v2.

I'm looking forward to the proposal and happy to help on this!

Regards
JB

On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
<tyler.aki...@snowflake.com.invalid> wrote:

Hello,

We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which
we'd like to get early feedback from the community. As you may know,
Snowflake has embraced Iceberg as its open Data Lake format. Having made
good progress on our own adoption of the Iceberg standard, we're now in a
position where there are features not yet supported in Iceberg which we
think would be valuable for our users, and that we would like to discuss
with and help contribute to the Iceberg community.
The first two such features we'd like to discuss are in support of
efficient querying of dynamically typed, semi-structured data: variant
data types, and subcolumnarization of variant columns. In more detail,
for anyone who may not already be familiar:

1. Variant data types

Variant types allow for the efficient binary encoding of dynamic
semi-structured data such as JSON, Avro, etc. By encoding semi-structured
data as a variant column, we retain the flexibility of the source data
while allowing query engines to operate on the data more efficiently.
Snowflake has supported the variant data type on Snowflake tables for
many years [1]. As more and more users utilize Iceberg tables in
Snowflake, we're hearing an increasing chorus of requests for variant
support.
Additionally, other query engines such as Apache Spark have begun adding
variant support [2]. As such, we believe it would be beneficial to the
Iceberg community as a whole to standardize on the variant data type
encoding used across Iceberg tables.

One specific point to make here is that, since an Apache OSS version of
variant encoding already exists in Spark, it likely makes sense to simply
adopt the Spark encoding as the Iceberg standard as well. The encoding we
use internally today in Snowflake is slightly different, but essentially
equivalent, and we see no particular value in trying to clutter the space
with another equivalent-but-incompatible encoding.
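For readers unfamiliar with the Spark encoding referenced above [2], the basic idea is a split between a metadata buffer (a dictionary of field names) and a value buffer that refers to field names by dictionary id. A toy sketch of that split follows; this is not the actual byte layout, only an illustration of the concept, and the function name is invented for this sketch.

```python
# Toy illustration of the metadata/value split used by variant encodings:
# field names are collected once into a shared dictionary ("metadata"),
# and the value tree refers to them by id. This is NOT the real Spark
# byte format, only a sketch of the concept.
def encode_variant(obj):
    names = []  # metadata: field-name dictionary, shared across the value

    def enc(v):
        if isinstance(v, dict):
            out = {}
            for key, val in v.items():
                if key not in names:
                    names.append(key)
                out[names.index(key)] = enc(val)  # key replaced by dict id
            return ("object", out)
        if isinstance(v, list):
            return ("array", [enc(x) for x in v])
        return ("primitive", v)  # real encodings tag a physical type here

    return names, enc(obj)

metadata, value = encode_variant({"user": {"id": 1, "tags": ["a", "b"]}})
# metadata is the field-name dictionary: ["user", "id", "tags"]
```

Keeping field names out of the value buffer is what lets repeated keys across many rows stay cheap, and it is also why equivalent-but-incompatible encodings are worth avoiding: readers must agree on how dictionary ids resolve.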
2. Subcolumnarization

Subcolumnarization of variant columns allows query engines to efficiently
prune datasets when subcolumns (i.e., nested fields) within a variant
column are queried, and also allows optionally materializing some of the
nested fields as columns of their own, affording queries on these
subcolumns the ability to read less data and spend less CPU on
extraction. When subcolumnarizing, the system managing table metadata and
data tracks individual pruning statistics (min, max, null, etc.) for some
subset of the nested fields within a variant, and also manages any
optional materialization. Without subcolumnarization, any query that
touches a variant column must read, parse, extract, and filter every row
for which that column is non-null.
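As a rough illustration of the pruning-statistics idea described above: for each tracked nested field, a writer collects min/max/null-count per file, and a scan can then skip files whose stats rule out a predicate. The helper names and stats layout here are invented for this sketch; the real metadata shape is exactly what a concrete proposal would need to define.

```python
# Illustrative per-file pruning stats for variant subcolumns: track
# min/max/null_count for chosen nested paths so a scan can skip files.
# Names and layout are hypothetical, not from any Iceberg spec.
def collect_stats(rows, paths):
    stats = {p: {"min": None, "max": None, "null_count": 0} for p in paths}
    for row in rows:
        for p in paths:
            v = row
            for part in p.split("."):  # walk the nested path, e.g. "a.b"
                v = v.get(part) if isinstance(v, dict) else None
                if v is None:
                    break
            s = stats[p]
            if v is None:
                s["null_count"] += 1
            else:
                s["min"] = v if s["min"] is None else min(s["min"], v)
                s["max"] = v if s["max"] is None else max(s["max"], v)
    return stats

def can_skip(stats, path, op, literal):
    """True if the file provably has no row matching `path OP literal`."""
    s = stats[path]
    if s["min"] is None:  # field is null in every row: nothing can match
        return True
    if op == ">":
        return s["max"] <= literal
    if op == "<":
        return s["min"] >= literal
    if op == "==":
        return literal < s["min"] or literal > s["max"]
    return False  # unknown operator: never skip
```

Note that without stats like these for nested fields, the engine falls back to the read-parse-extract-filter path described above for every non-null row.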
Thus, by providing a standardized way of tracking subcolumn metadata and
data for variant columns, Iceberg can make subcolumnar optimizations
accessible across various catalogs and query engines.

Subcolumnarization is a non-trivial topic, so we expect any concrete
proposal to include not only the set of changes to Iceberg metadata that
allow compatible query engines to interoperate on subcolumnarization data
for variant columns, but also reference documentation explaining
subcolumnarization principles and recommended best practices.

It sounds like the recent Geo proposal [3] may be a good starting point
for how to approach this, so our plan is to write something up in that
vein that covers the proposed spec changes, backwards compatibility,
implementor burdens, etc.
But we wanted to first reach out to the community to introduce ourselves
and the idea, and see if there's any early feedback we should incorporate
before we spend too much time on a concrete proposal.

Thank you!

[1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
[2] https://github.com/apache/spark/blob/master/common/variant/README.md
[3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit

-Tyler, Nileema, Selcuk, Aihua