Similarly, I'm aligned with point 1, and I'd choose to support only Variant for point 3.
We'll need to work with the Spark community to find a good place for the library and spec, since it touches many different projects. I'd also prefer Iceberg as the home. I also think it's a good idea to get subcolumnarization into our spec when we update. Without that I think the feature will be fairly limited. On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote: > I'm aligned with point 1. > > For point 2 I think we should choose quickly. I honestly do think this > would be fine as part of the Iceberg Spec directly but understand it may be > better for the broader community if it were a sub-project. As a > sub-project I would still prefer it being an Iceberg Subproject since we > are engine/file-format agnostic. > > 3. I support adding just Variant. > > On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote: > >> Hello community, >> >> It’s great to sync up with some of you on Variant and Subcolumnarization >> support in Iceberg again. Apologies that I didn’t record the meeting, but >> here are some key items that we want to follow up on with the community. >> >> 1. Adopt Spark Variant encoding >> Those present were in favor of adopting the Spark variant encoding for >> Iceberg Variant, with extensions to support other Iceberg types. We would >> like to know if anyone has an objection to reusing this open source >> encoding. >> >> 2. Movement of the Spark Variant Spec to another project >> To avoid introducing Apache Spark as a dependency for the engines and >> file formats, we discussed separating the Spark Variant encoding spec and >> implementation from the Spark Project into a neutral location. We thought up >> several solutions but didn’t have consensus on any of them. We are looking >> for more feedback on this topic from the community, either in terms of >> support for one of these options or another idea on how to support the spec. 
>> >> Options Proposed: >> * Leave the Spec in Spark (Difficult for versioning and other engines) >> * Copy the Spec into the Iceberg Project directly (Difficult for other >> Table Formats) >> * Create a Sub-Project of Apache Iceberg and move the spec and >> reference implementation there (Logistically complicated) >> * Create a Sub-Project of Apache Spark and move the spec and >> reference implementation there (Logistically complicated) >> >> 3. Add Variant type vs. Variant and JSON types >> Those who were present were in favor of adding only the Variant type to >> Iceberg. We are looking for anyone who has an objection to going forward >> with just the Variant Type and no Iceberg JSON Type. We favored >> adding the Variant type only because: >> * Introducing a JSON type would require engines that only support VARIANT >> to do write-time validation of their input to a JSON column. An engine >> without a JSON type wouldn’t be able to support this. >> * Engines which don’t support Variant will work most of the time, and can >> use fallback strings defined in the spec for reading unsupported types. >> Writing a JSON into a Variant will always work. >> >> 4. Support for Subcolumnarization spec (shredding in Spark) >> We have no action items on this but would like to follow up on >> discussions of Subcolumnarization in the future. >> * We had general agreement that this should be included in Iceberg V3, or >> else adding Variant may not be very useful. >> * We are also interested in adopting the shredding spec from Spark and >> would like to move it to wherever we decide the Variant spec will live. >> >> Let us know if we missed anything and if you have any additional thoughts or >> suggestions. >> >> Thanks >> Aihua >> >> >> On 2024/07/15 18:32:22 Aihua Xu wrote: >> > Thanks for the discussion. >> > >> > I will move forward to work on the spec PR. 
>> > >> > Regarding the implementation, we will have a module for Variant support >> in Iceberg so we will not have to bring in Spark libraries. >> > >> > I'm reposting the meeting invite in case it's not clear in my original >> email since I included it at the end. Looks like we don't have major >> objections/divergences, but let's sync up and reach consensus. >> > >> > Meeting invite: >> > >> > Wednesday, July 17 · 9:00 – 10:00am >> > Time zone: America/Los_Angeles >> > Google Meet joining info >> > Video call link: https://meet.google.com/pbm-ovzn-aoq >> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >> > >> > Thanks, >> > Aihua >> > >> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >> > > I don't think this needs to hold up the PR, but I think coming to a >> > > consensus on the exact set of types supported is worthwhile (and whether >> the >> > > goal is to maintain the same set as specified by the Spark Variant >> type or >> > > if divergence is expected/allowed). From a fragmentation perspective >> it >> > > would be a shame if they diverge, so maybe a next step is also >> suggesting >> > > to the Spark community that they add support for the missing existing Iceberg types? >> > > >> > > Thanks, >> > > Micah >> > > >> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >> russell.spit...@gmail.com> >> > > wrote: >> > > >> > > > Just talked with Aihua and he's working on the Spec PR now. We can >> get >> > > > feedback there from everyone. >> > > > >> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >> <b...@databricks.com.invalid> >> > > > wrote: >> > > > >> > > >> Good idea, but I'm hoping that we can continue to get their >> feedback in >> > > >> parallel to getting the spec changes started. Piotr didn't seem to >> object >> > > >> to the encoding from what I read of his comments. Hopefully he >> (and others) >> > > >> chime in here. 
>> > > >> >> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >> > > >> russell.spit...@gmail.com> wrote: >> > > >> >> > > >>> I just want to make sure we get Piotr and Peter on board as >> > > >>> representatives of Flink and Trino engines. Also make sure we >> have anyone >> > > >>> else chime in who has experience with Ray if possible. >> > > >>> >> > > >>> Spec changes feel like the right next step. >> > > >>> >> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >> <b...@databricks.com.invalid> >> > > >>> wrote: >> > > >>> >> > > >>>> Okay, what are the next steps here? This proposal has been out >> for >> > > >>>> quite a while and I don't see any major objections to using the >> Spark >> > > >>>> encoding. It's quite well designed and fits the need well. It >> can also be >> > > >>>> extended to support additional types that are missing if that's >> a priority. >> > > >>>> >> > > >>>> Should we move forward by starting a draft of the changes to the >> table >> > > >>>> spec? Then we can vote on committing those changes and get >> moving on an >> > > >>>> implementation (or possibly do the implementation in parallel). >> > > >>>> >> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >> > > >>>> russell.spit...@gmail.com> wrote: >> > > >>>> >> > > >>>>> That's fair, I'm sold on an Iceberg Module. >> > > >>>>> >> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >> <b...@databricks.com.invalid> >> > > >>>>> wrote: >> > > >>>>> >> > > >>>>>> > Feels like eventually the encoding should land in parquet >> proper >> > > >>>>>> right? >> > > >>>>>> >> > > >>>>>> What about using it in ORC? I don't know where it should end >> up. >> > > >>>>>> Maybe Iceberg should make a standalone module from it? >> > > >>>>>> >> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >> > > >>>>>> russell.spit...@gmail.com> wrote: >> > > >>>>>> >> > > >>>>>>> Feels like eventually the encoding should land in parquet >> proper >> > > >>>>>>> right? 
I'm fine with us just copying it into Iceberg though for >> the time >> > > >>>>>>> being. >> > > >>>>>>> >> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >> > > >>>>>>> <b...@databricks.com.invalid> wrote: >> > > >>>>>>> >> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this up in >> his >> > > >>>>>>>> last email: >> > > >>>>>>>> >> > > >>>>>>>> > do we have an issue to directly use Spark implementation in >> > > >>>>>>>> Iceberg? >> > > >>>>>>>> >> > > >>>>>>>> Yes, I think that we do have an issue using the Spark >> library. What >> > > >>>>>>>> do you think about a Java implementation in Iceberg? >> > > >>>>>>>> >> > > >>>>>>>> Ryan >> > > >>>>>>>> >> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >> b...@databricks.com> >> > > >>>>>>>> wrote: >> > > >>>>>>>> >> > > >>>>>>>>> I raised the same point from Peter's email in a comment on >> the doc >> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact that would >> be a much >> > > >>>>>>>>> smaller scope than relying on large portions of Spark, but >> even then I >> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend on that >> because it is a >> > > >>>>>>>>> Scala artifact and we would need to bring in a ton of Scala >> libs. I think >> > > >>>>>>>>> what makes the most sense is to have an independent >> implementation of the >> > > >>>>>>>>> spec in Iceberg. >> > > >>>>>>>>> >> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >> > > >>>>>>>>> >> > > >>>>>>>>>> Hi Aihua, >> > > >>>>>>>>>> Long time no see :) >> > > >>>>>>>>>> Would this mean that every engine which plans to support >> the Variant >> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like >> Flink/Trino/Hive etc? >> > > >>>>>>>>>> Thanks, Peter >> > > >>>>>>>>>> >> > > >>>>>>>>>> >> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> >> wrote: >> > > >>>>>>>>>> >> > > >>>>>>>>>>> Thanks Ryan. 
>> > > >>>>>>>>>>> >> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue Spark >> encoding to >> > > >>>>>>>>>>> keep compatibility for the open source engines. >> > > >>>>>>>>>>> >> > > >>>>>>>>>>> One more question regarding the encoding implementation: >> do we >> > > >>>>>>>>>>> have an issue to directly use Spark implementation in >> Iceberg? Russell >> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark dependency and >> that could be a >> > > >>>>>>>>>>> problem? >> > > >>>>>>>>>>> >> > > >>>>>>>>>>> Thanks, >> > > >>>>>>>>>>> Aihua >> > > >>>>>>>>>>> >> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >> > > >>>>>>>>>>> > Thanks, Aihua! >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > I think that the encoding choice in the current doc is >> a good >> > > >>>>>>>>>>> one. I went >> > > >>>>>>>>>>> > through the Spark encoding in detail and it looks like a >> > > >>>>>>>>>>> better choice than >> > > >>>>>>>>>>> > the other candidate encodings for quickly accessing >> nested >> > > >>>>>>>>>>> fields. >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > Another reason to use the Spark type is that this is >> what >> > > >>>>>>>>>>> Delta's variant >> > > >>>>>>>>>>> > type is based on, so Parquet files in tables written by >> Delta >> > > >>>>>>>>>>> could be >> > > >>>>>>>>>>> > converted or used in Iceberg tables without needing to >> rewrite >> > > >>>>>>>>>>> variant >> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and have an >> > > >>>>>>>>>>> interest in >> > > >>>>>>>>>>> > increasing format compatibility.) 
>> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > Ryan >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >> > > >>>>>>>>>>> > wrote: >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > It was great to be able to present the Variant type >> proposal >> > > >>>>>>>>>>> in the >> > > >>>>>>>>>>> > > community sync yesterday, and I’m looking to host a >> meeting >> > > >>>>>>>>>>> next week >> > > >>>>>>>>>>> > > (targeting 9am, July 17th) to go over any further >> > > >>>>>>>>>>> concerns about the >> > > >>>>>>>>>>> > > encoding of the Variant type and any other questions >> on the >> > > >>>>>>>>>>> first phase of >> > > >>>>>>>>>>> > > the proposal >> > > >>>>>>>>>>> > > < >> > > >>>>>>>>>>> >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> > > >>>>>>>>>>> >. >> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in the >> proposal >> > > >>>>>>>>>>> can either join >> > > >>>>>>>>>>> > > or reply with their comments so we can discuss them. >> A summary >> > > >>>>>>>>>>> of the >> > > >>>>>>>>>>> > > discussion and notes will be sent to the mailing list >> for >> > > >>>>>>>>>>> further comment >> > > >>>>>>>>>>> > > there. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > - >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > What should be the underlying binary representation >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc, >> including ION, >> > > >>>>>>>>>>> JSONB, and >> > > >>>>>>>>>>> > > the Spark encoding. Choosing the underlying encoding is an >> > > >>>>>>>>>>> important first step >> > > >>>>>>>>>>> > > here, and we believe we have general support for >> Spark’s >> > > >>>>>>>>>>> Variant encoding. 
>> > > >>>>>>>>>>> > > We would like to hear if anyone else has strong >> opinions in >> > > >>>>>>>>>>> this space. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > - >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Should we support multiple logical types or just >> Variant? >> > > >>>>>>>>>>> Variant vs. >> > > >>>>>>>>>>> > > Variant + JSON. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > This is to discuss which logical data type(s) should be >> supported >> > > >>>>>>>>>>> in Iceberg - >> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types would >> share the >> > > >>>>>>>>>>> same underlying >> > > >>>>>>>>>>> > > encoding but would imply different limitations on >> engines >> > > >>>>>>>>>>> working with >> > > >>>>>>>>>>> > > those types. >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > From the sync-up meeting, we favor >> > > >>>>>>>>>>> supporting Variant >> > > >>>>>>>>>>> > > only, and we want to reach consensus on the supported >> > > >>>>>>>>>>> type(s). >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > - >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > How should we move forward with Subcolumnarization? >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Subcolumnarization is an optimization for the Variant type >> that >> > > >>>>>>>>>>> separates out >> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is not >> critical for >> > > >>>>>>>>>>> choosing the >> > > >>>>>>>>>>> > > initial encoding of the Variant type, so we were >> hoping to >> > > >>>>>>>>>>> gain consensus on >> > > >>>>>>>>>>> > > leaving that for a follow-up spec. 
>> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Thanks >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Aihua >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Meeting invite: >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >> > > >>>>>>>>>>> > > Google Meet joining info >> > > >>>>>>>>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq >> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >> > > >>>>>>>>>>> > > More phone numbers: >> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >> > > >>>>>>>>>>> > > >> > > >>>>>>>>>>> > >> Hello, >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >> We have drafted the proposal >> > > >>>>>>>>>>> > >> < >> > > >>>>>>>>>>> >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> > > >>>>>>>>>>> > >> > > >>>>>>>>>>> > >> for Variant data type. Please help review and >> comment. >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >> Thanks, >> > > >>>>>>>>>>> > >> Aihua >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >> > > >>>>>>>>>>> > >> >> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same >> > > >>>>>>>>>>> discussion internally >> > > >>>>>>>>>>> > >>> and a JSON type would really play well with for >> example >> > > >>>>>>>>>>> the SUPER type in >> > > >>>>>>>>>>> > >>> Redshift: >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> >> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >> > > >>>>>>>>>>> and >> > > >>>>>>>>>>> > >>> can also provide better integration with the Trino >> JSON >> > > >>>>>>>>>>> type. >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> Looking forward to the proposal! 
>> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> Best, >> > > >>>>>>>>>>> > >>> Jack Ye >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >> > > >>>>>>>>>>> > >>> >> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu < ust...@gmail.com> >> > > >>>>>>>>>>> wrote: >> > > >>>>>>>>>>> > >>>> >> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many we >> need to >> > > >>>>>>>>>>> look at; >> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but >> weren't sure >> > > >>>>>>>>>>> how much >> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed to go. >> > > >>>>>>>>>>> > >>>>> >> > > >>>>>>>>>>> > >>>>> There are some engines living outside the Java >> world. It >> > > >>>>>>>>>>> would be >> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the effort it >> takes to >> > > >>>>>>>>>>> integrate >> > > >>>>>>>>>>> > >>>>> the variant type into them (e.g. velox, datafusion, >> etc.). >> > > >>>>>>>>>>> This is something >> > > >>>>>>>>>>> > >>>>> that >> > > >>>>>>>>>>> > >>>>> some proprietary Iceberg vendors also care about. >> > > >>>>>>>>>>> > >>>>> >> > > >>>>>>>>>>> > >>>> >> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some >> > > >>>>>>>>>>> perspective on this. >> > > >>>>>>>>>>> > >>>> >> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a binary >> type >> > > >>>>>>>>>>> and Iceberg and >> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the binary >> column >> > > >>>>>>>>>>> needs to be >> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be >> sufficient. >> > > >>>>>>>>>>> > >>>>> >> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it >> would be >> > > >>>>>>>>>>> good to support >> > > >>>>>>>>>>> a native >> > > >>>>>>>>>>> > >>>>> type in the file specs. 
Life will be easier for >> projects >> > > >>>>> like Apache >> > > >>>>> XTable. >> > > >>>>> File formats could also provide finer-grained >> statistics >> > > >>>>> for the variant >> > > >>>>> type, which >> > > >>>>> facilitates data skipping. >> > > >>>>> >> > > >>>> >> > > >>>> Agreed, there can definitely be additional value in >> > > >>>> native file format >> > > >>>> integration. Just wanted to highlight that it's >> not a >> > > >>>> strict requirement. >> > > >>>> >> > > >>>> -Tyler >> > > >>>> >> > > >>>> >> > > >>>>> >> > > >>>>> Gang >> > > >>>>> >> > > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau >> > > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > > >>>>> >> > > >>>>>> Good to see you again as well, JB! Thanks! >> > > >>>>>> >> > > >>>>>> -Tyler >> > > >>>>>> >> > > >>>>>> >> > > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste >> Onofré < >> > > >>>>>> j...@nanthrax.net> >> > > >>>>>> wrote: >> > > >>>>>> >> > > >>>>>>> Hi Tyler, >> > > >>>>>>> >> > > >>>>>>> Super happy to see you there :) It reminds me of our >> > > >>>>>>> discussions back at >> > > >>>>>>> the start of Apache Beam :) >> > > >>>>>>> >> > > >>>>>>> Anyway, the thread is pretty interesting. I >> remember >> > > >>>>>>> some discussions >> > > >>>>>>> about a JSON datatype for spec v3. The binary >> data type >> > > >>>>>>> is already >> > > >>>>>>> supported in spec v2. 
>> > > >>>>>>>>>>> > >>>>>>> >> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and happy >> to help >> > > >>>>>>>>>>> on this! >> > > >>>>>>>>>>> > >>>>>>> >> > > >>>>>>>>>>> > >>>>>>> Regards >> > > >>>>>>>>>>> > >>>>>>> JB >> > > >>>>>>>>>>> > >>>>>>> >> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau >> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > Hello, >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are >> working on a >> > > >>>>>>>>>>> proposal for >> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback from the >> > > >>>>>>>>>>> community. As you may know, >> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its open Data >> Lake >> > > >>>>>>>>>>> format. Having made >> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the Iceberg >> > > >>>>>>>>>>> standard, we’re now in a >> > > >>>>>>>>>>> > >>>>>>> position where there are features not yet >> supported in >> > > >>>>>>>>>>> Iceberg which we >> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users, and that >> we >> > > >>>>>>>>>>> would like to discuss >> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg >> community. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like to >> discuss are >> > > >>>>>>>>>>> in support of >> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed, >> > > >>>>>>>>>>> semi-structured data: variant data >> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant >> columns. In >> > > >>>>>>>>>>> more detail, for >> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar: >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > 1. 
Variant data types >> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient binary >> > > >>>>>>>>>>> encoding of dynamic >> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON, Avro, etc. By >> > > >>>>>>>>>>> encoding semi-structured >> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the >> flexibility of >> > > >>>>>>>>>>> the source data, >> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to more efficiently >> > > >>>>>>>>>>> operate on the data. >> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant data type on >> > > >>>>>>>>>>> Snowflake tables for many >> > > >>>>>>>>>>> > >>>>>>> years [1]. As more and more users utilize >> Iceberg >> > > >>>>>>>>>>> tables in Snowflake, >> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of requests >> for >> > > >>>>>>>>>>> variant support. >> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such as >> Apache Spark >> > > >>>>>>>>>>> have begun adding >> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we believe it >> would be >> > > >>>>>>>>>>> beneficial to the >> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to standardize on >> the >> > > >>>>>>>>>>> variant data type >> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, >> since an >> > > >>>>>>>>>>> Apache OSS >> > > >>>>>>>>>>> > >>>>>>> version of variant encoding already exists in >> Spark, >> > > >>>>>>>>>>> it likely makes sense >> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as the >> Iceberg >> > > >>>>>>>>>>> standard as well. 
The >> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today in Snowflake is >> > > >>>>>>>>>>> slightly different, but >> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no >> particular value >> > > >>>>>>>>>>> in trying to clutter >> > > >>>>>>>>>>> > >>>>>>> the space with another >> equivalent-but-incompatible >> > > >>>>>>>>>>> encoding. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization >> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns allows >> query >> > > >>>>>>>>>>> engines to >> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when subcolumns >> (i.e., >> > > >>>>>>>>>>> nested fields) within a >> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and also allows >> optionally >> > > >>>>>>>>>>> materializing some >> > > >>>>>>>>>>> > >>>>>>> of the nested fields as a column on their own, >> > > >>>>>>>>>>> affording queries on these >> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less data and >> spend >> > > >>>>>>>>>>> less CPU on extraction. >> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system managing table >> > > >>>>>>>>>>> metadata and data tracks >> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min, max, null, >> etc.) >> > > >>>>>>>>>>> for some subset of the >> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and also >> manages any >> > > >>>>>>>>>>> optional >> > > >>>>>>>>>>> > >>>>>>> materialization. Without subcolumnarization, >> any query >> > > >>>>>>>>>>> which touches a >> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse, extract, and >> filter >> > > >>>>>>>>>>> every row for which >> > > >>>>>>>>>>> > >>>>>>> that column is non-null. 
Thus, by providing a >> > > >>>>>>>>>>> standardized way of tracking >> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant columns, >> > > >>>>>>>>>>> Iceberg can make >> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible across >> various >> > > >>>>>>>>>>> catalogs and query >> > > >>>>>>>>>>> > >>>>>>> engines. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so >> we >> > > >>>>>>>>>>> expect any >> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only the set of >> > > >>>>>>>>>>> changes to Iceberg >> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query engines to >> > > >>>>>>>>>>> interoperate on >> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant columns, >> but also >> > > >>>>>>>>>>> reference >> > > >>>>>>>>>>> > >>>>>>> documentation explaining subcolumnarization >> principles >> > > >>>>>>>>>>> and recommended best >> > > >>>>>>>>>>> > >>>>>>> practices. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] >> may be a >> > > >>>>>>>>>>> good starting >> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our plan is >> to >> > > >>>>>>>>>>> write something up in >> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec changes, >> > > >>>>>>>>>>> backwards compatibility, >> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted to >> first reach >> > > >>>>>>>>>>> out to the community >> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea, and see if >> > > >>>>>>>>>>> there’s any early feedback >> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend too much >> time on >> > > >>>>>>>>>>> a concrete proposal. >> > > >>>>>>>>>>> > >>>>>>> > >> > > >>>>>>>>>>> > >>>>>>> > Thank you! 
>> > > >>>>>>>>>>> > >>>>>>> >
>> > > >>>>>>>>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>> > > >>>>>>>>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>> > > >>>>>>>>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>> > > >>>>>>>>>>> > >>>>>>> >
>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>
>> -- Ryan Blue Databricks
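[Editor's illustration] The Spark Variant encoding discussed throughout this thread (see [2] above) stores each value as a metadata buffer plus a value buffer, where the first byte of the value buffer selects the basic type. As a rough, hedged sketch only (the field layout here is taken from my reading of the Spark variant README linked above, not from an official implementation), decoding that header byte looks roughly like:

```python
# Hedged sketch of splitting the first header byte of a Spark Variant
# value. Assumption drawn from the Spark variant README: the low 2 bits
# of the header are the basic type, and the high 6 bits carry
# type-specific info (e.g. the length of a short string).
BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}

def read_value_header(value: bytes) -> tuple:
    """Return (basic_type, type_info) for the first byte of a Variant value."""
    header = value[0]
    basic_type = header & 0b11   # low 2 bits select the basic type
    type_info = header >> 2      # high 6 bits are type-specific info
    return BASIC_TYPES[basic_type], type_info

# Under the assumed layout, a header byte of 0b0001_0101 reads as a
# short string whose length (type info) is 5.
print(read_value_header(bytes([0b00010101])))
```

The point of the sketch is why the encoding is engine- and file-format-agnostic: any implementation (Java in Iceberg, C++ in velox, Rust in datafusion) only needs this byte-level layout, not any Spark library.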
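[Editor's illustration] The subcolumnarization discussion above boils down to tracking per-file pruning statistics (min, max, null counts, etc.) for selected nested fields of a variant column so that engines can skip files entirely. A minimal sketch of that pruning idea follows; the field paths and stats layout are hypothetical, purely to illustrate the mechanism, and are not the proposed Iceberg metadata:

```python
# Minimal sketch of data skipping with subcolumn statistics. The
# "subcolumn_stats" layout and the "$.event.ts" path are invented for
# illustration; they do not reflect any actual spec.
def prune_files(files, path, lo, hi):
    """Keep only files whose [min, max] range for `path` can overlap [lo, hi]."""
    survivors = []
    for f in files:
        stats = f["subcolumn_stats"].get(path)
        if stats is None:
            survivors.append(f)  # no stats tracked: must read the file
        elif stats["max"] >= lo and stats["min"] <= hi:
            survivors.append(f)  # ranges overlap: file may contain matches
    return survivors

files = [
    {"name": "a.parquet", "subcolumn_stats": {"$.event.ts": {"min": 10, "max": 20}}},
    {"name": "b.parquet", "subcolumn_stats": {"$.event.ts": {"min": 30, "max": 40}}},
    {"name": "c.parquet", "subcolumn_stats": {}},  # field never subcolumnarized
]
# Query for 15 <= $.event.ts <= 25: b.parquet can be skipped.
print([f["name"] for f in prune_files(files, "$.event.ts", 15, 25)])
```

This is why the thread notes that without subcolumnarization every query touching a variant column must read, parse, extract, and filter every non-null row: with no per-field statistics, no file can ever be skipped.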