Feels like the encoding should eventually land in Parquet proper, right? I'm fine with us just copying it into Iceberg for the time being, though.
On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:

> Oops, it looks like I missed where Aihua brought this up in his last email:
>
> > do we have an issue to directly use Spark implementation in Iceberg?
>
> Yes, I think that we do have an issue using the Spark library. What do you
> think about a Java implementation in Iceberg?
>
> Ryan
>
> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>
>> I raised the same point from Peter's email in a comment on the doc as
>> well. There is a spark-variant_2.13 artifact that would be a much smaller
>> scope than relying on large portions of Spark, but even then I doubt that
>> it is a good idea for Iceberg to depend on it, because it is a Scala
>> artifact and we would need to bring in a ton of Scala libs. I think what
>> makes the most sense is to have an independent implementation of the spec
>> in Iceberg.
>>
>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Aihua,
>>> Long time no see :)
>>> Would this mean that every engine which plans to support the Variant
>>> data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>> Thanks, Peter
>>>
>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> Yeah. That's another reason we want to pursue the Spark encoding: to
>>>> keep compatibility for the open source engines.
>>>>
>>>> One more question regarding the encoding implementation: do we have an
>>>> issue to directly use the Spark implementation in Iceberg? Russell
>>>> pointed out that Trino doesn't have a Spark dependency and that could
>>>> be a problem?
>>>>
>>>> Thanks,
>>>> Aihua
>>>>
>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>> > Thanks, Aihua!
>>>> >
>>>> > I think that the encoding choice in the current doc is a good one.
>>>> > I went through the Spark encoding in detail and it looks like a
>>>> > better choice than the other candidate encodings for quickly
>>>> > accessing nested fields.
>>>> >
>>>> > Another reason to use the Spark type is that this is what Delta's
>>>> > variant type is based on, so Parquet files in tables written by Delta
>>>> > could be converted or used in Iceberg tables without needing to
>>>> > rewrite variant data. (Also, note that I work at Databricks and have
>>>> > an interest in increasing format compatibility.)
>>>> >
>>>> > Ryan
>>>> >
>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu
>>>> > <aihua...@snowflake.com.invalid> wrote:
>>>> >
>>>> > > [Discuss] Consensus for Variant Encoding
>>>> > >
>>>> > > It was great to be able to present the Variant type proposal at the
>>>> > > community sync yesterday, and I'm looking to host a meeting next
>>>> > > week (targeting 9am, July 17th) to go over any further concerns
>>>> > > about the encoding of the Variant type and any other questions on
>>>> > > the first phase of the proposal
>>>> > > <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>.
>>>> > > We are hoping that anyone who is interested in the proposal can
>>>> > > either join or reply with their comments so we can discuss them. A
>>>> > > summary of the discussion and notes will be sent to the mailing
>>>> > > list for further comment there.
>>>> > >
>>>> > > - What should be the underlying binary representation?
>>>> > >
>>>> > > We have evaluated a few encodings in the doc, including ION, JSONB,
>>>> > > and the Spark encoding. Choosing the underlying encoding is an
>>>> > > important first step here, and we believe we have general support
>>>> > > for Spark's Variant encoding. We would like to hear if anyone else
>>>> > > has strong opinions in this space.
>>>> > >
>>>> > > - Should we support multiple logical types or just Variant?
>>>> > > Variant vs. Variant + JSON.
>>>> > >
>>>> > > This is to discuss which logical data type(s) should be supported
>>>> > > in Iceberg: Variant only vs. Variant + JSON. Both types would share
>>>> > > the same underlying encoding but would imply different limitations
>>>> > > on engines working with those types.
>>>> > >
>>>> > > From the sync meeting, we lean toward supporting Variant only, and
>>>> > > we want to reach consensus on the supported type(s).
>>>> > >
>>>> > > - How should we move forward with subcolumnarization?
>>>> > >
>>>> > > Subcolumnarization is an optimization for the Variant type that
>>>> > > separates out subcolumns with their own metadata. This is not
>>>> > > critical for choosing the initial encoding of the Variant type, so
>>>> > > we were hoping to gain consensus on leaving that for a follow-up
>>>> > > spec.
>>>> > >
>>>> > > Thanks,
>>>> > > Aihua
>>>> > >
>>>> > > Meeting invite:
>>>> > >
>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>> > > Time zone: America/Los_Angeles
>>>> > > Google Meet joining info
>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> > >
>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hello,
>>>> > >>
>>>> > >> We have drafted the proposal
>>>> > >> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
>>>> > >> for the Variant data type. Please help review and comment.
>>>> > >>
>>>> > >> Thanks,
>>>> > >> Aihua
>>>> > >>
>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com>
>>>> > >> wrote:
>>>> > >>
>>>> > >>> +10000 for a JSON/BSON type.
>>>> > >>> We also had the same discussion internally, and a JSON type
>>>> > >>> would really play well with, for example, the SUPER type in
>>>> > >>> Redshift:
>>>> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
>>>> > >>> and can also provide better integration with the Trino JSON type.
>>>> > >>>
>>>> > >>> Looking forward to the proposal!
>>>> > >>>
>>>> > >>> Best,
>>>> > >>> Jack Ye
>>>> > >>>
>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>
>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com>
>>>> > >>>> wrote:
>>>> > >>>>
>>>> > >>>>> > We may need some guidance on just how many we need to look at;
>>>> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
>>>> > >>>>> > further down the rabbit hole we needed to go.
>>>> > >>>>>
>>>> > >>>>> There are some engines living outside the Java world. It would
>>>> > >>>>> be good if the proposal could cover the effort it takes to
>>>> > >>>>> integrate the variant type into them (e.g. Velox, DataFusion,
>>>> > >>>>> etc.). This is something that some proprietary Iceberg vendors
>>>> > >>>>> also care about.
>>>> > >>>>>
>>>> > >>>> Ack, makes sense. We can make sure to share some perspective on
>>>> > >>>> this.
>>>> > >>>>
>>>> > >>>>> > Not necessarily, no. As long as there's a binary type and
>>>> > >>>>> > Iceberg and the query engines are aware that the binary column
>>>> > >>>>> > needs to be interpreted as a variant, that should be
>>>> > >>>>> > sufficient.
>>>> > >>>>>
>>>> > >>>>> From the perspective of interoperability, it would be good to
>>>> > >>>>> support a native type in the file specs. Life will be easier
>>>> > >>>>> for projects like Apache XTable. The file format could also
>>>> > >>>>> provide finer-grained statistics for the variant type, which
>>>> > >>>>> facilitates data skipping.
>>>> > >>>>>
>>>> > >>>> Agreed, there can definitely be additional value in native file
>>>> > >>>> format integration. Just wanted to highlight that it's not a
>>>> > >>>> strict requirement.
>>>> > >>>>
>>>> > >>>> -Tyler
>>>> > >>>>
>>>> > >>>>> Gang
>>>> > >>>>>
>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>>>
>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>> > >>>>>>
>>>> > >>>>>> -Tyler
>>>> > >>>>>>
>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré
>>>> > >>>>>> <j...@nanthrax.net> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> Hi Tyler,
>>>> > >>>>>>>
>>>> > >>>>>>> Super happy to see you there :) It reminds me of our
>>>> > >>>>>>> discussions back at the start of Apache Beam :)
>>>> > >>>>>>>
>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some
>>>> > >>>>>>> discussions about a JSON data type for spec v3. The binary
>>>> > >>>>>>> data type is already supported in spec v2.
>>>> > >>>>>>>
>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>> > >>>>>>>
>>>> > >>>>>>> Regards,
>>>> > >>>>>>> JB
>>>> > >>>>>>>
>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>>>>> >
>>>> > >>>>>>> > Hello,
>>>> > >>>>>>> >
>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal
>>>> > >>>>>>> > for which we'd like to get early feedback from the
>>>> > >>>>>>> > community. As you may know, Snowflake has embraced Iceberg
>>>> > >>>>>>> > as its open Data Lake format.
>>>> > >>>>>>> > Having made good progress on our own adoption of the
>>>> > >>>>>>> > Iceberg standard, we're now in a position where there are
>>>> > >>>>>>> > features not yet supported in Iceberg which we think would
>>>> > >>>>>>> > be valuable for our users, and that we would like to discuss
>>>> > >>>>>>> > with and help contribute to the Iceberg community.
>>>> > >>>>>>> >
>>>> > >>>>>>> > The first two such features we'd like to discuss are in
>>>> > >>>>>>> > support of efficient querying of dynamically typed,
>>>> > >>>>>>> > semi-structured data: variant data types, and
>>>> > >>>>>>> > subcolumnarization of variant columns. In more detail, for
>>>> > >>>>>>> > anyone who may not already be familiar:
>>>> > >>>>>>> >
>>>> > >>>>>>> > 1. Variant data types
>>>> > >>>>>>> > Variant types allow for the efficient binary encoding of
>>>> > >>>>>>> > dynamic semi-structured data such as JSON, Avro, etc. By
>>>> > >>>>>>> > encoding semi-structured data as a variant column, we retain
>>>> > >>>>>>> > the flexibility of the source data while allowing query
>>>> > >>>>>>> > engines to operate on the data more efficiently. Snowflake
>>>> > >>>>>>> > has supported the variant data type on Snowflake tables for
>>>> > >>>>>>> > many years [1]. As more and more users utilize Iceberg
>>>> > >>>>>>> > tables in Snowflake, we're hearing an increasing chorus of
>>>> > >>>>>>> > requests for variant support. Additionally, other query
>>>> > >>>>>>> > engines such as Apache Spark have begun adding variant
>>>> > >>>>>>> > support [2]. As such, we believe it would be beneficial to
>>>> > >>>>>>> > the Iceberg community as a whole to standardize on the
>>>> > >>>>>>> > variant data type encoding used across Iceberg tables.
>>>> > >>>>>>> >
>>>> > >>>>>>> > One specific point to make here is that, since an Apache
>>>> > >>>>>>> > OSS version of variant encoding already exists in Spark, it
>>>> > >>>>>>> > likely makes sense to simply adopt the Spark encoding as the
>>>> > >>>>>>> > Iceberg standard as well. The encoding we use internally
>>>> > >>>>>>> > today in Snowflake is slightly different, but essentially
>>>> > >>>>>>> > equivalent, and we see no particular value in trying to
>>>> > >>>>>>> > clutter the space with another equivalent-but-incompatible
>>>> > >>>>>>> > encoding.
>>>> > >>>>>>> >
>>>> > >>>>>>> > 2. Subcolumnarization
>>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines
>>>> > >>>>>>> > to efficiently prune datasets when subcolumns (i.e., nested
>>>> > >>>>>>> > fields) within a variant column are queried, and also
>>>> > >>>>>>> > allows optionally materializing some of the nested fields
>>>> > >>>>>>> > as columns of their own, affording queries on these
>>>> > >>>>>>> > subcolumns the ability to read less data and spend less CPU
>>>> > >>>>>>> > on extraction. When subcolumnarizing, the system managing
>>>> > >>>>>>> > table metadata and data tracks individual pruning
>>>> > >>>>>>> > statistics (min, max, null, etc.) for some subset of the
>>>> > >>>>>>> > nested fields within a variant, and also manages any
>>>> > >>>>>>> > optional materialization. Without subcolumnarization, any
>>>> > >>>>>>> > query which touches a variant column must read, parse,
>>>> > >>>>>>> > extract, and filter every row for which that column is
>>>> > >>>>>>> > non-null. Thus, by providing a standardized way of tracking
>>>> > >>>>>>> > subcolumn metadata and data for variant columns, Iceberg
>>>> > >>>>>>> > can make subcolumnar optimizations accessible across
>>>> > >>>>>>> > various catalogs and query engines.
>>>> > >>>>>>> >
>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
>>>> > >>>>>>> > concrete proposal to include not only the set of changes to
>>>> > >>>>>>> > Iceberg metadata that allow compatible query engines to
>>>> > >>>>>>> > interoperate on subcolumnarization data for variant
>>>> > >>>>>>> > columns, but also reference documentation explaining
>>>> > >>>>>>> > subcolumnarization principles and recommended best
>>>> > >>>>>>> > practices.
>>>> > >>>>>>> >
>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good
>>>> > >>>>>>> > starting point for how to approach this, so our plan is to
>>>> > >>>>>>> > write something up in that vein that covers the proposed
>>>> > >>>>>>> > spec changes, backwards compatibility, implementor burdens,
>>>> > >>>>>>> > etc. But we wanted to first reach out to the community to
>>>> > >>>>>>> > introduce ourselves and the idea, and see if there's any
>>>> > >>>>>>> > early feedback we should incorporate before we spend too
>>>> > >>>>>>> > much time on a concrete proposal.
>>>> > >>>>>>> >
>>>> > >>>>>>> > Thank you!
>>>> > >>>>>>> >
>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> > >>>>>>> >
>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Databricks
>>
>> --
>> Ryan Blue
>> Databricks
>
> --
> Ryan Blue
> Databricks
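[Editor's note, not part of the thread] For readers who haven't opened the Spark variant README linked as [2] above: the "quickly accessing nested fields" property discussed in the thread comes from the value's self-describing binary header. The sketch below is an illustrative reading of that encoding, not the official implementation; the function name and labels are our own, and only the first few primitive type ids are shown.

```python
# Illustrative sketch of the Spark Variant value header, based on the
# encoding described in the Spark common/variant README ([2] above).
# Not the official implementation; names and labels here are our own.

def classify_variant_value(value: bytes) -> str:
    """Label a Variant value by the basic type encoded in its first byte."""
    header = value[0]
    basic_type = header & 0x03  # low 2 bits: 0=primitive, 1=short string, 2=object, 3=array
    type_info = header >> 2     # remaining 6 bits: meaning depends on basic_type

    if basic_type == 0:
        # Primitive: type_info selects the concrete type; only the first
        # few ids are shown here (see the README for the full table).
        return {0: "null", 1: "true", 2: "false"}.get(
            type_info, f"primitive(id={type_info})")
    if basic_type == 1:
        # Short string: type_info is the byte length; UTF-8 bytes follow.
        return f"short string ({type_info} bytes): " \
               f"{value[1:1 + type_info].decode('utf-8')!r}"
    return "object" if basic_type == 2 else "array"
```

For example, a value whose first byte is 0x00 decodes as a primitive null, and 0x09 followed by two UTF-8 bytes is a 2-byte short string. Because objects and arrays carry field offsets rather than text, an engine can jump to a nested field without parsing the whole value, which is why the thread favors this encoding over text-based alternatives.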