> Feels like eventually the encoding should land in parquet proper right?
What about using it in ORC? I don't know where it should end up. Maybe Iceberg should make a standalone module from it?

On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Feels like eventually the encoding should land in parquet proper right? I'm fine with us just copying into Iceberg though for the time being.
>
> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>
>> Oops, it looks like I missed where Aihua brought this up in his last email:
>>
>> > do we have an issue to directly use Spark implementation in Iceberg?
>>
>> Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?
>>
>> Ryan
>>
>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>>
>>> I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on it, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
>>>
>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Aihua,
>>>> Long time no see :)
>>>> Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>>> Thanks, Peter
>>>>
>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>>
>>>>> Thanks Ryan.
>>>>>
>>>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep compatibility across the open source engines.
>>>>>
>>>>> One more question regarding the encoding implementation: do we have an issue to directly use the Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency, and that could be a problem.
>>>>>
>>>>> Thanks,
>>>>> Aihua
>>>>>
>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>> Thanks, Aihua!
>>>>>>
>>>>>> I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
>>>>>>
>>>>>> Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
>>>>>>
>>>>>> Ryan
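For readers following the encoding discussion: the Spark encoding Ryan refers to represents each variant as two binaries, metadata (a dictionary of field-name strings) plus value, as described in the Spark variant README ([2] in the original proposal below). A minimal, illustrative sketch of that shape, with hypothetical names and only the version check implemented; consult the README for the authoritative layout:

    // Illustrative container mirroring the two-binary variant layout from the
    // Spark README. Names are hypothetical, not from Iceberg or Spark APIs.
    public final class VariantSketch {
      private final byte[] metadata; // header byte, then a dictionary of field-name strings
      private final byte[] value;    // encoded value; object fields reference dictionary ids

      public VariantSketch(byte[] metadata, byte[] value) {
        this.metadata = metadata;
        this.value = value;
      }

      // Per the README, the low 4 bits of the first metadata byte carry the version.
      public int metadataVersion() {
        return metadata[0] & 0x0F;
      }
    }

Because object fields reference dictionary ids and carry offsets, a reader can seek directly to one nested field without decoding the rest of the value, which is the fast nested-field access Ryan mentions above.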
>>>>>>
>>>>>> On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>>>>>
>>>>>>> [Discuss] Consensus for Variant Encoding
>>>>>>>
>>>>>>> It’s great to be able to present the Variant type proposal in the community sync yesterday, and I’m looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>>>>>>
>>>>>>> - What should be the underlying binary representation?
>>>>>>>
>>>>>>> We have evaluated a few encodings in the doc, including ION, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark’s Variant encoding. We would like to hear if anyone else has strong opinions in this space.
>>>>>>>
>>>>>>> - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>>>>>>
>>>>>>> This is to discuss what logical data type(s) should be supported in Iceberg: Variant only vs. Variant + JSON. Both types would share the same underlying encoding but would imply different limitations on engines working with those types.
>>>>>>>
>>>>>>> From the sync-up meeting, we are leaning toward supporting Variant only, and we want to reach consensus on the supported type(s).
>>>>>>>
>>>>>>> - How should we move forward with subcolumnarization?
>>>>>>>
>>>>>>> Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving that for a follow-up spec.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aihua
>>>>>>>
>>>>>>> Meeting invite:
>>>>>>> Wednesday, July 17 · 9:00 – 10:00am
>>>>>>> Time zone: America/Los_Angeles
>>>>>>> Google Meet joining info
>>>>>>> Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>>>>> Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>>>>> More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aihua
>>>>>>>>
>>>>>>>> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would really play well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>>>>>>>>>
>>>>>>>>> Looking forward to the proposal!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jack Ye
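As a concrete reference point for what engines expose on top of this encoding, here is a rough sketch of driving Spark's variant functions from Java, assuming a Spark build that includes the variant work ([2] below, where parse_json and variant_get were introduced); this is illustrative usage, not part of the Iceberg proposal:

    import org.apache.spark.sql.SparkSession;

    public class VariantUsageSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("variant-usage-sketch")
            .getOrCreate();

        // parse_json encodes arbitrary JSON as a VARIANT value;
        // variant_get extracts one nested field and casts it to a concrete type.
        spark.sql(
            "SELECT variant_get(parse_json('{\"event\": {\"id\": 42}}'), '$.event.id', 'int') AS id"
        ).show();

        spark.stop();
      }
    }

The point of the two-function surface is that the engine never needs a schema up front: the JSON keeps its flexibility at write time, and typing happens per-field at read time.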
>>>>>>>>>
>>>>>>>>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> > We may need some guidance on just how many we need to look at; we were planning on Spark and Trino, but weren't sure how much further down the rabbit hole we needed to go.
>>>>>>>>>>>
>>>>>>>>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. velox, datafusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>>>>>>>>>
>>>>>>>>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>>>>>>>>
>>>>>>>>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>>>>>>>>>
>>>>>>>>>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>>>>>>>>>
>>>>>>>>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>>>>>>>>>
>>>>>>>>>> -Tyler
>>>>>>>>>>
>>>>>>>>>>> Gang
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Good to see you again as well, JB! Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> -Tyler
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Tyler,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON datatype for spec v3. The binary data type is already supported in spec v2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> JB
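Gang's data-skipping point can be made concrete with a small, entirely hypothetical sketch: if per-subcolumn lower/upper bounds are kept as file-level statistics, a scan can skip files where an equality predicate on a nested variant field cannot match. None of these names come from Iceberg or any file format spec:

    import java.util.Map;

    // Hypothetical pruning check over per-subcolumn bounds statistics.
    public final class SkippingSketch {
      record Bounds(long lower, long upper) {}

      // Returns true when the file may contain rows matching `fieldPath == target`.
      static boolean mayMatch(Map<String, Bounds> fileStats, String fieldPath, long target) {
        Bounds b = fileStats.get(fieldPath);
        if (b == null) {
          return true; // no stats for this subcolumn: cannot prune, must read the file
        }
        return target >= b.lower && target <= b.upper;
      }

      public static void main(String[] args) {
        Map<String, Bounds> stats = Map.of("$.event.id", new Bounds(100, 200));
        System.out.println(mayMatch(stats, "$.event.id", 42));  // false -> file skipped
        System.out.println(mayMatch(stats, "$.event.id", 150)); // true  -> file read
      }
    }

Without such statistics, the only option is what Tyler describes below: read, parse, and filter every non-null variant row.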
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we’re now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The first two such features we’d like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Variant data types
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we’re hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Subcolumnarization
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns on their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
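A rough, hypothetical sketch of the materialization half of the subcolumnarization idea described above: one nested path is extracted into its own column, carrying its own pruning statistics, while staying row-aligned with the parent variant column. All names are illustrative, not from the proposal doc:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.OptionalLong;

    // Hypothetical illustration of materializing one nested field ("$.event.id")
    // out of a variant column, with per-column pruning statistics.
    public final class SubcolumnSketch {
      record Subcolumn(String path, List<Long> values, long min, long max, long nullCount) {}

      // Build a materialized subcolumn for `path` from already-extracted values.
      static Subcolumn materialize(String path, List<OptionalLong> extracted) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, nulls = 0;
        List<Long> values = new ArrayList<>();
        for (OptionalLong v : extracted) {
          if (v.isEmpty()) {
            nulls++;          // missing or non-numeric at this path
            values.add(null); // keep row alignment with the parent variant column
          } else {
            long x = v.getAsLong();
            min = Math.min(min, x);
            max = Math.max(max, x);
            values.add(x);
          }
        }
        return new Subcolumn(path, values, min, max, nulls);
      }

      public static void main(String[] args) {
        List<OptionalLong> ids = List.of(OptionalLong.of(7), OptionalLong.empty(), OptionalLong.of(42));
        Subcolumn col = materialize("$.event.id", ids);
        System.out.println(col.min() + ".." + col.max() + ", nulls=" + col.nullCount()); // 7..42, nulls=1
      }
    }

A query filtering on that path can then consult the subcolumn's min/max/null counts instead of decoding every variant value, which is exactly the pruning the paragraph above describes.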
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there’s any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>>>>>>>>>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>>>>>>>>>> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks