Feels like eventually the encoding should land in Parquet proper, right? I'm
fine with us just copying into Iceberg though for the time being.

On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid>
wrote:

> Oops, it looks like I missed where Aihua brought this up in his last email:
>
> > do we have an issue to directly use Spark implementation in Iceberg?
>
> Yes, I think that we do have an issue using the Spark library. What do you
> think about a Java implementation in Iceberg?
>
> Ryan
>
> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>
>> I raised the same point from Peter's email in a comment on the doc as
>> well. There is a spark-variant_2.13 artifact that would be a much smaller
> scope than relying on large portions of Spark, but even then I doubt that
>> it is a good idea for Iceberg to depend on that because it is a Scala
>> artifact and we would need to bring in a ton of Scala libs. I think what
>> makes the most sense is to have an independent implementation of the spec
>> in Iceberg.
>>
>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Aihua,
>>> Long time no see :)
>>> Would this mean that every engine which plans to support the Variant data
>>> type needs to add Spark as a dependency? Like Flink/Trino/Hive etc?
>>> Thanks, Peter
>>>
>>>
>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> Yeah. That's another reason we want to pursue the Spark encoding to keep
>>>> compatibility with the open source engines.
>>>>
>>>> One more question regarding the encoding implementation: do we have an
>>>> issue to directly use Spark implementation in Iceberg? Russell pointed out
>>>> that Trino doesn't have Spark dependency and that could be a problem?
>>>>
>>>> Thanks,
>>>> Aihua
>>>>
>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>> > Thanks, Aihua!
>>>> >
>>>> > I think that the encoding choice in the current doc is a good one. I
>>>> went
>>>> > through the Spark encoding in detail and it looks like a better
>>>> choice than
>>>> > the other candidate encodings for quickly accessing nested fields.
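The fast nested-field access starts with a one-byte value header: the low two bits select a basic type (primitive, short string, object, array) and the high six bits carry type-specific info, per the Spark variant README. A minimal sketch in Java; the class and constant names are illustrative, not the actual Spark or Iceberg API:

```java
// Sketch of decoding a Spark-style variant value header byte.
// Basic-type ids follow the variant README; names are illustrative.
public class VariantHeader {
    public static final int PRIMITIVE = 0;
    public static final int SHORT_STRING = 1;
    public static final int OBJECT = 2;
    public static final int ARRAY = 3;

    // Low 2 bits of the header select the basic type.
    public static int basicType(byte header) {
        return header & 0x03;
    }

    // High 6 bits carry type-specific info (e.g. a primitive type id,
    // or the length of a short string).
    public static int typeInfo(byte header) {
        return (header >> 2) & 0x3F;
    }
}
```

A short string of length 5, for instance, would carry the header byte `(5 << 2) | 1`.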
>>>> >
>>>> > Another reason to use the Spark type is that this is what Delta's
>>>> variant
>>>> > type is based on, so Parquet files in tables written by Delta could be
>>>> > converted or used in Iceberg tables without needing to rewrite variant
>>>> > data. (Also, note that I work at Databricks and have an interest in
>>>> > increasing format compatibility.)
>>>> >
>>>> > Ryan
>>>> >
>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com
>>>> .invalid>
>>>> > wrote:
>>>> >
>>>> > > [Discuss] Consensus for Variant Encoding
>>>> > >
>>>> > > It’s great to be able to present the Variant type proposal in the
>>>> > > community sync yesterday and I’m looking to host a meeting next week
>>>> > > (targeting 9am on July 17th) to go over any further concerns
>>>> about the
>>>> > > encoding of the Variant type and any other questions on the first
>>>> phase of
>>>> > > the proposal
>>>> > > <
>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>> >.
>>>> > > We are hoping that anyone who is interested in the proposal can
>>>> either join
>>>> > > or reply with their comments so we can discuss them. Summary of the
>>>> > > discussion and notes will be sent to the mailing list for further
>>>> comment
>>>> > > there.
>>>> > >
>>>> > >
>>>> > >    -
>>>> > >
>>>> > >    What should be the underlying binary representation
>>>> > >
>>>> > > We have evaluated a few encodings in the doc including ION, JSONB,
>>>> and
>>>> > > Spark encoding. Choosing the underlying encoding is an important
>>>> first step
>>>> > > here and we believe we have general support for Spark’s Variant
>>>> encoding.
>>>> > > We would like to hear if anyone else has strong opinions in this
>>>> space.
>>>> > >
>>>> > >
>>>> > >    -
>>>> > >
>>>> > >    Should we support multiple logical types or just Variant?
>>>> Variant vs.
>>>> > >    Variant + JSON.
>>>> > >
>>>> > > This is to discuss what logical data type(s) should be supported in
>>>> Iceberg -
>>>> > > Variant only vs. Variant + JSON. Both types would share the same
>>>> underlying
>>>> > > encoding but would imply different limitations on engines working
>>>> with
>>>> > > those types.
>>>> > >
>>>> > > From the sync-up meeting, we are leaning toward supporting Variant
>>>> > > only, and we want to reach consensus on the supported type(s).
>>>> > >
>>>> > >
>>>> > >    -
>>>> > >
>>>> > >    How should we move forward with Subcolumnization?
>>>> > >
>>>> > > Subcolumnization is an optimization for the Variant type that
>>>> > > separates out subcolumns with their own metadata. This is not critical for
>>>> choosing the
>>>> > > initial encoding of the Variant type so we were hoping to gain
>>>> consensus on
>>>> > > leaving that for a follow-up spec.
>>>> > >
>>>> > >
>>>> > > Thanks
>>>> > >
>>>> > > Aihua
>>>> > >
>>>> > > Meeting invite:
>>>> > >
>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>> > > Time zone: America/Los_Angeles
>>>> > > Google Meet joining info
>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576 525‬#
>>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> > >
>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com>
>>>> wrote:
>>>> > >
>>>> > >> Hello,
>>>> > >>
>>>> > >> We have drafted the proposal
>>>> > >> <
>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>> >
>>>> > >> for Variant data type. Please help review and comment.
>>>> > >>
>>>> > >> Thanks,
>>>> > >> Aihua
>>>> > >>
>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com>
>>>> wrote:
>>>> > >>
>>>> > >>> +10000 for a JSON/BSON type. We also had the same discussion
>>>> internally
>>>> > >>> and a JSON type would really play well with, for example, the SUPER
>>>> type in
>>>> > >>> Redshift:
>>>> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
>>>> and
>>>> > >>> can also provide better integration with the Trino JSON type.
>>>> > >>>
>>>> > >>> Looking forward to the proposal!
>>>> > >>>
>>>> > >>> Best,
>>>> > >>> Jack Ye
>>>> > >>>
>>>> > >>>
>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>
>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com>
>>>> wrote:
>>>> > >>>>
>>>> > >>>>> > We may need some guidance on just how many we need to look at;
>>>> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
>>>> > >>>>> > further down the rabbit hole we needed to go.
>>>> > >>>>>
>>>> > >>>>> There are some engines living outside the Java world. It would
>>>> be
>>>> > >>>>> good if the proposal could cover the effort it takes to
>>>> integrate
>>>> > >>>>> variant type to them (e.g. velox, datafusion, etc.). This is
>>>> something
>>>> > >>>>> that
>>>> > >>>>> some proprietary iceberg vendors also care about.
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> Ack, makes sense. We can make sure to share some perspective on
>>>> this.
>>>> > >>>>
>>>> > >>>> > Not necessarily, no. As long as there's a binary type and
>>>> Iceberg and
>>>> > >>>>> > the query engines are aware that the binary column needs to be
>>>> > >>>>> > interpreted as a variant, that should be sufficient.
>>>> > >>>>>
>>>> > >>>>> From the perspective of interoperability, it would be good to
>>>> support
>>>> > >>>>> native
>>>> > >>>>> type from file specs. Life will be easier for projects like
>>>> Apache
>>>> > >>>>> XTable.
>>>> > >>>>> File format could also provide finer-grained statistics for
>>>> variant
>>>> > >>>>> type which
>>>> > >>>>> facilitates data skipping.
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> Agreed, there can definitely be additional value in native file
>>>> format
>>>> > >>>> integration. Just wanted to highlight that it's not a strict
>>>> requirement.
>>>> > >>>>
>>>> > >>>> -Tyler
>>>> > >>>>
>>>> > >>>>
>>>> > >>>>>
>>>> > >>>>> Gang
>>>> > >>>>>
>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>>>
>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>> > >>>>>>
>>>> > >>>>>> -Tyler
>>>> > >>>>>>
>>>> > >>>>>>
>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <
>>>> j...@nanthrax.net>
>>>> > >>>>>> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> Hi Tyler,
>>>> > >>>>>>>
>>>> > >>>>>>> Super happy to see you there :) It reminds me of our discussions
>>>> > >>>>>>> back at the start of Apache Beam :)
>>>> > >>>>>>>
>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some
>>>> discussions
>>>> > >>>>>>> about JSON datatype for spec v3. The binary data type is
>>>> already
>>>> > >>>>>>> supported in the spec v2.
>>>> > >>>>>>>
>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this
>>>> !
>>>> > >>>>>>>
>>>> > >>>>>>> Regards
>>>> > >>>>>>> JB
>>>> > >>>>>>>
>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > >>>>>>> >
>>>> > >>>>>>> > Hello,
>>>> > >>>>>>> >
>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a
>>>> proposal for
>>>> > >>>>>>> which we’d like to get early feedback from the community. As
>>>> you may know,
>>>> > >>>>>>> Snowflake has embraced Iceberg as its open Data Lake format.
>>>> Having made
>>>> > >>>>>>> good progress on our own adoption of the Iceberg standard,
>>>> we’re now in a
>>>> > >>>>>>> position where there are features not yet supported in
>>>> Iceberg which we
>>>> > >>>>>>> think would be valuable for our users, and that we would like
>>>> to discuss
>>>> > >>>>>>> with and help contribute to the Iceberg community.
>>>> > >>>>>>> >
>>>> > >>>>>>> > The first two such features we’d like to discuss are in
>>>> support of
>>>> > >>>>>>> efficient querying of dynamically typed, semi-structured
>>>> data: variant data
>>>> > >>>>>>> types, and subcolumnarization of variant columns. In more
>>>> detail, for
>>>> > >>>>>>> anyone who may not already be familiar:
>>>> > >>>>>>> >
>>>> > >>>>>>> > 1. Variant data types
>>>> > >>>>>>> > Variant types allow for the efficient binary encoding of
>>>> dynamic
>>>> > >>>>>>> semi-structured data such as JSON, Avro, etc. By encoding
>>>> semi-structured
>>>> > >>>>>>> data as a variant column, we retain the flexibility of the
>>>> source data,
>>>> > >>>>>>> while allowing query engines to more efficiently operate on
>>>> the data.
>>>> > >>>>>>> Snowflake has supported the variant data type on Snowflake
>>>> tables for many
>>>> > >>>>>>> years [1]. As more and more users utilize Iceberg tables in
>>>> Snowflake,
>>>> > >>>>>>> we’re hearing an increasing chorus of requests for variant
>>>> support.
>>>> > >>>>>>> Additionally, other query engines such as Apache Spark have
>>>> begun adding
>>>> > >>>>>>> variant support [2]. As such, we believe it would be
>>>> beneficial to the
>>>> > >>>>>>> Iceberg community as a whole to standardize on the variant
>>>> data type
>>>> > >>>>>>> encoding used across Iceberg tables.
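As a concrete (and hedged) illustration of what the binary encoding buys over JSON text: a scalar is stored as one header byte plus a fixed-width little-endian payload. The sketch below round-trips an int32, assuming the primitive type id for int32 is 5 as in the Spark variant README at the time of writing; the class is illustrative, not Spark's API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative round-trip of an int32 as a Spark-style variant value:
// one header byte, then the little-endian payload.
public class VariantInt32 {
    static final int BASIC_PRIMITIVE = 0; // low 2 bits of the header
    static final int TYPE_INT32 = 5;      // assumed primitive type id

    public static byte[] encode(int v) {
        ByteBuffer buf = ByteBuffer.allocate(5).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) ((TYPE_INT32 << 2) | BASIC_PRIMITIVE));
        buf.putInt(v);
        return buf.array();
    }

    public static int decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        byte header = buf.get();
        if ((header & 0x03) != BASIC_PRIMITIVE || (header >> 2) != TYPE_INT32) {
            throw new IllegalArgumentException("not an int32 variant value");
        }
        return buf.getInt();
    }
}
```

An engine reading such a value never parses text; it branches on one byte and copies four more.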
>>>> > >>>>>>> >
>>>> > >>>>>>> > One specific point to make here is that, since an Apache OSS
>>>> > >>>>>>> version of variant encoding already exists in Spark, it
>>>> likely makes sense
>>>> > >>>>>>> to simply adopt the Spark encoding as the Iceberg standard as
>>>> well. The
>>>> > >>>>>>> encoding we use internally today in Snowflake is slightly
>>>> different, but
>>>> > >>>>>>> essentially equivalent, and we see no particular value in
>>>> trying to clutter
>>>> > >>>>>>> the space with another equivalent-but-incompatible encoding.
>>>> > >>>>>>> >
>>>> > >>>>>>> >
>>>> > >>>>>>> > 2. Subcolumnarization
>>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines
>>>> to
>>>> > >>>>>>> efficiently prune datasets when subcolumns (i.e., nested
>>>> fields) within a
>>>> > >>>>>>> variant column are queried, and also allows optionally
>>>> materializing some
>>>> > >>>>>>> of the nested fields as a column on their own, affording
>>>> queries on these
>>>> > >>>>>>> subcolumns the ability to read less data and spend less CPU
>>>> on extraction.
>>>> > >>>>>>> When subcolumnarizing, the system managing table metadata and
>>>> data tracks
>>>> > >>>>>>> individual pruning statistics (min, max, null, etc.) for some
>>>> subset of the
>>>> > >>>>>>> nested fields within a variant, and also manages any optional
>>>> > >>>>>>> materialization. Without subcolumnarization, any query which
>>>> touches a
>>>> > >>>>>>> variant column must read, parse, extract, and filter every
>>>> row for which
>>>> > >>>>>>> that column is non-null. Thus, by providing a standardized
>>>> way of tracking
>>>> > >>>>>>> subcolumn metadata and data for variant columns, Iceberg can
>>>> make
>>>> > >>>>>>> subcolumnar optimizations accessible across various catalogs
>>>> and query
>>>> > >>>>>>> engines.
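The pruning described above can be pictured with a toy example: if per-file min/max stats were tracked for a shredded subcolumn (the path string and stats structure below are made up for illustration, not actual Iceberg metadata), an equality predicate can skip a file without touching any variant bytes.

```java
import java.util.Map;

// Toy illustration of subcolumn pruning: per-file min/max stats for a
// nested field let a scan skip files that cannot match a predicate.
public class SubcolumnPruning {
    public record IntRange(int min, int max) {}

    // True if the file may contain rows where the subcolumn equals value.
    public static boolean mayMatch(Map<String, IntRange> fileStats,
                                   String path, int value) {
        IntRange r = fileStats.get(path);
        if (r == null) {
            return true; // no stats for this subcolumn: must read the file
        }
        return value >= r.min() && value <= r.max();
    }
}
```

A file whose `$.event.id` values span [10, 20] would be skipped for `id = 99` but read for `id = 15`.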
>>>> > >>>>>>> >
>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
>>>> > >>>>>>> concrete proposal to include not only the set of changes to
>>>> Iceberg
>>>> > >>>>>>> metadata that allow compatible query engines to interoperate on
>>>> > >>>>>>> subcolumnarization data for variant columns, but also
>>>> reference
>>>> > >>>>>>> documentation explaining subcolumnarization principles and
>>>> recommended best
>>>> > >>>>>>> practices.
>>>> > >>>>>>> >
>>>> > >>>>>>> >
>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good
>>>> starting
>>>> > >>>>>>> point for how to approach this, so our plan is to write
>>>> something up in
>>>> > >>>>>>> that vein that covers the proposed spec changes, backwards
>>>> compatibility,
>>>> > >>>>>>> implementor burdens, etc. But we wanted to first reach out to
>>>> the community
>>>> > >>>>>>> to introduce ourselves and the idea, and see if there’s any
>>>> early feedback
>>>> > >>>>>>> we should incorporate before we spend too much time on a
>>>> concrete proposal.
>>>> > >>>>>>> >
>>>> > >>>>>>> > Thank you!
>>>> > >>>>>>> >
>>>> > >>>>>>> > [1]
>>>> > >>>>>>>
>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > >>>>>>> > [2]
>>>> > >>>>>>>
>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > >>>>>>> > [3]
>>>> > >>>>>>>
>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> > >>>>>>> >
>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>> > >>>>>>> >
>>>> > >>>>>>>
>>>> > >>>>>>
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Databricks
>>>> >
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>
>
> --
> Ryan Blue
> Databricks
>
