> Feels like eventually the encoding should land in parquet proper right?

What about using it in ORC? I don't know where it should end up. Maybe
Iceberg should make a standalone module out of it?

On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Feels like eventually the encoding should land in parquet proper right?
> I'm fine with us just copying into Iceberg though for the time being.
>
> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid>
> wrote:
>
>> Oops, it looks like I missed where Aihua brought this up in his last
>> email:
>>
>> > do we have any issue with directly using the Spark implementation in Iceberg?
>>
>> Yes, I think that we do have an issue using the Spark library. What do
>> you think about a Java implementation in Iceberg?
>>
>> Ryan
>>
>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>>
>>> I raised the same point from Peter's email in a comment on the doc as
>>> well. There is a spark-variant_2.13 artifact that would be a much smaller
>>> scope than relying on large portions of Spark, but even then I doubt that
>>> it is a good idea for Iceberg to depend on that because it is a Scala
>>> artifact and we would need to bring in a ton of Scala libs. I think what
>>> makes the most sense is to have an independent implementation of the spec
>>> in Iceberg.
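>>>
>>> To make "independent implementation" concrete, here is a rough sketch of
>>> the kind of small, dependency-free Java surface Iceberg could expose (all
>>> names here are hypothetical, not a settled API):
>>>
>>>   import java.nio.ByteBuffer;
>>>
>>>   /**
>>>    * Hypothetical Iceberg-native variant handle: a binary-encoded value
>>>    * plus the metadata (field-name dictionary) it was encoded against.
>>>    * Purely a sketch to frame the discussion.
>>>    */
>>>   public interface Variant {
>>>     // Raw value bytes, laid out per the (Spark-compatible) variant spec.
>>>     ByteBuffer value();
>>>
>>>     // Raw metadata bytes: version header plus the field-name dictionary.
>>>     ByteBuffer metadata();
>>>
>>>     // Convenience rendering for debugging and export.
>>>     String toJson();
>>>   }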
>>>
>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Aihua,
>>>> Long time no see :)
>>>> Would this mean that every engine that plans to support the Variant data
>>>> type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>>> Thanks, Peter
>>>>
>>>>
>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>>
>>>>> Thanks Ryan.
>>>>>
>>>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep
>>>>> compatibility across the open source engines.
>>>>>
>>>>> One more question regarding the encoding implementation: do we have any
>>>>> issue with directly using the Spark implementation in Iceberg? Russell
>>>>> pointed out that Trino doesn't have a Spark dependency, and that could be
>>>>> a problem.
>>>>>
>>>>> Thanks,
>>>>> Aihua
>>>>>
>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>> > Thanks, Aihua!
>>>>> >
>>>>> > I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
>>>>> >
>>>>> > Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
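>>>>> >
>>>>> > For anyone curious what "quickly accessing nested fields" buys in
>>>>> > practice: as I read the Spark README, object fields are laid out sorted
>>>>> > by field name, so a reader can binary-search for each path step instead
>>>>> > of parsing the whole document. A toy model of that lookup (a real
>>>>> > reader searches field-id/offset arrays in the binary, not a String[]):
>>>>> >
>>>>> >   class FieldLookup {
>>>>> >     // Toy model: locate a field in a variant object via binary search.
>>>>> >     // Assumes keys are stored in sorted order, per my reading of the
>>>>> >     // Spark variant spec.
>>>>> >     static int findField(String[] sortedKeys, String key) {
>>>>> >       int lo = 0;
>>>>> >       int hi = sortedKeys.length - 1;
>>>>> >       while (lo <= hi) {
>>>>> >         int mid = (lo + hi) >>> 1;
>>>>> >         int cmp = sortedKeys[mid].compareTo(key);
>>>>> >         if (cmp == 0) {
>>>>> >           return mid;  // field index, usable to fetch its value offset
>>>>> >         } else if (cmp < 0) {
>>>>> >           lo = mid + 1;
>>>>> >         } else {
>>>>> >           hi = mid - 1;
>>>>> >         }
>>>>> >       }
>>>>> >       return -1;       // field absent from this object
>>>>> >     }
>>>>> >   }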
>>>>> >
>>>>> > Ryan
>>>>> >
>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>>>> >
>>>>> > > [Discuss] Consensus for Variant Encoding
>>>>> > >
>>>>> > > It’s great to be able to present the Variant type proposal in the community sync yesterday, and I’m looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>.
>>>>> > > We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>>>> > >
>>>>> > >
>>>>> > >    - What should be the underlying binary representation?
>>>>> > >
>>>>> > > We have evaluated a few encodings in the doc, including Ion, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark’s Variant encoding. We would like to hear if anyone else has strong opinions in this space.
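>>>>> > >
>>>>> > > For readers new to the Spark encoding, here is a hand-rolled Java
>>>>> > > sketch of its two-part (metadata, value) layout for a trivial case,
>>>>> > > based on my reading of the Spark variant README; treat the byte
>>>>> > > positions as assumptions to verify against the spec:
>>>>> > >
>>>>> > >   import java.nio.charset.StandardCharsets;
>>>>> > >
>>>>> > >   public class TinyVariantSketch {
>>>>> > >     public static void main(String[] args) {
>>>>> > >       byte[] utf8 = "hi".getBytes(StandardCharsets.UTF_8);
>>>>> > >
>>>>> > >       // Value: one header byte, then the string bytes. The low 2 bits
>>>>> > >       // of the header carry the basic type (1 = short string) and the
>>>>> > >       // high 6 bits carry the byte length.
>>>>> > >       byte[] value = new byte[1 + utf8.length];
>>>>> > >       value[0] = (byte) ((utf8.length << 2) | 1);
>>>>> > >       System.arraycopy(utf8, 0, value, 1, utf8.length);
>>>>> > >
>>>>> > >       // Metadata: version 1 and an empty field-name dictionary, since
>>>>> > >       // a bare string needs no field names: header byte, dictionary
>>>>> > >       // size 0, and a single offset of 0.
>>>>> > >       byte[] metadata = {0x01, 0x00, 0x00};
>>>>> > >
>>>>> > >       int basicType = value[0] & 0b11;  // decodes back to 1
>>>>> > >       System.out.println("basicType=" + basicType + ", total bytes="
>>>>> > >           + (value.length + metadata.length));
>>>>> > >     }
>>>>> > >   }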
>>>>> > >
>>>>> > >
>>>>> > >    - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>>>> > >
>>>>> > > This is to discuss which logical data type(s) should be supported in Iceberg: Variant only vs. Variant + JSON. Both types would share the same underlying encoding but would imply different limitations on engines working with those types.
>>>>> > >
>>>>> > > From the sync meeting, we are leaning toward supporting Variant only, and we want to reach consensus on the supported type(s).
>>>>> > >
>>>>> > >
>>>>> > >    - How should we move forward with subcolumnarization?
>>>>> > >
>>>>> > > Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving it for a follow-up spec.
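>>>>> > >
>>>>> > > To illustrate what subcolumnarization could buy at planning time,
>>>>> > > here is a hypothetical Java sketch (shapes and names invented for
>>>>> > > illustration, not proposed metadata): if per-file min/max statistics
>>>>> > > were tracked for a nested field such as $.user.age, a planner could
>>>>> > > skip files whose value range cannot match a predicate:
>>>>> > >
>>>>> > >   import java.util.List;
>>>>> > >
>>>>> > >   public class SubcolumnPruningSketch {
>>>>> > >     // Invented shapes: per-file stats for one shredded variant field.
>>>>> > >     record FieldStats(String path, long min, long max) {}
>>>>> > >     record DataFile(String location, List<FieldStats> stats) {}
>>>>> > >
>>>>> > >     // Keep a file only if it might contain rows where path > threshold.
>>>>> > >     static boolean mightMatch(DataFile file, String path, long threshold) {
>>>>> > >       return file.stats().stream()
>>>>> > >           .filter(s -> s.path().equals(path))
>>>>> > >           .findFirst()
>>>>> > >           .map(s -> s.max() > threshold)
>>>>> > >           .orElse(true);  // no stats tracked: must read the file
>>>>> > >     }
>>>>> > >   }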
>>>>> > >
>>>>> > >
>>>>> > > Thanks
>>>>> > >
>>>>> > > Aihua
>>>>> > >
>>>>> > > Meeting invite:
>>>>> > >
>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>>> > > Time zone: America/Los_Angeles
>>>>> > > Google Meet joining info
>>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>>> > > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576 525‬#
>>>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>> > >
>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>>>> > >
>>>>> > >> Hello,
>>>>> > >>
>>>>> > >> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>>>> > >>
>>>>> > >> Thanks,
>>>>> > >> Aihua
>>>>> > >>
>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>> > >>
>>>>> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html), and could also provide better integration with the Trino JSON type.
>>>>> > >>>
>>>>> > >>> Looking forward to the proposal!
>>>>> > >>>
>>>>> > >>> Best,
>>>>> > >>> Jack Ye
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>> > >>>
>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>>>> > >>>>
>>>>> > >>>>> > We may need some guidance on just how many we need to look at; we were planning on Spark and Trino, but weren't sure how much further down the rabbit hole we needed to go.
>>>>> > >>>>>
>>>>> > >>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g., Velox, DataFusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>>>> > >>>>>
>>>>> > >>>>
>>>>> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>>> > >>>>
>>>>> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>>> > >>>>>
>>>>> > >>>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>>>> > >>>>>
>>>>> > >>>>
>>>>> > >>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>>>> > >>>>
>>>>> > >>>> -Tyler
>>>>> > >>>>
>>>>> > >>>>
>>>>> > >>>>>
>>>>> > >>>>> Gang
>>>>> > >>>>>
>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>> > >>>>>
>>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>>> > >>>>>>
>>>>> > >>>>>> -Tyler
>>>>> > >>>>>>
>>>>> > >>>>>>
>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> Hi Tyler,
>>>>> > >>>>>>>
>>>>> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>>> > >>>>>>>
>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>>>> > >>>>>>>
>>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>> > >>>>>>>
>>>>> > >>>>>>> Regards
>>>>> > >>>>>>> JB
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > Hello,
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we’re now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > The first two such features we’d like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > 1. Variant data types
>>>>> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we’re hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>>> > >>>>>>> >
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > 2. Subcolumnarization
>>>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as a column on their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>>>> > >>>>>>> >
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there’s any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > Thank you!
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>> > >>>>>>> >
>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>>> > >>>>>>> >
>>>>> > >>>>>>>
>>>>> > >>>>>>
>>>>> >
>>>>> > --
>>>>> > Ryan Blue
>>>>> > Databricks
>>>>> >
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>

-- 
Ryan Blue
Databricks
