Re: [DISCUSS] Variant Spec Location

Micah Kornfield Thu, 15 Aug 2024 15:14:28 -0700

>
> I think the Parquet community is the most neutral option available. Would
> anyone else support asking the Spark and Parquet communities to maintain
> the variant spec in Parquet?



This makes sense to me.  I'll reiterate that Arrow might be a better
potential home for this for a few different reasons:
1.  At the moment, I think Arrow's type system has slightly wider coverage
(it has interval types, which I think align more closely with Spark).  It
also has a well-defined extension mechanism for types that aren't present
(this is being worked on in Parquet).
2.  It currently has better infrastructure for cross-language testing (this
is something I'm hoping to improve in Parquet).
3.  The shredding standard can be defined in a file-format-neutral way in
Arrow's memory model, which can then be persisted as necessary to columnar
file formats.
4.  If we can mostly converge on Arrow as a type system for table formats, I
think it potentially makes most of the data ecosystem better.

In the end, I don't feel too strongly and am happy to see this hosted in
Parquet.  I'm not exactly sure what the sticking points were in the private
discussions, so I'm not sure how the Spark community would feel about
donating the spec to a third party.  I think on a technical level there
still remains the issue of how formats/engines will handle scalar types not
currently supported (I think in most cases this will be conversion to a
human-readable string?)
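
A minimal sketch of that string fallback, assuming a reader that natively
understands only a fixed set of scalar type names. The type names and the
`read_scalar` helper are hypothetical and are not taken from the Spark,
Iceberg, Parquet or Arrow specifications:

```python
from datetime import timedelta

# Hypothetical set of scalar type names this reader handles natively.
SUPPORTED_TYPES = {"int", "string", "boolean"}


def read_scalar(type_name, value):
    """Return (type_name, value) as-is when supported, else fall back to a string."""
    if type_name in SUPPORTED_TYPES:
        return type_name, value
    # Unsupported scalar: degrade to a human-readable string rather than fail.
    return "string", str(value)


# A supported value passes through; an unsupported interval becomes text.
native = read_scalar("int", 5)
fallback = read_scalar("interval", timedelta(days=1))
```

The obvious cost of this approach is that the original type is lost on read,
so a value written as an interval would not round-trip.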

Thanks,
Micah

On Thu, Aug 15, 2024 at 2:04 PM Daniel Weeks <dwe...@apache.org> wrote:

> I would agree that Parquet seems like a reasonable option in terms of fit
> and neutrality.
>
> I'd love to get any feedback from others, but assuming there's
> general consensus, I feel like we need to engage with those communities and
> have an open conversation about the discussions we've had and why we feel
> this is important to address any governance/neutrality concerns.
>
> Others already mentioned this, but I agree there's added value in that other
> projects could benefit from variant, so standardizing at the Parquet level
> makes this less opaque to the rest of the ecosystem.
>
> -Dan
>
> On Thu, Aug 15, 2024 at 11:31 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> I support that whole-heartedly. Parquet would be a great neutral location
>> for the spec.
>>
>> On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> I think it's a good idea to reach out to the Spark community and make
>>> sure we are in agreement. Up until now I think we've been thinking more
>>> abstractly about what makes sense but before we make any decision we should
>>> definitely collaborate with the other communities.
>>>
>>> I'd also like to suggest an alternative for where this spec should be
>>> maintained that would hopefully allow us to avoid copying and maintaining
>>> it in multiple places. As we've already discussed, this is not an easy spec to
>>> find a home for because there are alternative projects that are all
>>> interested. Since this is a cross-engine type, Spark may not be ideal. At
>>> the same time, Delta already supports the variant spec so there's a similar
>>> problem maintaining this in Iceberg.
>>>
>>> I think that a reasonable and neutral option is to see if the Parquet
>>> community would be willing to host the spec and library. That fits with the
>>> spec because subcolumnarization is written assuming Parquet is the storage.
>>> It would also be the best place for broad compatibility because anyone
>>> using Parquet would have a strong motivation to standardize on the same
>>> encoding.
>>>
>>> Initially, I pushed for Iceberg instead of Parquet because we may want
>>> to have the same variant encoding in ORC, but what made me change my mind
>>> is that every layer (file format, table format, engine) has that problem
>>> and I've heard the concern about neutrality raised multiple times while
>>> discussing this question internally.
>>>
>>> I think the Parquet community is the most neutral option available.
>>> Would anyone else support asking the Spark and Parquet communities to
>>> maintain the variant spec in Parquet?
>>>
>>> Ryan
>>>
>>> On Thu, Aug 15, 2024 at 8:34 AM Xuanwo <xua...@apache.org> wrote:
>>>
>>>> From the iceberg-rust perspective, it could be extremely challenging to
>>>> keep track of both the Spark and Iceberg specifications. Having a single
>>>> source of truth would be much better. I believe this change will also
>>>> benefit Delta Lake if they implement the same approach. Perhaps we can try
>>>> contacting them to initiate such a project?
>>>>
>>>> On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
>>>>
>>>> +1 on posting this discussion to dev@spark ML
>>>>
>>>> > I don't think there is anything that would stop us from moving to a
>>>> joint project in the future
>>>>
>>>> My concern is that if we don't do this from day 1, we will never
>>>> ever do this.
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>> That's fair @Micah; so far all the discussions have been direct and off
>>>> the dev list. Would you like to make the request on the public Spark Dev
>>>> list? I would be glad to co-sign, and I can also draft up a quick email if
>>>> you don't have time.
>>>>
>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>> I agree that it would be beneficial to make a sub-project; the main
>>>> problem is political, not logistical. I've been asking for movement from
>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>>
>>>>
>>>> I just wanted to double check that these issues were brought directly
>>>> to the Spark community (i.e. a discussion thread on the Spark developer
>>>> mailing list) and not via backchannels.
>>>>
>>>> I'm not sure the outcome would be different and I don't think this
>>>> should block forking the spec, but we should make sure that the decision is
>>>> publicly documented within both communities.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>> @Gang Wu
>>>> I agree that it would be beneficial to make a sub-project; the main
>>>> problem is political, not logistical. I've been asking for movement from
>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>> I don't think there is anything that would stop us from moving to a joint
>>>> project in the future and if you know of some way of encouraging that
>>>> movement from other relevant parties I would be glad to collaborate in
>>>> doing that. One thing that I don't want to do is have the Iceberg project
>>>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>>>
>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>> I’m on board with copying the spec into our repository. However, as
>>>> we’ve talked about, it’s not just a straightforward copy—there are already
>>>> some divergences. Some of them are under discussion. Iceberg is definitely
>>>> the best place for these specs. Engines like Trino and Flink can then rely
>>>> on the Iceberg specs as a solid foundation.
>>>>
>>>> Yufei
>>>>
>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>>
>>>> Sorry for chiming in late.
>>>>
>>>> From the discussion in
>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>>> don't quite understand why it is logistically complicated to create a
>>>> sub-project to hold the variant spec and impl.
>>>>
>>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>>> deficiencies:
>>>> - It is a burden to update two repos if there is a variant type spec
>>>> change and will likely result in deviation if some changes do not reach
>>>> agreement from both parties.
>>>> - Implementers are required to keep an eye on both specs (considering
>>>> proprietary engines where both Iceberg and Delta are supported).
>>>> - Putting the spec and impl of variant type in Iceberg repo does lose
>>>> the opportunity for better native support from file formats like Parquet
>>>> and ORC.
>>>>
>>>> I'm not sure if it is possible to create a separate project (e.g.
>>>> apache/variant-type) to make it a single point of truth. We can learn from
>>>> the experience of Apache Arrow. In this fashion, different engines, table
>>>> formats and file formats can follow the same spec and are free to depend on
>>>> the reference implementations from apache/variant-type or implement their
>>>> own.
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>> +1 for copying the spec into our repository, I think we need to own it
>>>> fully as a part of the table spec, and we can build compatibility through
>>>> tests.
>>>>
>>>> -Jack
>>>>
>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>> I'm not really in favor of linking and annotating as that just makes
>>>> things more complicated and still is essentially forking just with more
>>>> steps. If we just track our annotations / modifications to a single
>>>> commit/version then we have the same issue again, but now you have to go to
>>>> multiple sources to get the actual Spec. *In addition, our very copy
>>>> of the Spec is going to require new types which don't exist in the Spark
>>>> Spec, which necessarily means diverging.* We will need to take up new
>>>> primitive IDs (as noted in my first email).
>>>>
>>>> The other issue I have is I don't think the Spark Spec is really going
>>>> through a thorough review process from all members of the Spark community,
>>>> I believe it probably should have gone through the SPIP but instead seems
>>>> to have been merged without broad community involvement.
>>>>
>>>> The only way to truly avoid diverging is to have only a single copy of
>>>> the spec. In our previous discussions, the vast majority of the Apache
>>>> Iceberg community wanted it to exist here.
>>>>
>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>> I'm really excited about the introduction of variant type to Iceberg,
>>>> but I want to raise concerns about forking the spec.
>>>>
>>>> I feel like preemptively forking would create the situation where we
>>>> end up diverging because there's little reason to work with both
>>>> communities to evolve in a way that benefits everyone.
>>>>
>>>> I would much rather point to a specific version of the spec and
>>>> annotate any variance in Iceberg's handling.  This would allow us to
>>>> continue without dividing the communities.
>>>>
>>>> If at any point there are irreconcilable differences, I would support
>>>> forking, but I don't feel like that should be the initial step.
>>>>
>>>> No one is excited about the possibility that the physical
>>>> representations end up diverging, but it feels like we're setting
>>>> ourselves up for that exact scenario.
>>>>
>>>> -Dan
>>>>
>>>>
>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>> +1 to what's already being said here. It is good to copy the spec to
>>>> Iceberg and add context that's specific to Iceberg, but at the same time,
>>>> we should maintain compatibility.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Wed, Aug 14, 2024 at 15:30 Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>> +1 to copy the spec into our repository. I think the best way to keep
>>>> compatibility is building integration tests.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>
>>>> Given the differences between the supported types and the lack of
>>>> interest from the other project, I think it is reasonable to duplicate the
>>>> specification to our repository.
>>>> I would give very strong emphasis on sticking to the Spark spec as much
>>>> as possible, to keep compatibility as much as possible. Maybe even revert
>>>> to a shared specification if the situation changes.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> On Tue, Aug 13, 2024 at 19:52 Aihua Xu <aihu...@gmail.com> wrote:
>>>>
>>>> Thanks Russell for bringing this up.
>>>>
>>>> This is the main blocker to moving forward with Variant support in
>>>> Iceberg and hopefully we can reach a consensus. To me, it also makes
>>>> more sense to move the spec into Iceberg than to have the Spark engine own
>>>> it while we try to stay compatible with the Spark spec.
>>>>
>>>> Thanks,
>>>> Aihua
>>>>
>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>> Hi Y’all,
>>>>
>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were
>>>> hoping to move the Variant and Shredding specifications from Spark into
>>>> Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
>>>> I think we have a number of issues with just linking to the Spark project
>>>> directly from within Iceberg and *I believe we need to copy the
>>>> specifications into our repository*.
>>>>
>>>> There are a few reasons why I think this is necessary:
>>>>
>>>> First, we have a divergence of types already. The Spark Specification
>>>> already includes types which Iceberg has no definition for (19, 20
>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>> - Interval Types) and Iceberg already has a type which is not included
>>>> within the Spark Specification (Time) and will soon have more with
>>>> TimestampNS, and Geo.
>>>>
>>>> Second, we would like to make sure that Spark is not a hard dependency
>>>> for other engines. We are working with several implementers of the Iceberg
>>>> spec and it has previously been agreed that it would be best if the source
>>>> of truth for Variant existed in an engine- and file-format-neutral location.
>>>> The Iceberg project has a good open model of governance and, as we have
>>>> seen so far discussing Variant
>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>> open and active collaboration. This would also help as we can strictly
>>>> version our changes in-line with the rest of the Iceberg spec.
>>>>
>>>> Third, the Shredding spec is not quite finished and requires some group
>>>> analysis and discussion before we commit it. I think again the Iceberg
>>>> community is probably the right place for this to happen as we have already
>>>> started discussions here on these topics.
>>>>
>>>> For these reasons I think we should go with a direct copy of the
>>>> existing specification from the Spark Project and move ahead with our
>>>> discussions and modifications within Iceberg. That said, *I do not
>>>> want to diverge if possible from the Spark proposal*. For example,
>>>> although we do not use the Interval types above, I think we should
>>>> *not* reuse those type ids within our spec. Iceberg's Variant Spec
>>>> types 19 and 20 would remain unused along with any other types we think are
>>>> not applicable. We should strive whenever possible to allow for
>>>> compatibility.
>>>>
>>>> In the interest of moving forward with this proposal I am hoping to see
>>>> if anyone in the community objects to this plan going forward or has a
>>>> better alternative.
>>>>
>>>> As always I am thankful for your time and am eager to hear back from
>>>> everyone,
>>>> Russ
>>>>
>>>> Xuanwo
>>>>
>>>> https://xuanwo.io/
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>
