Re: [DISCUSS] Variant Spec Location

Daniel Weeks Thu, 15 Aug 2024 14:04:05 -0700

I would agree that Parquet seems like a reasonable option in terms of fit
and neutrality.


I'd love to get any feedback from others, but assuming there's
general consensus, I feel like we need to engage with those communities and
have an open conversation about the discussions we've had and why we feel
this is important to address any governance/neutrality concerns.

Others already mentioned this, but I agree there's added value in that other
projects could benefit from variant, so standardizing at the Parquet level
makes this less opaque to the rest of the ecosystem.

-Dan

On Thu, Aug 15, 2024 at 11:31 AM Russell Spitzer <[email protected]>
wrote:

> I support that whole-heartedly. Parquet would be a great neutral location
> for the spec.
>
> On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue <[email protected]>
> wrote:
>
>> I think it's a good idea to reach out to the Spark community and make
>> sure we are in agreement. Up until now I think we've been thinking more
>> abstractly about what makes sense but before we make any decision we should
>> definitely collaborate with the other communities.
>>
>> I'd also like to suggest an alternative for where this spec should be
>> maintained that would hopefully allow us to avoid copying and maintaining it
>> in multiple places. As we've already discussed, this is not an easy spec to
>> find a home for because there are alternative projects that are all
>> interested. Since this is a cross-engine type, Spark may not be ideal. At
>> the same time, Delta already supports the variant spec so there's a similar
>> problem maintaining this in Iceberg.
>>
>> I think that a reasonable and neutral option is to see if the Parquet
>> community would be willing to host the spec and library. That fits with the
>> spec because subcolumnarization is written assuming Parquet is the storage.
>> It would also be the best place for broad compatibility because anyone
>> using Parquet would have a strong motivation to standardize on the same
>> encoding.
>>
>> Initially, I pushed for Iceberg instead of Parquet because we may want to
>> have the same variant encoding in ORC, but what made me change my mind is
>> that every layer (file format, table format, engine) has that problem and
>> I've heard the concern about neutrality raised multiple times while
>> discussing this question internally.
>>
>> I think the Parquet community is the most neutral option available. Would
>> anyone else support asking the Spark and Parquet communities to maintain
>> the variant spec in Parquet?
>>
>> Ryan
>>
>> On Thu, Aug 15, 2024 at 8:34 AM Xuanwo <[email protected]> wrote:
>>
>>> From the iceberg-rust perspective, it could be extremely challenging to
>>> keep track of both the Spark and Iceberg specifications. Having a single
>>> source of truth would be much better. I believe this change will also
>>> benefit Delta Lake if they implement the same approach. Perhaps we can try
>>> contacting them to initiate such a project?
>>>
>>> On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
>>>
>>> +1 on posting this discussion to dev@spark ML
>>>
>>> > I don't think there is anything that would stop us from moving to a
>>> joint project in the future
>>>
>>> My concern is that if we don't do this from day 1, we will never ever do
>>> this.
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <
>>> [email protected]> wrote:
>>>
>>> That's fair @Micah; so far all the discussions have been direct and off
>>> the dev list. Would you like to make the request on the public Spark Dev
>>> list? I would be glad to co-sign, and I can also draft up a quick email if
>>> you don't have time.
>>>
>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>> > I agree that it would be beneficial to make a sub-project, the main
>>> problem is political and not logistic. I've been asking for movement from
>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>
>>>
>>> I just wanted to double check that these issues were brought directly to
>>> the Spark community (i.e. a discussion thread on the Spark developer
>>> mailing list) and not via backchannels.
>>>
>>> I'm not sure the outcome would be different and I don't think this
>>> should block forking the spec, but we should make sure that the decision is
>>> publicly documented within both communities.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>> [email protected]> wrote:
>>>
>>> @Gang Wu
>>> I agree that it would be beneficial to make a sub-project, the main
>>> problem is political and not logistic. I've been asking for movement from
>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>> I don't think there is anything that would stop us from moving to a joint
>>> project in the future and if you know of some way of encouraging that
>>> movement from other relevant parties I would be glad to collaborate in
>>> doing that. One thing that I don't want to do is have the Iceberg project
>>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>>
>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <[email protected]> wrote:
>>>
>>> I’m on board with copying the spec into our repository. However, as
>>> we’ve talked about, it’s not just a straightforward copy—there are already
>>> some divergences. Some of them are under discussion. Iceberg is definitely
>>> the best place for these specs. Engines like Trino and Flink can then rely
>>> on the Iceberg specs as a solid foundation.
>>>
>>> Yufei
>>>
>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]> wrote:
>>>
>>> Sorry for chiming in late.
>>>
>>> From the discussion in
>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>> don't quite understand why it is logistically complicated to create a
>>> sub-project to hold the variant spec and impl.
>>>
>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>> deficiencies:
>>> - It is a burden to update two repos if there is a variant type spec
>>> change and will likely result in deviation if some changes do not reach
>>> agreement from both parties.
>>> - Implementers are required to keep an eye on both specs (considering
>>> proprietary engines where both Iceberg and Delta are supported).
>>> - Putting the spec and impl of variant type in Iceberg repo does lose
>>> the opportunity for better native support from file formats like Parquet
>>> and ORC.
>>>
>>> I'm not sure if it is possible to create a separate project (e.g.
>>> apache/variant-type) to make it a single point of truth. We can learn from
>>> the experience of Apache Arrow. In this fashion, different engines, table
>>> formats and file formats can follow the same spec and are free to depend on
>>> the reference implementations from apache/variant-type or implement their
>>> own.
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <[email protected]> wrote:
>>>
>>> +1 for copying the spec into our repository, I think we need to own it
>>> fully as a part of the table spec, and we can build compatibility through
>>> tests.
>>>
>>> -Jack
>>>
>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>> [email protected]> wrote:
>>>
>>> I'm not really in favor of linking and annotating as that just makes
>>> things more complicated and still is essentially forking just with more
>>> steps. If we just track our annotations / modifications to a single
>>> commit/version then we have the same issue again but now you have to go to
>>> multiple sources to get the actual Spec. *In addition, our own copy of
>>> the Spec is going to require new types which don't exist in the Spark Spec
>>> which necessarily means diverging.* We will need to take up new
>>> primitive IDs (as noted in my first email).
>>>
>>> The other issue I have is I don't think the Spark Spec is really going
>>> through a thorough review process from all members of the Spark community.
>>> I believe it probably should have gone through the SPIP process but instead
>>> seems to have been merged without broad community involvement.
>>>
>>> The only way to truly avoid diverging is to have only a single copy of
>>> the spec. In our previous discussions, the vast majority of the Apache
>>> Iceberg community wants it to exist here.
>>>
>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <[email protected]> wrote:
>>>
>>> I'm really excited about the introduction of variant type to Iceberg,
>>> but I want to raise concerns about forking the spec.
>>>
>>> I feel like preemptively forking would create the situation where we end
>>> up diverging because there's little reason to work with both communities to
>>> evolve in a way that benefits everyone.
>>>
>>> I would much rather point to a specific version of the spec and annotate
>>> any variance in Iceberg's handling. This would allow us to continue
>>> without dividing the communities.
>>>
>>> If at any point there are irreconcilable differences, I would support
>>> forking, but I don't feel like that should be the initial step.
>>>
>>> No one is excited about the possibility that the physical
>>> representations end up diverging, but it feels like we're setting
>>> ourselves up for that exact scenario.
>>>
>>> -Dan
>>>
>>>
>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <[email protected]>
>>> wrote:
>>>
>>> +1 to what's already being said here. It is good to copy the spec to
>>> Iceberg and add context that's specific to Iceberg, but at the same time,
>>> we should maintain compatibility.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <[email protected]> wrote:
>>>
>>> +1 to copy the spec into our repository. I think the best way to keep
>>> compatibility is building integration tests.
>>>
>>> Thanks,
>>> Manu
>>>
>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <[email protected]>
>>> wrote:
>>>
>>> Thanks Russell and Aihua for pushing Variant support!
>>>
>>> Given the differences between the supported types and the lack of
>>> interest from the other project, I think it is reasonable to duplicate the
>>> specification to our repository.
>>> I would put very strong emphasis on sticking to the Spark spec as much
>>> as possible, to preserve compatibility. We could even revert
>>> to a shared specification if the situation changes.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <[email protected]> wrote:
>>>
>>> Thanks Russell for bringing this up.
>>>
>>> This is the main blocker to moving forward with Variant support in
>>> Iceberg, and hopefully we can reach a consensus. To me, it also makes
>>> more sense to move the spec into Iceberg rather than have the Spark engine
>>> own it while we try to stay compatible with the Spark spec.
>>>
>>> Thanks,
>>> Aihua
>>>
>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>> [email protected]> wrote:
>>>
>>> Hi Y’all,
>>>
>>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were
>>> hoping to move the Variant and Shredding specifications from Spark into
>>> Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
>>> I think we have a number of issues with just linking to the Spark project
>>> directly from within Iceberg and *I believe we need to copy the
>>> specifications into our repository*.
>>>
>>> There are a few reasons why I think this is necessary:
>>>
>>> First, we have a divergence of types already. The Spark Specification
>>> already includes types which Iceberg has no definition for (19, 20
>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>> - Interval Types) and Iceberg already has a type which is not included
>>> within the Spark Specification (Time) and will soon have more with
>>> TimestampNS, and Geo.
>>>
>>> Second, we would like to make sure that Spark is not a hard dependency
>>> for other engines. We are working with several implementers of the Iceberg
>>> spec and it has previously been agreed that it would be best if the source
>>> of truth for Variant existed in an engine and file format neutral location.
>>> The Iceberg project has a good open model of governance and, as we have
>>> seen so far discussing Variant
>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>> open and active collaboration. This would also help as we can strictly
>>> version our changes in-line with the rest of the Iceberg spec.
>>>
>>> Third, the Shredding spec is not quite finished and requires some group
>>> analysis and discussion before we commit it. I think again the Iceberg
>>> community is probably the right place for this to happen as we have already
>>> started discussions here on these topics.
>>>
>>> For these reasons I think we should go with a direct copy of the
>>> existing specification from the Spark Project and move ahead with our
>>> discussions and modifications within Iceberg. That said, *I do not want
>>> to diverge from the Spark proposal if possible*. For example, although
>>> we do not use the Interval types above, I think we should *not* reuse
>>> those type ids within our spec. Iceberg's Variant Spec types 19 and 20
>>> would remain unused along with any other types we think are not applicable.
>>> We should strive whenever possible to allow for compatibility.
>>>
>>> In the interest of moving forward with this proposal I am hoping to see
>>> if anyone in the community objects to this plan going forward or has a
>>> better alternative.
>>>
>>> As always I am thankful for your time and am eager to hear back from
>>> everyone,
>>> Russ
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>
