>From the iceberg-rust perspective, it could be extremely challenging to keep 
>track of both the Spark and Iceberg specifications. Having a single source of 
>truth would be much better. I believe this change will also benefit Delta Lake 
>if they implement the same approach. Perhaps we can try contacting them to 
>initiate such a project?

On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
> +1 on posting this discussion to dev@spark ML
> 
> > I don't think there is anything that would stop us from moving to a joint 
> > project in the future
> 
> My concern is that if we don't do this from day 1, we will never ever do this.
> 
> Best,
> Gang
> 
> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <russell.spit...@gmail.com> 
> wrote:
>> That's fair @Micah, so far all the discussions have been direct and off the 
>> dev list. Would you like to make the request on the public Spark Dev list? I 
>> would be glad to co-sign, I can also draft up a quick email if you don't 
>> have time. 
>> 
>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> 
>> wrote:
>>>> I agree that it would be beneficial to make a sub-project, the main 
>>>> problem is political and not logistic. I've been asking for movement from 
>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>> 
>>> I just wanted to double check that these issues were brought directly to 
>>> the spark community (i.e. a discussion thread on the Spark developer 
>>> mailing list) and not via backchannels. 
>>> 
>>> I'm not sure the outcome would be different and I don't think this should 
>>> block forking the spec, but we should make sure that the decision is 
>>> publicly documented within both communities.
>>> 
>>> Thanks,
>>> Micah
>>> 
>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> 
>>> wrote:
>>>> @Gang Wu
>>>> I agree that it would be beneficial to make a sub-project, the main 
>>>> problem is political and not logistic. I've been asking for movement from 
>>>> other relevant projects for a month and we simply haven't gotten anywhere. 
>>>> I don't think there is anything that would stop us from moving to a joint 
>>>> project in the future and if you know of some way of encouraging that 
>>>> movement from other relevant parties I would be glad to collaborate in 
>>>> doing that. One thing that I don't want to do is have the Iceberg project 
>>>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>>> 
>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>> I’m on board with copying the spec into our repository. However, as we’ve 
>>>>> talked about, it’s not just a straightforward copy—there are already some 
>>>>> divergences. Some of them are under discussion. Iceberg is definitely the 
>>>>> best place for these specs. Engines like Trino and Flink can then rely on 
>>>>> the Iceberg specs as a solid foundation.
>>>>> 
>>>>> Yufei
>>>>> 
>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>> Sorry for chiming in late. 
>>>>>> 
>>>>>> From the discussion in 
>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I 
>>>>>> don't quite understand why it is logistically complicated to create a 
>>>>>> sub-project to hold the variant spec and impl.
>>>>>> 
>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some 
>>>>>> deficiencies:
>>>>>> - It is a burden to update two repos if there is a variant type spec 
>>>>>> change and will likely result in deviation if some changes do not reach 
>>>>>> agreement from both parties.
>>>>>> - Implementers are required to keep an eye on both specs (considering 
>>>>>> proprietary engines where both Iceberg and Delta are supported). 
>>>>>> - Putting the spec and impl of variant type in Iceberg repo does lose 
>>>>>> the opportunity for better native support from file formats like Parquet 
>>>>>> and ORC.
>>>>>> 
>>>>>> I'm not sure if it is possible to create a separate project (e.g. 
>>>>>> apache/variant-type) to make it a single point of truth. We can learn 
>>>>>> from the experience of Apache Arrow. In this fashion, different engines, 
>>>>>> table formats and file formats can follow the same spec and are free to 
>>>>>> depend on the reference implementations from apache/variant-type or 
>>>>>> implement their own.
>>>>>> 
>>>>>> Best,
>>>>>> Gang
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>> +1 for copying the spec into our repository, I think we need to own it 
>>>>>>> fully as a part of the table spec, and we can build compatibility 
>>>>>>> through tests.
>>>>>>> 
>>>>>>> -Jack
>>>>>>> 
>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer 
>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>> I'm not really in favor of linking and annotating as that just makes 
>>>>>>>> things more complicated and still is essentially forking just with 
>>>>>>>> more steps. If we just track our annotations / modifications to a 
>>>>>>>> single commit/version then we have the same issue again but now you 
>>>>>>>> have to go to multiple sources to get the actual Spec. *In addition, 
>>>>>>>> our own copy of the Spec is going to require new types which don't 
>>>>>>>> exist in the Spark Spec which necessarily means diverging.* We will 
>>>>>>>> need to take up new primitive IDs (as noted in my first email).
>>>>>>>> 
>>>>>>>> The other issue I have is I don't think the Spark Spec is really going 
>>>>>>>> through a thorough review process from all members of the Spark 
>>>>>>>> community. I believe it probably should have gone through the SPIP 
>>>>>>>> process but instead seems to have been merged without broad community 
>>>>>>>> involvement.
>>>>>>>> 
>>>>>>>> The only way to truly avoid diverging is to only have a single copy of 
>>>>>>>> the spec. In our previous discussions the vast majority of the Apache 
>>>>>>>> Iceberg community wanted it to exist here.
>>>>>>>> 
>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>>> I'm really excited about the introduction of variant type to Iceberg, 
>>>>>>>>> but I want to raise concerns about forking the spec.
>>>>>>>>> 
>>>>>>>>> I feel like preemptively forking would create the situation where we 
>>>>>>>>> end up diverging because there's little reason to work with both 
>>>>>>>>> communities to evolve in a way that benefits everyone.
>>>>>>>>> 
>>>>>>>>> I would much rather point to a specific version of the spec and 
>>>>>>>>> annotate any variance in Iceberg's handling. This would allow us to 
>>>>>>>>> continue without dividing the communities.
>>>>>>>>> 
>>>>>>>>> If at any point there are irreconcilable differences, I would support 
>>>>>>>>> forking, but I don't feel like that should be the initial step.
>>>>>>>>> 
>>>>>>>>> No one is excited about the possibility that the physical 
>>>>>>>>> representations end up diverging, but it feels like we're setting 
>>>>>>>>> ourselves up for that exact scenario.
>>>>>>>>> 
>>>>>>>>> -Dan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> 
>>>>>>>>> wrote:
>>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to 
>>>>>>>>>> Iceberg and add context that's specific to Iceberg, but at the same 
>>>>>>>>>> time, we should maintain compatibility.
>>>>>>>>>> 
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>> 
>>>>>>>>>> On Wed, Aug 14, 2024 at 3:30 PM Manu Zhang 
>>>>>>>>>> <owenzhang1...@gmail.com> wrote:
>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to 
>>>>>>>>>>> keep compatibility is building integration tests.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Manu
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry 
>>>>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>>>> 
>>>>>>>>>>>> Given the differences between the supported types and the lack of 
>>>>>>>>>>>> interest from the other project, I think it is reasonable to 
>>>>>>>>>>>> duplicate the specification to our repository.
>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as 
>>>>>>>>>>>> much as possible, to keep compatibility as much as possible. Maybe 
>>>>>>>>>>>> even revert to a shared specification if the situation changes.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Peter
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Aug 13, 2024 at 7:52 PM Aihua Xu <aihu...@gmail.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Thanks Russell for bringing this up. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This is the main blocker to moving forward with Variant support 
>>>>>>>>>>>>> in Iceberg, and hopefully we can reach a consensus. I also feel 
>>>>>>>>>>>>> it makes more sense to move the spec into Iceberg than to have 
>>>>>>>>>>>>> the Spark engine own it while we try to stay compatible with 
>>>>>>>>>>>>> the Spark spec.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer 
>>>>>>>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While 
>>>>>>>>>>>>>> we were hoping to move the Variant and Shredding specifications 
>>>>>>>>>>>>>> from Spark into Iceberg, there doesn’t seem to be a lot of 
>>>>>>>>>>>>>> interest in that. Unfortunately, I think we have a number of 
>>>>>>>>>>>>>> issues with just linking to the Spark project directly from 
>>>>>>>>>>>>>> within Iceberg and *I believe we need to copy the specifications 
>>>>>>>>>>>>>> into our repository*.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark 
>>>>>>>>>>>>>> Specification already includes types which Iceberg has no 
>>>>>>>>>>>>>> definition for (19, 20 
>>>>>>>>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>>>>>>>>>>>>  - Interval Types) and Iceberg already has a type which is not 
>>>>>>>>>>>>>> included within the Spark Specification (Time) and will soon 
>>>>>>>>>>>>>> have more with TimestampNS, and Geo.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard 
>>>>>>>>>>>>>> dependency for other engines. We are working with several 
>>>>>>>>>>>>>> implementers of the Iceberg spec and it has previously been 
>>>>>>>>>>>>>> agreed that it would be best if the source of truth for Variant 
>>>>>>>>>>>>>> existed in an engine- and file-format-neutral location. The 
>>>>>>>>>>>>>> Iceberg project has a good open model of governance and, as we 
>>>>>>>>>>>>>> have seen so far discussing Variant 
>>>>>>>>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>>>>>>>>>>>>  open and active collaboration. This would also help as we can 
>>>>>>>>>>>>>> strictly version our changes in line with the rest of the 
>>>>>>>>>>>>>> Iceberg spec.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires 
>>>>>>>>>>>>>> some group analysis and discussion before we commit it. I think 
>>>>>>>>>>>>>> again the Iceberg community is probably the right place for this 
>>>>>>>>>>>>>> to happen as we have already started discussions here on these 
>>>>>>>>>>>>>> topics.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the 
>>>>>>>>>>>>>> existing specification from the Spark Project and move ahead 
>>>>>>>>>>>>>> with our discussions and modifications within Iceberg. That 
>>>>>>>>>>>>>> said, *I do not want to diverge if possible from the Spark 
>>>>>>>>>>>>>> proposal*. For example, although we do not use the Interval 
>>>>>>>>>>>>>> types above, I think we should *not* reuse those type ids within 
>>>>>>>>>>>>>> our spec. Iceberg's Variant Spec types 19 and 20 would remain 
>>>>>>>>>>>>>> unused along with any other types we think are not applicable. 
>>>>>>>>>>>>>> We should strive whenever possible to allow for compatibility.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping 
>>>>>>>>>>>>>> to see if anyone in the community objects to this plan going 
>>>>>>>>>>>>>> forward or has a better alternative. 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back 
>>>>>>>>>>>>>> from everyone,
>>>>>>>>>>>>>> Russ
>>>>>>>>>>>>>> 
Xuanwo

https://xuanwo.io/
