From the iceberg-rust perspective, it could be extremely challenging to keep
track of both the Spark and Iceberg specifications. Having a single source of
truth would be much better. I believe this change will also benefit Delta Lake
if they implement the same approach. Perhaps we can try contacting them to
initiate such a project?
On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
> +1 on posting this discussion to dev@spark ML
>
> > I don't think there is anything that would stop us from moving to a joint
> > project in the future
>
> My concern is that if we don't do this from day 1, we will never ever do this.
>
> Best,
> Gang
>
> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>> That's fair @Micah, so far all the discussions have been direct and off the
>> dev list. Would you like to make the request on the public Spark Dev list? I
>> would be glad to co-sign, I can also draft up a quick email if you don't
>> have time.
>>
>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>>> I agree that it would be beneficial to make a sub-project, the main
>>>> problem is political and not logistic. I've been asking for movement from
>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>
>>> I just wanted to double check that these issues were brought directly to
>>> the spark community (i.e. a discussion thread on the Spark developer
>>> mailing list) and not via backchannels.
>>>
>>> I'm not sure the outcome would be different and I don't think this should
>>> block forking the spec, but we should make sure that the decision is
>>> publicly documented within both communities.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com>
>>> wrote:
>>>> @Gang Wu
>>>> I agree that it would be beneficial to make a sub-project, the main
>>>> problem is political and not logistic. I've been asking for movement from
>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>> I don't think there is anything that would stop us from moving to a joint
>>>> project in the future and if you know of some way of encouraging that
>>>> movement from other relevant parties I would be glad to collaborate in
>>>> doing that.
>>>> One thing that I don't want to do is have the Iceberg project
>>>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>>>
>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>> I’m on board with copying the spec into our repository. However, as we’ve
>>>>> talked about, it’s not just a straightforward copy; there are already some
>>>>> divergences. Some of them are under discussion. Iceberg is definitely the
>>>>> best place for these specs. Engines like Trino and Flink can then rely on
>>>>> the Iceberg specs as a solid foundation.
>>>>>
>>>>> Yufei
>>>>>
>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>> Sorry for chiming in late.
>>>>>>
>>>>>> From the discussion in
>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>>>>> don't quite understand why it is logistically complicated to create a
>>>>>> sub-project to hold the variant spec and impl.
>>>>>>
>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>>>>> deficiencies:
>>>>>> - It is a burden to update two repos if there is a variant type spec
>>>>>> change and will likely result in deviation if some changes do not reach
>>>>>> agreement from both parties.
>>>>>> - Implementers are required to keep an eye on both specs (considering
>>>>>> proprietary engines where both Iceberg and Delta are supported).
>>>>>> - Putting the spec and impl of variant type in Iceberg repo does lose
>>>>>> the opportunity for better native support from file formats like Parquet
>>>>>> and ORC.
>>>>>>
>>>>>> I'm not sure if it is possible to create a separate project (e.g.
>>>>>> apache/variant-type) to make it a single point of truth. We can learn
>>>>>> from the experience of Apache Arrow. In this fashion, different engines,
>>>>>> table formats and file formats can follow the same spec and are free to
>>>>>> depend on the reference implementations from apache/variant-type or
>>>>>> implement their own.
>>>>>>
>>>>>> Best,
>>>>>> Gang
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>> +1 for copying the spec into our repository, I think we need to own it
>>>>>>> fully as a part of the table spec, and we can build compatibility
>>>>>>> through tests.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer
>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>> I'm not really in favor of linking and annotating as that just makes
>>>>>>>> things more complicated and still is essentially forking just with
>>>>>>>> more steps. If we just track our annotations / modifications to a
>>>>>>>> single commit/version then we have the same issue again but now you
>>>>>>>> have to go to multiple sources to get the actual Spec. *In addition,
>>>>>>>> our very copy of the Spec is going to require new types which don't
>>>>>>>> exist in the Spark Spec which necessarily means diverging.* We will
>>>>>>>> need to take up new primitive id's (as noted in my first email)
>>>>>>>>
>>>>>>>> The other issue I have is I don't think the Spark Spec is really going
>>>>>>>> through a thorough review process from all members of the Spark
>>>>>>>> community, I believe it probably should have gone through the SPIP but
>>>>>>>> instead seems to have been merged without broad community involvement.
>>>>>>>>
>>>>>>>> The only way to truly avoid diverging is to only have a single copy of
>>>>>>>> the spec, in our previous discussions the vast majority of Apache
>>>>>>>> Iceberg community want it to exist here.
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>>> I'm really excited about the introduction of variant type to Iceberg,
>>>>>>>>> but I want to raise concerns about forking the spec.
>>>>>>>>>
>>>>>>>>> I feel like preemptively forking would create the situation where we
>>>>>>>>> end up diverging because there's little reason to work with both
>>>>>>>>> communities to evolve in a way that benefits everyone.
>>>>>>>>>
>>>>>>>>> I would much rather point to a specific version of the spec and
>>>>>>>>> annotate any variance in Iceberg's handling. This would allow us to
>>>>>>>>> continue without dividing the communities.
>>>>>>>>>
>>>>>>>>> If at any point there are irreconcilable differences, I would support
>>>>>>>>> forking, but I don't feel like that should be the initial step.
>>>>>>>>>
>>>>>>>>> No one is excited about the possibility that the physical
>>>>>>>>> representations end up diverging, but it feels like we're setting
>>>>>>>>> ourselves up for that exact scenario.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to
>>>>>>>>>> Iceberg and add context that's specific to Iceberg, but at the same
>>>>>>>>>> time, we should maintain compatibility.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang
>>>>>>>>>> <owenzhang1...@gmail.com> wrote:
>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to
>>>>>>>>>>> keep compatibility is building integration tests.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Manu
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry
>>>>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>>>>
>>>>>>>>>>>> Given the differences between the supported types and the lack of
>>>>>>>>>>>> interest from the other project, I think it is reasonable to
>>>>>>>>>>>> duplicate the specification to our repository.
>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as
>>>>>>>>>>>> much as possible, to keep compatibility as much as possible. Maybe
>>>>>>>>>>>> even revert to a shared specification if the situation changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the main blocker to move forward with the Variant support
>>>>>>>>>>>>> in Iceberg and hopefully we can have a consensus. To me, I also
>>>>>>>>>>>>> feel it makes more sense to move the spec into Iceberg rather
>>>>>>>>>>>>> than Spark engine owns it and we try to keep it compatible with
>>>>>>>>>>>>> Spark spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer
>>>>>>>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, while
>>>>>>>>>>>>>> we were hoping to move the Variant and Shredding specifications
>>>>>>>>>>>>>> from Spark into Iceberg there doesn’t seem to be a lot of
>>>>>>>>>>>>>> interest in that. Unfortunately, I think we have a number of
>>>>>>>>>>>>>> issues with just linking to the Spark project directly from
>>>>>>>>>>>>>> within Iceberg and *I believe we need to copy the specifications
>>>>>>>>>>>>>> into our repository*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are a few reasons why I think this is necessary
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First, we have a divergence of types already.
>>>>>>>>>>>>>> The Spark
>>>>>>>>>>>>>> Specification already includes types which Iceberg has no
>>>>>>>>>>>>>> definition for (19, 20
>>>>>>>>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>>>>>>>>>>>> - Interval Types) and Iceberg already has a type which is not
>>>>>>>>>>>>>> included within the Spark Specification (Time) and will soon
>>>>>>>>>>>>>> have more with TimestampNS, and Geo.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard
>>>>>>>>>>>>>> dependency for other engines. We are working with several
>>>>>>>>>>>>>> implementers of the Iceberg spec and it has previously been
>>>>>>>>>>>>>> agreed that it would be best if the source of truth for Variant
>>>>>>>>>>>>>> existed in an engine and file format neutral location. The
>>>>>>>>>>>>>> Iceberg project has a good open model of governance and, as we
>>>>>>>>>>>>>> have seen so far discussing Variant
>>>>>>>>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>>>>>>>>>>>> open and active collaboration. This would also help as we can
>>>>>>>>>>>>>> strictly version our changes in-line with the rest of the
>>>>>>>>>>>>>> Iceberg spec.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires
>>>>>>>>>>>>>> some group analysis and discussion before we commit it. I think
>>>>>>>>>>>>>> again the Iceberg community is probably the right place for this
>>>>>>>>>>>>>> to happen as we have already started discussions here on these
>>>>>>>>>>>>>> topics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the
>>>>>>>>>>>>>> existing specification from the Spark Project and move ahead
>>>>>>>>>>>>>> with our discussions and modifications within Iceberg. That
>>>>>>>>>>>>>> said, *I do not want to diverge if possible from the Spark
>>>>>>>>>>>>>> proposal*.
>>>>>>>>>>>>>> For example, although we do not use the Interval
>>>>>>>>>>>>>> types above, I think we should *not* reuse those type ids within
>>>>>>>>>>>>>> our spec. Iceberg's Variant Spec types 19 and 20 would remain
>>>>>>>>>>>>>> unused along with any other types we think are not applicable.
>>>>>>>>>>>>>> We should strive whenever possible to allow for compatibility.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping
>>>>>>>>>>>>>> to see if anyone in the community objects to this plan going
>>>>>>>>>>>>>> forward or has a better alternative.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back
>>>>>>>>>>>>>> from everyone,
>>>>>>>>>>>>>> Russ
>>>>>>>>>>>>>>
Xuanwo
https://xuanwo.io/