Re: [DISCUSS] Variant Spec Location

Jack Ye Wed, 14 Aug 2024 19:07:28 -0700

+1 for copying the spec into our repository. I think we need to own it
fully as part of the table spec, and we can build compatibility through
tests.


-Jack

On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I'm not really in favor of linking and annotating, as that just makes
> things more complicated and is still essentially forking, just with more
> steps. If we just track our annotations / modifications to a single
> commit/version, then we have the same issue again, but now you have to go to
> multiple sources to get the actual Spec. *In addition, our own copy of
> the Spec is going to require new types which don't exist in the Spark Spec,
> which necessarily means diverging.* We will need to take up new primitive
> IDs (as noted in my first email).
>
> The other issue I have is that I don't think the Spark Spec is really going
> through a thorough review process from all members of the Spark community.
> I believe it probably should have gone through the SPIP process, but instead
> it seems to have been merged without broad community involvement.
>
> The only way to truly avoid diverging is to have only a single copy of the
> spec. In our previous discussions, the vast majority of the Apache Iceberg
> community wanted it to exist here.
>
> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> I'm really excited about the introduction of variant type to Iceberg, but
>> I want to raise concerns about forking the spec.
>>
>> I feel like preemptively forking would create the situation where we end
>> up diverging because there's little reason to work with both communities to
>> evolve in a way that benefits everyone.
>>
>> I would much rather point to a specific version of the spec and annotate
>> any variance in Iceberg's handling.  This would allow us to continue
>> without dividing the communities.
>>
>> If at any point there are irreconcilable differences, I would support
>> forking, but I don't feel like that should be the initial step.
>>
>> No one is excited about the possibility that the physical representations
>> end up diverging, but it feels like we're setting ourselves up for that
>> exact scenario.
>>
>> -Dan
>>
>>
>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> +1 to what's already being said here. It is good to copy the spec to
>>> Iceberg and add context that's specific to Iceberg, but at the same time,
>>> we should maintain compatibility.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, Aug 14, 2024 at 3:30 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>
>>>> +1 to copy the spec into our repository. I think the best way to keep
>>>> compatibility is building integration tests.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>
>>>>> Given the differences between the supported types and the lack of
>>>>> interest from the other project, I think it is reasonable to duplicate the
>>>>> specification to our repository.
>>>>> I would put very strong emphasis on sticking to the Spark spec as
>>>>> much as possible to preserve compatibility, and maybe even revert to a
>>>>> shared specification if the situation changes.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Tue, Aug 13, 2024 at 7:52 PM Aihua Xu <aihu...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Russell for bringing this up.
>>>>>>
>>>>>> This is the main blocker to moving forward with Variant support in
>>>>>> Iceberg, and hopefully we can reach a consensus. To me, it also makes
>>>>>> more sense to move the spec into Iceberg than to have the Spark engine
>>>>>> own it while we try to stay compatible with the Spark spec.
>>>>>>
>>>>>> Thanks,
>>>>>> Aihua
>>>>>>
>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Y’all,
>>>>>>>
>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While we
>>>>>>> were hoping to move the Variant and Shredding specifications from Spark
>>>>>>> into Iceberg, there doesn’t seem to be a lot of interest in that.
>>>>>>> Unfortunately, I think we have a number of issues with just linking to
>>>>>>> the Spark project directly from within Iceberg, and *I believe we need
>>>>>>> to copy the specifications into our repository*.
>>>>>>>
>>>>>>> There are a few reasons why I think this is necessary:
>>>>>>>
>>>>>>> First, we have a divergence of types already. The Spark
>>>>>>> Specification already includes types which Iceberg has no definition
>>>>>>> for (19, 20
>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>>>>> - Interval Types), and Iceberg already has a type which is not included
>>>>>>> within the Spark Specification (Time) and will soon have more with
>>>>>>> TimestampNS and Geo.
>>>>>>>
>>>>>>> Second, we would like to make sure that Spark is not a hard
>>>>>>> dependency for other engines. We are working with several implementers
>>>>>>> of the Iceberg spec, and it has previously been agreed that it would be
>>>>>>> best if the source of truth for Variant existed in an engine- and file
>>>>>>> format-neutral location. The Iceberg project has a good open model of
>>>>>>> governance and, as we have seen so far discussing Variant
>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>>>>> open and active collaboration. This would also help as we can strictly
>>>>>>> version our changes in line with the rest of the Iceberg spec.
>>>>>>>
>>>>>>> Third, the Shredding spec is not quite finished and requires some
>>>>>>> group analysis and discussion before we commit it. Again, I think the
>>>>>>> Iceberg community is probably the right place for this to happen, as we
>>>>>>> have already started discussions here on these topics.
>>>>>>>
>>>>>>> For these reasons I think we should go with a direct copy of the
>>>>>>> existing specification from the Spark Project and move ahead with our
>>>>>>> discussions and modifications within Iceberg. That said, *I do not
>>>>>>> want to diverge from the Spark proposal if possible*. For example,
>>>>>>> although we do not use the Interval types above, I think we should
>>>>>>> not reuse those type IDs within our spec. In Iceberg's Variant Spec,
>>>>>>> types 19 and 20 would remain unused, along with any other types we think
>>>>>>> are not applicable. We should strive whenever possible to allow for
>>>>>>> compatibility.
>>>>>>>
>>>>>>> In the interest of moving forward with this proposal, I am hoping to
>>>>>>> see if anyone in the community objects to this plan going forward or
>>>>>>> has a better alternative.
>>>>>>>
>>>>>>> As always I am thankful for your time and am eager to hear back from
>>>>>>> everyone,
>>>>>>> Russ
>>>>>>>
>>>>>>>
