Re: [DISCUSS] Variant Spec Location

Gene Pang Fri, 16 Aug 2024 12:00:53 -0700

I think Parquet might be a better home than Arrow. Ryan already brought up
interesting points, especially with all of the storage-related details and
discussions, like shredding.


Another aspect to this is that while working on Variant, we had ideas of
adding a Variant logical type to Parquet. We thought Variant could make a
lot of sense as a Parquet type, to make it more convenient to implement
Variant in other projects. We never got that started since we were busy
building out the features and library, but maybe if there are more people
from the various communities, that could become a reality sooner.

Thanks,
Gene

On Fri, Aug 16, 2024 at 11:18 AM Ryan Blue <b...@databricks.com.invalid>
wrote:

> I think Parquet is a better place for the variant spec than Arrow. Parquet
> is upstream of nearly every project (other than ORC) so it is a good place
> to standardize and facilitate discussions across communities. There are
> also existing relationships and connections to the Parquet community
> because of its widespread use. Arrow is not used as the in-memory model for
> Spark, Trino, and others, so those existing relationships aren't there. I
> also worry that differences in approaches would make it difficult later on.
>
> This is a fairly specific and orthogonal spec for the in-memory portion so
> I don't see much value in maintaining it in the Arrow community. The main
> place where it is evolving right now (besides adding type IDs) is in
> shredding and storage.
>
>
> On Fri, Aug 16, 2024 at 9:59 AM Weston Pace <weston.p...@gmail.com> wrote:
>
>> +1 to using Arrow to house the spec.  In the interest of expediency I
>> wonder if we could even store it there "on the side" while we figure out
>> how to integrate the variant data type with Arrow.
>>
>> I have a question for those more familiar with the variant spec.  Do we
>> think it could be introduced as a canonical extension type?  My thinking
>> right now is "no" because there is no storage type that has the concept of
>> a "metadata buffer" unless you count dictionary arrays but the metadata
>> buffer in dictionary arrays already has a specific meaning and so
>> attempting to smuggle variant metadata into it would be odd.  You could
>> also use a "variable length binary" with one extra row but that doesn't
>> work because then you end up with more rows.  You could very awkwardly
>> create a variable length binary array where the first item in the array is
>> the metadata buffer + the first variant value but that seems like
>> stretching things.
>>
>> If it's not an extension array then it will need to be a new first-class
>> data type / layout and that will take some work & time to achieve support.
>> However, I think that may be a prerequisite in some ways anyway.  I believe
>> (at least some) Iceberg implementations are using Arrow internally (I could
>> be mistaken)?  If that is so then it seems like implementing the variant
>> type into Arrow is an inevitability.
>>
>> On Fri, Aug 16, 2024 at 9:48 AM Gene Pang <gene.p...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am one of the main developers for Variant in Apache Spark. David
>>> Cashman (another one of the main Variant developers) and I have been
>>> working on Variant in Spark for a while, and we are excited by the interest
>>> from the Iceberg community!
>>>
>>> We have attended some of the Iceberg dev Variant meetings, with Russell
>>> and several others here, and it has been a great experience in sharing
>>> ideas and getting feedback, especially for the shredding scheme. Maybe
>>> there was some sort of miscommunication, but in those meetings we mentioned
>>> that we wanted to move the specification and library out to some other
>>> location, where it would be easier for other projects to use and
>>> collaborate on. I don't know how the impression arose that we are not
>>> interested in moving them, but I want to correct that misconception and
>>> make clear that our goal is still to host the spec and implementation
>>> elsewhere. I haven't heard anything since those meetings, but we have also
>>> been trying to figure out some candidate homes.
>>>
>>> There have been good suggestions and ideas in this thread, and they will
>>> be considered as we also coordinate with the Spark and Delta communities,
>>> since those projects are already depending on the existing library. I
>>> appreciate all the ideas for a potential new home for the project!
>>>
>>> Thanks,
>>> Gene
>>>
>>> On Thu, Aug 15, 2024 at 4:18 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>>> That's fair @Micah, so far all the discussions have been direct and off
>>>>> the dev list. Would you like to make the request on the public Spark Dev
>>>>> list? I would be glad to co-sign, I can also draft up a quick email if you
>>>>> don't have time.
>>>>
>>>>
>>>> Once we come to consensus, if you have bandwidth, I think the
>>>> message might be better coming from you, as you have more context on some
>>>> of the non-public conversations, the requirements from an Iceberg
>>>> perspective on governance, and the blockers that were encountered.  If
>>>> details on the conversations can't be shared (i.e. we are starting from
>>>> scratch), it seems like suggesting a new project via SPIP might be the way
>>>> forward.  I'm happy to help with that if it is useful but I would guess
>>>> Aihua or Tyler might be in a better place to start as it seems they have
>>>> done more serious thinking here.
>>>>
>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to
>>>> help support the effort in those communities.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> That's fair @Micah, so far all the discussions have been direct and off
>>>>> the dev list. Would you like to make the request on the public Spark Dev
>>>>> list? I would be glad to co-sign, I can also draft up a quick email if you
>>>>> don't have time.
>>>>>
>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>>>>> emkornfi...@gmail.com> wrote:
>>>>>
>>>>>>> I agree that it would be beneficial to make a sub-project, the main
>>>>>>> problem is political and not logistical. I've been asking for movement
>>>>>>> from other relevant projects for a month and we simply haven't gotten
>>>>>>> anywhere.
>>>>>>
>>>>>>
>>>>>> I just wanted to double check that these issues were brought directly
>>>>>> to the Spark community (i.e. a discussion thread on the Spark developer
>>>>>> mailing list) and not via backchannels.
>>>>>>
>>>>>> I'm not sure the outcome would be different and I don't think this
>>>>>> should block forking the spec, but we should make sure that the decision
>>>>>> is publicly documented within both communities.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> @Gang Wu
>>>>>>>
>>>>>>> I agree that it would be beneficial to make a sub-project, the main
>>>>>>> problem is political and not logistical. I've been asking for movement
>>>>>>> from other relevant projects for a month and we simply haven't gotten
>>>>>>> anywhere. I don't think there is anything that would stop us from moving
>>>>>>> to a joint project in the future and if you know of some way of
>>>>>>> encouraging that movement from other relevant parties I would be glad to
>>>>>>> collaborate in doing that. One thing that I don't want to do is have the
>>>>>>> Iceberg project stay in a holding pattern without any clear roadmap as
>>>>>>> to how to proceed.
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I’m on board with copying the spec into our repository. However, as
>>>>>>>> we’ve talked about, it’s not just a straightforward copy; there are
>>>>>>>> already some divergences. Some of them are under discussion. Iceberg
>>>>>>>> is definitely the best place for these specs. Engines like Trino and
>>>>>>>> Flink can then rely on the Iceberg specs as a solid foundation.
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sorry for chiming in late.
>>>>>>>>>
>>>>>>>>> From the discussion in
>>>>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq,
>>>>>>>>> I don't quite understand why it is logistically complicated to create
>>>>>>>>> a sub-project to hold the variant spec and impl.
>>>>>>>>>
>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>>>>>>>> deficiencies:
>>>>>>>>> - It is a burden to update two repos if there is a variant type
>>>>>>>>> spec change and will likely result in deviation if some changes do not
>>>>>>>>> reach agreement from both parties.
>>>>>>>>> - Implementers are required to keep an eye on both specs
>>>>>>>>> (considering proprietary engines where both Iceberg and Delta are
>>>>>>>>> supported).
>>>>>>>>> - Putting the spec and impl of variant type in the Iceberg repo loses
>>>>>>>>> the opportunity for better native support from file formats like
>>>>>>>>> Parquet and ORC.
>>>>>>>>>
>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g.
>>>>>>>>> apache/variant-type) to make it a single point of truth. We can learn
>>>>>>>>> from the experience of Apache Arrow. In this fashion, different
>>>>>>>>> engines, table formats and file formats can follow the same spec and
>>>>>>>>> are free to depend on the reference implementations from
>>>>>>>>> apache/variant-type or implement their own.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 for copying the spec into our repository. I think we need to
>>>>>>>>>> own it fully as a part of the table spec, and we can build
>>>>>>>>>> compatibility through tests.
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm not really in favor of linking and annotating as that just
>>>>>>>>>>> makes things more complicated and still is essentially forking, just
>>>>>>>>>>> with more steps. If we just track our annotations / modifications to
>>>>>>>>>>> a single commit/version then we have the same issue again but now you
>>>>>>>>>>> have to go to multiple sources to get the actual Spec. *In addition,
>>>>>>>>>>> our very copy of the Spec is going to require new types which don't
>>>>>>>>>>> exist in the Spark Spec which necessarily means diverging.* We will
>>>>>>>>>>> need to take up new primitive IDs (as noted in my first email).
>>>>>>>>>>>
>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is really
>>>>>>>>>>> going through a thorough review process from all members of the
>>>>>>>>>>> Spark community. I believe it probably should have gone through the
>>>>>>>>>>> SPIP process but instead seems to have been merged without broad
>>>>>>>>>>> community involvement.
>>>>>>>>>>>
>>>>>>>>>>> The only way to truly avoid diverging is to only have a single
>>>>>>>>>>> copy of the spec. In our previous discussions the vast majority of
>>>>>>>>>>> the Apache Iceberg community wanted it to exist here.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm really excited about the introduction of variant type to
>>>>>>>>>>>> Iceberg, but I want to raise concerns about forking the spec.
>>>>>>>>>>>>
>>>>>>>>>>>> I feel like preemptively forking would create the situation
>>>>>>>>>>>> where we end up diverging because there's little reason to work
>>>>>>>>>>>> with both communities to evolve in a way that benefits everyone.
>>>>>>>>>>>>
>>>>>>>>>>>> I would much rather point to a specific version of the spec and
>>>>>>>>>>>> annotate any variance in Iceberg's handling.  This would allow us
>>>>>>>>>>>> to continue without dividing the communities.
>>>>>>>>>>>>
>>>>>>>>>>>> If at any point there are irreconcilable differences, I would
>>>>>>>>>>>> support forking, but I don't feel like that should be the initial
>>>>>>>>>>>> step.
>>>>>>>>>>>>
>>>>>>>>>>>> No one is excited about the possibility that the physical
>>>>>>>>>>>> representations end up diverging, but it feels like we're setting
>>>>>>>>>>>> ourselves up for that exact scenario.
>>>>>>>>>>>>
>>>>>>>>>>>> -Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the
>>>>>>>>>>>>> spec to Iceberg and add context that's specific to Iceberg, but at
>>>>>>>>>>>>> the same time, we should maintain compatibility.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>
>>>>>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang <
>>>>>>>>>>>>> owenzhang1...@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way
>>>>>>>>>>>>>> to keep compatibility is building integration tests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Manu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given the differences between the supported types and the
>>>>>>>>>>>>>>> lack of interest from the other project, I think it is
>>>>>>>>>>>>>>> reasonable to duplicate the specification in our repository.
>>>>>>>>>>>>>>> I would put very strong emphasis on sticking to the Spark
>>>>>>>>>>>>>>> spec as much as possible, to keep compatibility as much as
>>>>>>>>>>>>>>> possible. Maybe even revert to a shared specification if the
>>>>>>>>>>>>>>> situation changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Aihua Xu <aihu...@gmail.com> ezt írta (időpont: 2024. aug.
>>>>>>>>>>>>>>> 13., K, 19:52):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is the main blocker to moving forward with the Variant
>>>>>>>>>>>>>>>> support in Iceberg and hopefully we can reach a consensus. To
>>>>>>>>>>>>>>>> me, it also makes more sense to move the spec into Iceberg
>>>>>>>>>>>>>>>> rather than have the Spark engine own it while we try to stay
>>>>>>>>>>>>>>>> compatible with the Spark spec.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal.
>>>>>>>>>>>>>>>>> While we were hoping to move the Variant and Shredding
>>>>>>>>>>>>>>>>> specifications from Spark into Iceberg, there doesn’t seem to
>>>>>>>>>>>>>>>>> be a lot of interest in that. Unfortunately, I think we have a
>>>>>>>>>>>>>>>>> number of issues with just linking to the Spark project
>>>>>>>>>>>>>>>>> directly from within Iceberg and *I believe we need to copy
>>>>>>>>>>>>>>>>> the specifications into our repository*.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark
>>>>>>>>>>>>>>>>> Specification already includes types which Iceberg has no
>>>>>>>>>>>>>>>>> definition for (19, 20
>>>>>>>>>>>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>>>>>>>>>>>>>>> - Interval Types) and Iceberg already has a type which is not
>>>>>>>>>>>>>>>>> included within the Spark Specification (Time) and will soon
>>>>>>>>>>>>>>>>> have more with TimestampNS and Geo.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard
>>>>>>>>>>>>>>>>> dependency for other engines. We are working with several
>>>>>>>>>>>>>>>>> implementers of the Iceberg spec and it has previously been
>>>>>>>>>>>>>>>>> agreed that it would be best if the source of truth for
>>>>>>>>>>>>>>>>> Variant existed in an engine- and file-format-neutral
>>>>>>>>>>>>>>>>> location. The Iceberg project has a good open model of
>>>>>>>>>>>>>>>>> governance and, as we have seen so far discussing Variant
>>>>>>>>>>>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>>>>>>>>>>>>>>> open and active collaboration. This would also help as we can
>>>>>>>>>>>>>>>>> strictly version our changes in line with the rest of the
>>>>>>>>>>>>>>>>> Iceberg spec.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires
>>>>>>>>>>>>>>>>> some group analysis and discussion before we commit it. I
>>>>>>>>>>>>>>>>> think again the Iceberg community is probably the right place
>>>>>>>>>>>>>>>>> for this to happen as we have already started discussions here
>>>>>>>>>>>>>>>>> on these topics.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of
>>>>>>>>>>>>>>>>> the existing specification from the Spark Project and move
>>>>>>>>>>>>>>>>> ahead with our discussions and modifications within Iceberg.
>>>>>>>>>>>>>>>>> That said, *I do not want to diverge if possible from the
>>>>>>>>>>>>>>>>> Spark proposal*. For example, although we do not use the
>>>>>>>>>>>>>>>>> Interval types above, I think we should not reuse those type
>>>>>>>>>>>>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20
>>>>>>>>>>>>>>>>> would remain unused along with any other types we think are
>>>>>>>>>>>>>>>>> not applicable. We should strive whenever possible to allow
>>>>>>>>>>>>>>>>> for compatibility.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am
>>>>>>>>>>>>>>>>> hoping to see if anyone in the community objects to this plan
>>>>>>>>>>>>>>>>> going forward or has a better alternative.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear
>>>>>>>>>>>>>>>>> back from everyone,
>>>>>>>>>>>>>>>>> Russ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>
> --
> Ryan Blue
> Databricks
>
