Re: [DISCUSS] Variant Spec Location

Gang Wu Thu, 22 Aug 2024 07:12:41 -0700

Thanks Fokko for providing the discussion from dev@spark!

Happy to see consensus from the creators and looking forward to the next
step!


Best,
Gang

On Thu, Aug 22, 2024 at 4:12 PM Fokko Driesprong <[email protected]> wrote:

> Removing the Arrow dev-list from the CC since that's not helpful at this
> point.
>
> This thread focuses on: Should we fork the spec into Iceberg, or are we
> okay with having this inside a different project? Spark is not preferred,
> so Parquet and Arrow are suggested as alternatives. Reading the thread, my
> take is that it is okay if we need to fork the spec due to
> incompatibilities, but the preference is to not fork it to avoid divergence
> (why fork otherwise).
>
> I don't think it is up to the Iceberg community to decide where the spec
> should be, but that's up to the original creators. That thread
> <https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj> is
> happening on the Spark devlist since that's where the variant spec comes
> from.
>
> My take is that it doesn't make much sense to land this in Arrow; I would
> much rather host it at Parquet. One of the improvements as
> part of the proposal
> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
> is to even bring it closer to Parquet.
>
> Kind regards,
> Fokko
>
>
> On Thu, 22 Aug 2024 at 09:51, Antoine Pitrou <[email protected]> wrote:
>
>>
>> Hi Gang,
>>
>> Sorry, but can you give a pointer to the start of this discussion thread
>> in a readable format (for example a mailing-list archive)? It appears
>> that dev@arrow wasn't cc'ed from the start and that can make it
>> difficult to understand what this is about.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 22/08/2024 at 08:32, Gang Wu wrote:
>> > It seems that we have reached a consensus to some extent that there
>> > should be a new home for the variant spec. The pending question
>> > is whether Parquet or Arrow is a better choice. As a committer from
>> Arrow,
>> > Parquet and ORC communities, I am neutral on the choice and happy to
>> > help with the movement once a decision has been made.
>> >
>> > Should we start a vote to move forward?
>> >
>> > Best,
>> > Gang
>> >
>> > On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <[email protected]>
>> > wrote:
>> >
>> >>>
>> >>> That being said, I think the most important consideration for now is
>> >> where
>> >>> are the current maintainers / contributors to the variant type. If
>> most
>> >> of
>> >>> them are already PMC members / committers on a project, it becomes a
>> bit
>> >>> easier. Otherwise if there isn't much overlap with a project's
>> existing
>> >>> governance, I worry there could be a bit of friction. How many active
>> >>> contributors are there from Iceberg? And how about from Arrow?
>> >>
>> >>
>> >> I think this is the key question. What are the requirements around
>> >> governance?  I've seen some tangential messaging here but I'm not
>> clear on
>> >> what everyone expects.
>> >>
>> >> I think for a lot of the other concerns my view is that the exact
>> project
>> >> does not really matter (and choosing a project with mature cross
>> language
>> >> testing infrastructure or committing to building it is critical). IIUC
>> we
>> >> are talking about following artifacts:
>> >>
>> >> 1.  A stand alone specification document (this can be hosted anyplace)
>> >> 2.  A set of language bindings with minimal dependencies that can be
>> consumed
>> >> downstream (again, as long as dependencies are managed carefully any
>> >> project can host these)
>> >> 3.  Potential integration where appropriate into file format libraries
>> to
>> >> support shredding (but as of now this is being bypassed by using
>> >> conventions anyways).  My impression is that at least for Parquet
>> there has
>> >> been a proliferation of vectorized readers across different projects,
>> so
>> >> I'm not clear how much standardization in parquet-java could help here.
>> >>
>> >> To respond to some other questions:
>> >>
>> >>> Arrow is not used as Spark's in-memory model, nor Trino and others so
>> those
>> >>> existing relationships aren't there. I also worry that differences in
>> >>> approaches would make it difficult later on.
>> >>
>> >>
>> >> While Arrow is not in the core memory model, for Spark I believe it is
>> >> still used for IPC for things like Java<->Python. Trino also consumes
>> Arrow
>> >> libraries today to support things like Snowflake/Bigquery federation.
>> But I
>> >> think this is minor because as mentioned above I think the functional
>> >> libraries would be relatively stand-alone.
>> >>
>> >>> Do we think it could be introduced as a canonical extension arrow type?
>> >>
>> >>
>> >>   I believe it can be, I think there are probably different layouts
>> that can
>> >> be supported:
>> >>
>> >> 1.  A struct with two variable width bytes columns (metadata and value
>> data
>> >> are stored separately and each entry has a 1:1 relationship).
>> >> 2.  Shredded (shredded according to the same convention as parquet), I
>> >> would need to double check but I don't think Arrow would have problems
>> here
>> >> but REE would likely be required to make this efficient (i.e. sparse
>> value
>> >> support is important).
>> >>
>> >> In both cases the main complexity is providing the necessary functions
>> for
>> >> manipulation.
>> >>
>> >> Thanks,
>> >> Micah
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <[email protected]>
>> >> wrote:
>> >>
>> >>> In being more engine and format agnostic, I agree the Arrow project
>> might
>> >>> be a good host for such a specification. It seems like we want to move
>> >> away
>> >>> from hosting in Spark to make it engine agnostic. But moving into
>> Iceberg
>> >>> might make it less format agnostic, as I understand multiple formats
>> >> might
>> >>> want to implement this. I'm not intimately familiar with the state of
>> >> this,
>> >>> but I believe Delta Lake would like to be aligned with the same
>> format as
>> >>> Iceberg. In addition, the Lance format (which I work on), will
>> eventually
>> >>> be interesting as well. It seems equally bad to me to attach this
>> >>> specification to a particular table format as it does a particular
>> query
>> >>> engine.
>> >>>
>> >>> That being said, I think the most important consideration for now is
>> >> where
>> >>> are the current maintainers / contributors to the variant type. If
>> most
>> >> of
>> >>> them are already PMC members / committers on a project, it becomes a
>> bit
>> >>> easier. Otherwise if there isn't much overlap with a project's
>> existing
>> >>> governance, I worry there could be a bit of friction. How many active
>> >>> contributors are there from Iceberg? And how about from Arrow?
>> >>>
>> >>> BTW, I'd add I'm interested in helping develop an Arrow extension type
>> >> for
>> >>> the binary variant type. I've been experimenting with a DataFusion
>> >>> extension that operates on this [1], and already have some ideas on
>> how
>> >>> such an extension type might be defined. I'm not yet caught up on the
>> >>> shredded specification, but I think having just the binary format
>> would
>> >> be
>> >>> beneficial for in-memory analytics, which are most relevant to Arrow.
>> >> I'll
>> >>> be creating a separate thread on the Arrow ML about this soon.
>> >>>
>> >>> Best,
>> >>>
>> >>> Will Jones
>> >>>
>> >>> [1]
>> >>>
>> >>
>> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>> >>>
>> >>>
>> >>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <[email protected]> wrote:
>> >>>
>> >>>> + dev@arrow
>> >>>>
>> >>>> Thanks for all the valuable suggestions! I am inclined toward Micah's
>> idea
>> >>> that
>> >>>> Arrow might be a better host compared to Parquet.
>> >>>>
>> >>>> To give more context, I am taking the initiative to add the geometry
>> >> type
>> >>>> to both Parquet and ORC. I'd like to do the same thing for variant
>> type
>> >>> in
>> >>>> that variant type is engine and file format agnostic. This does mean
>> >> that
>> >>>> Parquet might not be the neutral place to hold the variant spec.
>> >>>>
>> >>>> Best,
>> >>>> Gang
>> >>>>
>> >>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <[email protected]
>> >
>> >>>> wrote:
>> >>>>
>> >>>>> Thanks all for your discussion.
>> >>>>>
>> >>>>> The Apache Paimon community is also considering support for this
>> >>>>> Variant type, without a doubt, we hope to maintain consistency with
>> >>>>> Iceberg.
>> >>>>>
>> >>>>> Not only the Paimon community, but also various computing engines
>> >> need
>> >>>>> to adapt to this type, such as Flink and StarRocks. We also hope to
>> >>>>> promote them to adapt to this type.
>> >>>>>
>> >>>>> It is worth noting that we also need to standardize many functions
>> >>>>> related to it.
>> >>>>>
>> >>>>> A neutral place to maintain it is a great choice.
>> >>>>>
>> >>>>> - As Gang Wu said, a standalone project is good, just like
>> >>> RoaringBitmap
>> >>>>> [1].
>> >>>>> - As Ryan said, Parquet community is a neutral option too.
>> >>>>> - As Micah said, Arrow is also an option too.
>> >>>>>
>> >>>>> [1] https://github.com/RoaringBitmap
>> >>>>>
>> >>>>> Best,
>> >>>>> Jingsong
>> >>>>>
>> >>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
>> >> [email protected]
>> >>>>
>> >>>>> wrote:
>> >>>>>>>
>> >>>>>>> That's fair @Micah, so far all the discussions have been direct and
>> >>> off
>> >>>>> the dev list. Would you like to make the request on the public Spark
>> >>> Dev
>> >>>>> list? I would be glad to co-sign, I can also draft up a quick email
>> >> if
>> >>>> you
>> >>>>> don't have time.
>> >>>>>>
>> >>>>>>
>> >>>>>> I think once we come to consensus, if you have bandwidth, I think
>> >> the
>> >>>>> message might be better coming from you, as you have more context on
>> >>> some
>> >>>>> of the non-public conversations, the requirements from an Iceberg
>> >>>>> perspective on governance and the blockers that were encountered.
>> If
>> >>>>> details on the conversations can't be shared, (i.e. we are starting
>> >>> from
>> >>>>> scratch) it seems like suggesting a new project via SPIP might be
>> the
>> >>> way
>> >>>>> forward.  I'm happy to help with that if it is useful but I would
>> >> guess
>> >>>>> Aihua or Tyler might be in a better place to start as it seems they
>> >>> have
>> >>>>> done more serious thinking here.
>> >>>>>>
>> >>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to
>> >>>> help
>> >>>>> support the effort in those communities.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Micah
>> >>>>>>
>> >>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>> >>>>> [email protected]> wrote:
>> >>>>>>>
>> >>>>>>> That's fair @Micah, so far all the discussions have been direct and
>> >>> off
>> >>>>> the dev list. Would you like to make the request on the public Spark
>> >>> Dev
>> >>>>> list? I would be glad to co-sign, I can also draft up a quick email
>> >> if
>> >>>> you
>> >>>>> don't have time.
>> >>>>>>>
>> >>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>> >>>> [email protected]>
>> >>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> I agree that it would be beneficial to make a sub-project, the
>> >>> main
>> >>>>> problem is political and not logistic. I've been asking for movement
>> >>> from
>> >>>>> other relevant projects for a month and we simply haven't gotten
>> >>>> anywhere.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I just wanted to double check that these issues were brought
>> >>> directly
>> >>>>> to the spark community (i.e. a discussion thread on the Spark
>> >> developer
>> >>>>> mailing list) and not via backchannels.
>> >>>>>>>>
>> >>>>>>>> I'm not sure the outcome would be different and I don't think
>> >> this
>> >>>>> should block forking the spec, but we should make sure that the
>> >>> decision
>> >>>> is
>> >>>>> publicly documented within both communities.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Micah
>> >>>>>>>>
>> >>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>> >>>>> [email protected]> wrote:
>> >>>>>>>>>
>> >>>>>>>>> @Gang Wu
>> >>>>>>>>>
>> >>>>>>>>> I agree that it would be beneficial to make a sub-project, the
>> >>> main
>> >>>>> problem is political and not logistic. I've been asking for movement
>> >>> from
>> >>>>> other relevant projects for a month and we simply haven't gotten
>> >>>> anywhere.
>> >>>>> I don't think there is anything that would stop us from moving to a
>> >>> joint
>> >>>>> project in the future and if you know of some way of encouraging
>> that
>> >>>>> movement from other relevant parties I would be glad to collaborate
>> >> in
>> >>>>> doing that. One thing that I don't want to do is have the Iceberg
>> >>> project
>> >>>>> stay in a holding pattern without any clear roadmap as to how to
>> >>> proceed.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <[email protected]
>> >>>
>> >>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> I’m on board with copying the spec into our repository.
>> >> However,
>> >>> as
>> >>>>> we’ve talked about, it’s not just a straightforward copy—there are
>> >>>> already
>> >>>>> some divergences. Some of them are under discussion. Iceberg is
>> >>>> definitely
>> >>>>> the best place for these specs. Engines like Trino and Flink can
>> then
>> >>>> rely
>> >>>>> on the Iceberg specs as a solid foundation.
>> >>>>>>>>>>
>> >>>>>>>>>> Yufei
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]>
>> >>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Sorry for chiming in late.
>> >>>>>>>>>>>
>> >>>>>>>>>>>  From the discussion in
>> >>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>> >>>> don't
>> >>>>> quite understand why it is logistically complicated to create a
>> >>>> sub-project
>> >>>>> to hold the variant spec and impl.
>> >>>>>>>>>>>
>> >>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has
>> >> some
>> >>>>> deficiencies:
>> >>>>>>>>>>> - It is a burden to update two repos if there is a variant
>> >> type
>> >>>>> spec change and will likely result in deviation if some changes do
>> >> not
>> >>>>> reach agreement from both parties.
>> >>>>>>>>>>> - Implementers are required to keep an eye on both specs
>> >>>>> (considering proprietary engines where both Iceberg and Delta are
>> >>>>> supported).
>> >>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo
>> >> does
>> >>>>> lose the opportunity for better native support from file formats
>> like
>> >>>>> Parquet and ORC.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I'm not sure if it is possible to create a separate project
>> >>> (e.g.
>> >>>>> apache/variant-type) to make it a single point of truth. We can
>> learn
>> >>>> from
>> >>>>> the experience of Apache Arrow. In this fashion, different engines,
>> >>> table
>> >>>>> formats and file formats can follow the same spec and are free to
>> >>> depend
>> >>>> on
>> >>>>> the reference implementations from apache/variant-type or implement
>> >>> their
>> >>>>> own.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Gang
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <[email protected]
>> >>>
>> >>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> +1 for copying the spec into our repository, I think we need
>> >> to
>> >>>>> own it fully as a part of the table spec, and we can build
>> >>> compatibility
>> >>>>> through tests.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> -Jack
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>> >>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm not really in favor of linking and annotating as that
>> >> just
>> >>>>> makes things more complicated and still is essentially forking just
>> >>> with
>> >>>>> more steps. If we just track our annotations / modifications  to a
>> >>> single
>> >>>>> commit/version then we have the same issue again but now you have to
>> >> go
>> >>>> to
>> >>>>> multiple sources to get the actual Spec. In addition, our very copy
>> >> of
>> >>>> the
>> >>>>> Spec is going to require new types which don't exist in the Spark
>> >> Spec
>> >>>>> which necessarily means diverging. We will need to take up new
>> >>> primitive
>> >>>>> id's (as noted in my first email)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is
>> >>> really
>> >>>>> going through a thorough review process from all members of the
>> Spark
>> >>>>> community, I believe it probably should have gone through the SPIP
>> >> but
>> >>>>> instead seems to have been merged without broad community
>> >> involvement.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> The only way to truly avoid diverging is to only have a
>> >> single
>> >>>>> copy of the spec, in our previous discussions the vast majority of
>> >>> Apache
>> >>>>> Iceberg community want it to exist here.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
>> >>> [email protected]
>> >>>>>
>> >>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I'm really excited about the introduction of variant type
>> >> to
>> >>>>> Iceberg, but I want to raise concerns about forking the spec.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I feel like preemptively forking would create the situation
>> >>>>> where we end up diverging because there's little reason to work with
>> >>> both
>> >>>>> communities to evolve in a way that benefits everyone.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I would much rather point to a specific version of the spec
>> >>> and
>> >>>>> annotate any variance in Iceberg's handling.  This would allow us to
>> >>>>> continue without dividing the communities.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> If at any point there are irreconcilable differences, I
>> >> would
>> >>>>> support forking, but I don't feel like that should be the initial
>> >> step.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> No one is excited about the possibility that the physical
>> >>>>> representations end up diverging, but it feels like we're setting
>> >>>> ourselves
>> >>>>> up for that exact scenario.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> -Dan
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>> >>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy
>> >> the
>> >>>>> spec to Iceberg and add context that's specific to Iceberg, but at
>> >> the
>> >>>> same
>> >>>>> time, we should maintain compatibility.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Kind regards,
>> >>>>>>>>>>>>>>> Fokko
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Wed, 14 Aug 2024 at 15:30, Manu Zhang <
>> >>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best
>> >>> way
>> >>>>> to keep compatibility is building integration tests.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>> Manu
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>> >>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Given the differences between the supported types and
>> >> the
>> >>>>> lack of interest from the other project, I think it is reasonable to
>> >>>>> duplicate the specification to our repository.
>> >>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the
>> >> Spark
>> >>>>> spec as much as possible, to keep compatibility as much as possible.
>> >>>> Maybe
>> >>>>> even revert to a shared specification if the situation changes.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>> Peter
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <[email protected]>
>> >>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the
>> >> Variant
>> >>>>> support in Iceberg and hopefully we can have a consensus. To me, I
>> >> also
>> >>>>> feel it makes more sense to move the spec into Iceberg rather than
>> >>> have the Spark
>> >>>>> engine own it while we try to stay compatible with the Spark spec.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>> Aihua
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>> >>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Hi Y’all,
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant
>> >>> Proposal,
>> >>>>> while we were hoping to move the Variant and Shredding
>> specifications
>> >>>> from
>> >>>>> Spark into Iceberg there doesn’t seem to be a lot of interest in
>> >> that.
>> >>>>> Unfortunately, I think we have a number of issues with just linking
>> >> to
>> >>>> the
>> >>>>> Spark project directly from within Iceberg and I believe we need to
>> >>> copy
>> >>>>> the specifications into our repository.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The
>> >> Spark
>> >>>>> Specification already includes types which Iceberg has no definition
>> >>> for
>> >>>>> (19, 20 - Interval Types) and Iceberg already has a type which is
>> not
>> >>>>> included within the Spark Specification (Time) and will soon have
>> >> more
>> >>>> with
>> >>>>> TimestampNS, and Geo.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a
>> >>>> hard
>> >>>>> dependency for other engines. We are working with several
>> >> implementers
>> >>> of
>> >>>>> the Iceberg spec and it has previously been agreed that it would be
>> >>> best
>> >>>> if
>> >>>>> the source of truth for Variant existed in an engine and file format
>> >>>>> neutral location. The Iceberg project has a good open model of
>> >>> governance
>> >>>>> and, as we have seen so far discussing Variant, open and active
>> >>>>> collaboration. This would also help as we can strictly version our
>> >>>> changes
>> >>>>> in-line with the rest of the Iceberg spec.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
>> >>>>> requires some group analysis and discussion before we commit it. I
>> >>> think
>> >>>>> again the Iceberg community is probably the right place for this to
>> >>>> happen
>> >>>>> as we have already started discussions here on these topics.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct
>> >>> copy
>> >>>>> of the existing specification from the Spark Project and move ahead
>> >>> with
>> >>>>> our discussions and modifications within Iceberg. That said, I do
>> not
>> >>>> want
>> >>>>> to diverge if possible from the Spark proposal. For example,
>> although
>> >>> we
>> >>>> do
>> >>>>> not use the Interval types above, I think we should not reuse those
>> >>> type
>> >>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would
>> >>> remain
>> >>>>> unused along with any other types we think are not applicable. We
>> >>> should
>> >>>>> strive whenever possible to allow for compatibility.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I
>> >>> am
>> >>>>> hoping to see if anyone in the community objects to this plan going
>> >>>> forward
>> >>>>> or has a better alternative.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to
>> >>> hear
>> >>>>> back from everyone,
>> >>>>>>>>>>>>>>>>>>> Russ
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
