Re: [DISCUSS] Variant Spec Location

Gang Wu Fri, 23 Aug 2024 02:22:49 -0700

Thanks Aihua!

I've started the discussion in dev@parquet:
https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z


Best,
Gang

On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <aihua...@snowflake.com> wrote:

> From this thread
> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems
> the Spark community is leaning toward moving to Parquet.
>
> Gang, can you help start a discussion in the Parquet community on adopting
> and maintaining the Variant spec?
>
> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org>
> wrote:
>
>> This seems to straddle that line, in that you can also view this as a way
>> to represent semi-structured data in a manner that allows for more
>> efficient querying and computation by breaking out some of its components
>> into a more structured form.
>>
>> (I also happen to want a canonical Arrow representation for variant data,
>> as this type occurs in many databases but doesn't have a great
>> representation today in ADBC results. That's why I filed [Format]
>> Consider adding an official variant type to Arrow · Issue #42069 ·
>> apache/arrow (github.com) <https://github.com/apache/arrow/issues/42069>.
>> Of course, there's no specific reason why a canonical Arrow
>> representation for variants must align with Spark and/or Iceberg.)
>>
>> -Curt
>>
>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org>
>> wrote:
>>
>>>
>>> Ah, thanks. I've tried to find a rationale and ended up on
>>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
>>> a good description of what you're after?
>>>
>>> If so, then I don't think Arrow is a good match. This seems mostly to be
>>> a marshalling format for semi-structured data (like Avro?). Arrow data
>>> types are meant to be in a representation ideal for querying and
>>> computation, rather than transport and storage.
>>>
>>> This could be developed separately and then be represented in Arrow
>>> using an extension type (perhaps a canonical one as in
>>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>>
>>> What do other Arrow developers think?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> Le 22/08/2024 à 10:45, Gang Wu a écrit :
>>> > Sorry for the inconvenience.
>>> >
>>> > This is the permalink for the discussion:
>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>>> >
>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org>
>>> wrote:
>>> >
>>> >>
>>> >> Hi Gang,
>>> >>
>>> >> Sorry, but can you give a pointer to the start of this discussion
>>> >> thread in a readable format (for example a mailing-list archive)? It
>>> >> appears that dev@arrow wasn't cc'ed from the start, and that can make
>>> >> it difficult to understand what this is about.
>>> >>
>>> >> Regards
>>> >>
>>> >> Antoine.
>>> >>
>>> >>
>>> >> Le 22/08/2024 à 08:32, Gang Wu a écrit :
>>> >>> It seems that we have reached a consensus to some extent that there
>>> >>> should be a new home for the variant spec. The pending question
>>> >>> is whether Parquet or Arrow is a better choice. As a committer in the
>>> >>> Arrow, Parquet and ORC communities, I have no preference between them
>>> >>> and am happy to help with the move once a decision has been made.
>>> >>>
>>> >>> Should we start a vote to move forward?
>>> >>>
>>> >>> Best,
>>> >>> Gang
>>> >>>
>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <
>>> emkornfi...@gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>>>>
>>> >>>>> That being said, I think the most important consideration for now
>>> is
>>> >>>> where
>>> >>>>> are the current maintainers / contributors to the variant type. If
>>> most
>>> >>>> of
>>> >>>>> them are already PMC members / committers on a project, it becomes
>>> a
>>> >> bit
>>> >>>>> easier. Otherwise if there isn't much overlap with a project's
>>> existing
>>> >>>>> governance, I worry there could be a bit of friction. How many
>>> active
>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>> >>>>
>>> >>>>
>>> >>>> I think this is the key question. What are the requirements around
>>> >>>> governance?  I've seen some tangential messaging here but I'm not clear
>>> >>>> on what everyone expects.
>>> >>>>
>>> >>>> I think for a lot of the other concerns my view is that the exact
>>> >>>> project does not really matter (and choosing a project with mature
>>> >>>> cross language testing infrastructure or committing to building it is
>>> >>>> critical). IIUC we are talking about the following artifacts:
>>> >>>>
>>> >>>> 1.  A stand alone specification document (this can be hosted anyplace)
>>> >>>> 2.  A set of language bindings with minimal dependencies that can be
>>> >>>> consumed downstream (again, as long as dependencies are managed
>>> >>>> carefully any project can host these)
>>> >>>> 3.  Potential integration where appropriate into file format libraries
>>> >>>> to support shredding (but as of now this is being bypassed by using
>>> >>>> conventions anyways).  My impression is that at least for Parquet there
>>> >>>> has been a proliferation of vectorized readers across different
>>> >>>> projects, so I'm not clear how much standardization in parquet-java
>>> >>>> could help here.
>>> >>>>
>>> >>>> To respond to some other questions:
>>> >>>>
>>> >>>>> Arrow is not used as Spark's in-memory model, nor Trino and others so
>>> >>>>> those existing relationships aren't there. I also worry that
>>> >>>>> differences in approaches would make it difficult later on.
>>> >>>>
>>> >>>>
>>> >>>> While Arrow is not in the core memory model, for Spark I believe it is
>>> >>>> still used for IPC for things like Java<->Python. Trino also consumes
>>> >>>> Arrow libraries today to support things like Snowflake/Bigquery
>>> >>>> federation. But I think this is minor because as mentioned above I
>>> >>>> think the functional libraries would be relatively stand-alone.
>>> >>>>
>>> >>>>> Do we think it could be introduced as a canonical extension arrow type?
>>> >>>>
>>> >>>>
>>> >>>> I believe it can be. I think there are probably different layouts that
>>> >>>> can be supported:
>>> >>>>
>>> >>>> 1.  A struct with two variable width bytes columns (metadata and value
>>> >>>> data are stored separately and each entry has a 1:1 relationship).
>>> >>>> 2.  Shredded (shredded according to the same convention as parquet). I
>>> >>>> would need to double check, but I don't think Arrow would have problems
>>> >>>> here. REE (run-end encoding) would likely be required to make this
>>> >>>> efficient (i.e. sparse value support is important).
>>> >>>>
>>> >>>> In both cases the main complexity is providing the necessary functions
>>> >>>> for manipulation.
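
As a rough illustration of layout 1 above, the sketch below models a variant column as a struct of two Arrow-style variable-width binary children, each stored as an offsets list plus one contiguous data buffer. It is pure Python rather than any Arrow library, the class names BinaryColumn and VariantStructColumn are hypothetical, and the byte strings are opaque placeholders, not real variant encodings.

```python
from dataclasses import dataclass, field


@dataclass
class BinaryColumn:
    """Variable-width binary column: offsets[i]..offsets[i+1] delimits row i."""
    offsets: list = field(default_factory=lambda: [0])
    data: bytearray = field(default_factory=bytearray)

    def append(self, value: bytes) -> None:
        # Append the bytes to the shared buffer and record where the row ends.
        self.data += value
        self.offsets.append(len(self.data))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> bytes:
        if not 0 <= i < len(self):
            raise IndexError(i)
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]])


@dataclass
class VariantStructColumn:
    """Hypothetical struct column with two binary children: metadata and value."""
    metadata: BinaryColumn = field(default_factory=BinaryColumn)
    value: BinaryColumn = field(default_factory=BinaryColumn)

    def append(self, metadata: bytes, value: bytes) -> None:
        # Both children grow together, so row i of one pairs with row i of the other.
        self.metadata.append(metadata)
        self.value.append(value)

    def __len__(self) -> int:
        return len(self.metadata)

    def __getitem__(self, i: int) -> tuple:
        return (self.metadata[i], self.value[i])


column = VariantStructColumn()
column.append(b"m0", b"v0")        # placeholder bytes, not a real encoding
column.append(b"m1", b"value-1")
print(len(column), column[1], column.value.offsets)
```

Because both children are appended in lock step, row i of metadata always pairs with row i of value, which is the 1:1 relationship described above.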
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Micah
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com
>>> >
>>> >>>> wrote:
>>> >>>>
>>> >>>>> In being more engine and format agnostic, I agree the Arrow project
>>> >>>>> might be a good host for such a specification. It seems like we want to
>>> >>>>> move away from hosting in Spark to make it engine agnostic. But moving
>>> >>>>> into Iceberg might make it less format agnostic, as I understand
>>> >>>>> multiple formats might want to implement this. I'm not intimately
>>> >>>>> familiar with the state of this, but I believe Delta Lake would like to
>>> >>>>> be aligned with the same format as Iceberg. In addition, the Lance
>>> >>>>> format (which I work on) will eventually be interesting as well. It
>>> >>>>> seems equally bad to me to attach this specification to a particular
>>> >>>>> table format as to a particular query engine.
>>> >>>>>
>>> >>>>> That being said, I think the most important consideration for now
>>> is
>>> >>>> where
>>> >>>>> are the current maintainers / contributors to the variant type. If
>>> most
>>> >>>> of
>>> >>>>> them are already PMC members / committers on a project, it becomes
>>> a
>>> >> bit
>>> >>>>> easier. Otherwise if there isn't much overlap with a project's
>>> existing
>>> >>>>> governance, I worry there could be a bit of friction. How many
>>> active
>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>> >>>>>
>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension type
>>> >>>>> for the binary variant type. I've been experimenting with a DataFusion
>>> >>>>> extension that operates on this [1], and already have some ideas on how
>>> >>>>> such an extension type might be defined. I'm not yet caught up on the
>>> >>>>> shredded specification, but I think having just the binary format would
>>> >>>>> be beneficial for in-memory analytics, which are most relevant to Arrow.
>>> >>>>> I'll be creating a separate thread on the Arrow ML about this soon.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>>
>>> >>>>> Will Jones
>>> >>>>>
>>> >>>>> [1]
>>> >>>>> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>>> >>>>>
>>> >>>>>
>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> + dev@arrow
>>> >>>>>>
>>> >>>>>> Thanks for all the valuable suggestions! I am inclined toward Micah's
>>> >>>>>> idea that Arrow might be a better host compared to Parquet.
>>> >>>>>>
>>> >>>>>> To give more context, I am taking the initiative to add the geometry
>>> >>>>>> type to both Parquet and ORC. I'd like to do the same thing for the
>>> >>>>>> variant type, since it is engine and file format agnostic. This does
>>> >>>>>> mean that Parquet might not be the neutral place to hold the variant
>>> >>>>>> spec.
>>> >>>>>>
>>> >>>>>> Best,
>>> >>>>>> Gang
>>> >>>>>>
>>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <
>>> jingsongl...@gmail.com>
>>> >>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> Thanks all for your discussion.
>>> >>>>>>>
>>> >>>>>>> The Apache Paimon community is also considering support for this
>>> >>>>>>> Variant type. Without a doubt, we hope to maintain consistency with
>>> >>>>>>> Iceberg.
>>> >>>>>>>
>>> >>>>>>> Not only the Paimon community, but also various computing engines
>>> >>>>>>> such as Flink and StarRocks need to adapt to this type. We also hope
>>> >>>>>>> to encourage them to do so.
>>> >>>>>>>
>>> >>>>>>> It is worth noting that we also need to standardize many
>>> functions
>>> >>>>>>> related to it.
>>> >>>>>>>
>>> >>>>>>> A neutral place to maintain it is a great choice.
>>> >>>>>>>
>>> >>>>>>> - As Gang Wu said, a standalone project is good, just like
>>> >>>>> RoaringBitmap
>>> >>>>>>> [1].
>>> >>>>>>> - As Ryan said, Parquet community is a neutral option too.
>>> >>>>>>> - As Micah said, Arrow is an option too.
>>> >>>>>>>
>>> >>>>>>> [1] https://github.com/RoaringBitmap
>>> >>>>>>>
>>> >>>>>>> Best,
>>> >>>>>>> Jingsong
>>> >>>>>>>
>>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
>>> >>>> emkornfi...@gmail.com
>>> >>>>>>
>>> >>>>>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
>>> and
>>> >>>>> off
>>> >>>>>>> the dev list. Would you like to make the request on the public
>>> Spark
>>> >>>>> Dev
>>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick
>>> email
>>> >>>> if
>>> >>>>>> you
>>> >>>>>>> don't have time.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I
>>> think
>>> >>>> the
>>> >>>>>>> message might be better coming from you, as you have more
>>> context on
>>> >>>>> some
>>> >>>>>>> of the non-public conversations, the requirements from an Iceberg
>>> >>>>>>> perspective on governance and the blockers that were
>>> encountered.  If
>>> >>>>>>> details on the conversations can't be shared, (i.e. we are
>>> starting
>>> >>>>> from
>>> >>>>>>> scratch) it seems like suggesting a new project via SPIP might
>>> be the
>>> >>>>> way
>>> >>>>>>> forward.  I'm happy to help with that if it is useful but I would
>>> >>>> guess
>>> >>>>>>> Aihua or Tyler might be in a better place to start as it seems
>>> they
>>> >>>>> have
>>> >>>>>>> done more serious thinking here.
>>> >>>>>>>>
>>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm
>>> happy to
>>> >>>>>> help
>>> >>>>>>> support the effort in those communities.
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>> Micah
>>> >>>>>>>>
>>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct
>>> and
>>> >>>>> off
>>> >>>>>>> the dev list. Would you like to make the request on the public
>>> Spark
>>> >>>>> Dev
>>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick
>>> email
>>> >>>> if
>>> >>>>>> you
>>> >>>>>>> don't have time.
>>> >>>>>>>>>
>>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>>> >>>>>> emkornfi...@gmail.com>
>>> >>>>>>> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project,
>>> the
>>> >>>>> main
>>> >>>>>>> problem is political and not logistic. I've been asking for
>>> movement
>>> >>>>> from
>>> >>>>>>> other relevant projects for a month and we simply haven't gotten
>>> >>>>>> anywhere.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> I just wanted to double check that these issues were brought
>>> >>>>> directly
>>> >>>>>>> to the spark community (i.e. a discussion thread on the Spark
>>> >>>> developer
>>> >>>>>>> mailing list) and not via backchannels.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think
>>> >>>> this
>>> >>>>>>> should block forking the spec, but we should make sure that the
>>> >>>>> decision
>>> >>>>>> is
>>> >>>>>>> publicly documented within both communities.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks,
>>> >>>>>>>>>> Micah
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> @Gang Wu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project,
>>> the
>>> >>>>> main
>>> >>>>>>> problem is political and not logistic. I've been asking for
>>> movement
>>> >>>>> from
>>> >>>>>>> other relevant projects for a month and we simply haven't gotten
>>> >>>>>> anywhere.
>>> >>>>>>> I don't think there is anything that would stop us from moving
>>> to a
>>> >>>>> joint
>>> >>>>>>> project in the future and if you know of some way of encouraging
>>> that
>>> >>>>>>> movement from other relevant parties I would be glad to
>>> collaborate
>>> >>>> in
>>> >>>>>>> doing that. One thing that I don't want to do is have the Iceberg
>>> >>>>> project
>>> >>>>>>> stay in a holding pattern without any clear roadmap as to how to
>>> >>>>> proceed.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <
>>> flyrain...@gmail.com
>>> >>>>>
>>> >>>>>>> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I’m on board with copying the spec into our repository.
>>> >>>> However,
>>> >>>>> as
>>> >>>>>>> we’ve talked about, it’s not just a straightforward copy—there
>>> are
>>> >>>>>> already
>>> >>>>>>> some divergences. Some of them are under discussion. Iceberg is
>>> >>>>>> definitely
>>> >>>>>>> the best place for these specs. Engines like Trino and Flink can
>>> then
>>> >>>>>> rely
>>> >>>>>>> on the Iceberg specs as a solid foundation.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Yufei
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com>
>>> >>>>> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> Sorry for chiming in late.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>   From the discussion in
>>> >>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq,
>>> I
>>> >>>>>> don't
>>> >>>>>>> quite understand why it is logistically complicated to create a
>>> >>>>>> sub-project
>>> >>>>>>> to hold the variant spec and impl.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has
>>> >>>> some
>>> >>>>>>> deficiencies:
>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant
>>> >>>> type
>>> >>>>>>> spec change and will likely result in deviation if some changes
>>> do
>>> >>>> not
>>> >>>>>>> reach agreement from both parties.
>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs
>>> >>>>>>> (considering proprietary engines where both Iceberg and Delta are
>>> >>>>>>> supported).
>>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo
>>> >>>> does
>>> >>>>>>> lose the opportunity for better native support from file formats
>>> like
>>> >>>>>>> Parquet and ORC.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project
>>> >>>>> (e.g.
>>> >>>>>>> apache/variant-type) to make it a single point of truth. We can
>>> learn
>>> >>>>>> from
>>> >>>>>>> the experience of Apache Arrow. In this fashion, different
>>> engines,
>>> >>>>> table
>>> >>>>>>> formats and file formats can follow the same spec and are free to
>>> >>>>> depend
>>> >>>>>> on
>>> >>>>>>> the reference implementations from apache/variant-type or
>>> implement
>>> >>>>> their
>>> >>>>>>> own.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>> Gang
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <
>>> yezhao...@gmail.com
>>> >>>>>
>>> >>>>>>> wrote:
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we
>>> need
>>> >>>> to
>>> >>>>>>> own it fully as a part of the table spec, and we can build
>>> >>>>> compatibility
>>> >>>>>>> through tests.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> -Jack
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that
>>> >>>> just
>>> >>>>>>> makes things more complicated and still is essentially forking
>>> just
>>> >>>>> with
>>> >>>>>>> more steps. If we just track our annotations / modifications  to
>>> a
>>> >>>>> single
>>> >>>>>>> commit/version then we have the same issue again but now you
>>> have to
>>> >>>> go
>>> >>>>>> to
>>> >>>>>>> multiple sources to get the actual Spec. In addition, our very
>>> copy
>>> >>>> of
>>> >>>>>> the
>>> >>>>>>> Spec is going to require new types which don't exist in the Spark
>>> >>>> Spec
>>> >>>>>>> which necessarily means diverging. We will need to take up new
>>> >>>>> primitive
>>> >>>>>>> id's (as noted in my first email)
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is
>>> >>>>> really
>>> >>>>>>> going through a thorough review process from all members of the
>>> Spark
>>> >>>>>>> community, I believe it probably should have gone through the
>>> SPIP
>>> >>>> but
>>> >>>>>>> instead seems to have been merged without broad community
>>> >>>> involvement.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a
>>> >>>> single
>>> >>>>>>> copy of the spec, in our previous discussions the vast majority
>>> of
>>> >>>>> Apache
>>> >>>>>>> Iceberg community want it to exist here.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
>>> >>>>> dwe...@apache.org
>>> >>>>>>>
>>> >>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant
>>> type
>>> >>>> to
>>> >>>>>>> Iceberg, but I want to raise concerns about forking the spec.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the
>>> situation
>>> >>>>>>> where we end up diverging because there's little reason to work
>>> with
>>> >>>>> both
>>> >>>>>>> communities to evolve in a way that benefits everyone.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the
>>> spec
>>> >>>>> and
>>> >>>>>>> annotate any variance in Iceberg's handling.  This would allow
>>> us to
>>> >>>>>>> continue without dividing the communities.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I
>>> >>>> would
>>> >>>>>>> support forking, but I don't feel like that should be the initial
>>> >>>> step.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the
>>> physical
>>> >>>>>>> representations end up diverging, but it feels like we're setting
>>> >>>>>> ourselves
>>> >>>>>>> up for that exact scenario.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> -Dan
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>>> >>>>>>> fo...@apache.org> wrote:
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to
>>> copy
>>> >>>> the
>>> >>>>>>> spec to Iceberg and add context that's specific to Iceberg, but
>>> at
>>> >>>> the
>>> >>>>>> same
>>> >>>>>>> time, we should maintain compatibility.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Kind regards,
>>> >>>>>>>>>>>>>>>>> Fokko
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang <
>>> >>>>>>> owenzhang1...@gmail.com>:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the
>>> best
>>> >>>>> way
>>> >>>>>>> to keep compatibility is building integration tests.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>> Manu
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>> >>>>>>> peter.vary.apa...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and
>>> >>>> the
>>> >>>>>>> lack of interest from the other project, I think it is
>>> reasonable to
>>> >>>>>>> duplicate the specification to our repository.
>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the
>>> >>>> Spark
>>> >>>>>>> spec as much as possible, to keep compatibility as much as
>>> possible.
>>> >>>>>> Maybe
>>> >>>>>>> even revert to a shared specification if the situation changes.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>>> Peter
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Aihua Xu <aihu...@gmail.com> ezt írta (időpont:
>>> 2024.
>>> >>>>> aug.
>>> >>>>>>> 13., K, 19:52):
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the
>>> >>>> Variant
>>> >>>>>>> support in Iceberg and hopefully we can have a consensus. To me,
>>> I
>>> >>>> also
>>> >>>>>>> feel it makes more sense to move the spec into Iceberg rather than
>>> >>>>>>> have the Spark engine own it while we try to keep it compatible with
>>> >>>>>>> the Spark spec.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>>>> Aihua
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant
>>> >>>>> Proposal,
>>> >>>>>>> while we were hoping to move the Variant and Shredding
>>> specifications
>>> >>>>>> from
>>> >>>>>>> Spark into Iceberg there doesn’t seem to be a lot of interest in
>>> >>>> that.
>>> >>>>>>> Unfortunately, I think we have a number of issues with just
>>> linking
>>> >>>> to
>>> >>>>>> the
>>> >>>>>>> Spark project directly from within Iceberg and I believe we need
>>> to
>>> >>>>> copy
>>> >>>>>>> the specifications into our repository.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The
>>> >>>> Spark
>>> >>>>>>> Specification already includes types which Iceberg has no
>>> definition
>>> >>>>> for
>>> >>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which
>>> is not
>>> >>>>>>> included within the Spark Specification (Time) and will soon have
>>> >>>> more
>>> >>>>>> with
>>> >>>>>>> TimestampNS, and Geo.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Second, We would like to make sure that Spark is
>>> not a
>>> >>>>>> hard
>>> >>>>>>> dependency for other engines. We are working with several
>>> >>>> implementers
>>> >>>>> of
>>> >>>>>>> the Iceberg spec and it has previously been agreed that it would
>>> be
>>> >>>>> best
>>> >>>>>> if
>>> >>>>>>> the source of truth for Variant existed in an engine and file
>>> format
>>> >>>>>>> neutral location. The Iceberg project has a good open model of
>>> >>>>> governance
>>> >>>>>>> and, as we have seen so far discussing Variant, open and active
>>> >>>>>>> collaboration. This would also help as we can strictly version
>>> our
>>> >>>>>> changes
>>> >>>>>>> in-line with the rest of the Iceberg spec.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Third, The Shredding spec is not quite finished and
>>> >>>>>>> requires some group analysis and discussion before we commit it.
>>> I
>>> >>>>> think
>>> >>>>>>> again the Iceberg community is probably the right place for this
>>> to
>>> >>>>>> happen
>>> >>>>>>> as we have already started discussions here on these topics.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a
>>> direct
>>> >>>>> copy
>>> >>>>>>> of the existing specification from the Spark Project and move
>>> ahead
>>> >>>>> with
>>> >>>>>>> our discussions and modifications within Iceberg. That said, I
>>> do not
>>> >>>>>> want
>>> >>>>>>> to diverge if possible from the Spark proposal. For example,
>>> although
>>> >>>>> we
>>> >>>>>> do
>>> >>>>>>> not use the Interval types above, I think we should not reuse
>>> those
>>> >>>>> type
>>> >>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would
>>> >>>>> remain
>>> >>>>>>> unused along with any other types we think are not applicable. We
>>> >>>>> should
>>> >>>>>>> strive whenever possible to allow for compatibility.
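
The id-reservation idea above can be sketched as follows. This is only an illustration under stated assumptions: the helper next_free_type_id is hypothetical, and the set of already-assigned ids (0 through 18) is made up for the example. Only the reserved ids 19 and 20 come from the discussion.

```python
# Ids the thread proposes to leave unused in Iceberg (Spark's interval types).
RESERVED_TYPE_IDS = frozenset({19, 20})


def next_free_type_id(used_ids, reserved_ids=RESERVED_TYPE_IDS):
    """Return the smallest non-negative id that is neither used nor reserved."""
    candidate = 0
    while candidate in used_ids or candidate in reserved_ids:
        candidate += 1
    return candidate


# Assumption for illustration only: pretend ids 0..18 are already assigned.
used = set(range(19))

time_id = next_free_type_id(used)          # skips the reserved ids 19 and 20
used.add(time_id)
timestamp_ns_id = next_free_type_id(used)

print(time_id, timestamp_ns_id)            # prints "21 22"
```

Skipping the reserved ids means a value written with id 19 or 20 by one implementation can never be misread as a different type by the other.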
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this
>>> proposal I
>>> >>>>> am
>>> >>>>>>> hoping to see if anyone in the community objects to this plan
>>> going
>>> >>>>>> forward
>>> >>>>>>> or has a better alternative.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager
>>> to
>>> >>>>> hear
>>> >>>>>>> back from everyone,
>>> >>>>>>>>>>>>>>>>>>>>> Russ
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
