Thanks Aihua! I've started the discussion in dev@parquet:
https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z

Best,
Gang

On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <aihua...@snowflake.com> wrote:

> From this thread
> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, seems
> Spark community is leaning toward moving to Parquet.
>
> Gang, can you help start a discussion in the parquet community on adopting
> and maintaining such Variant spec?
>
> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org>
> wrote:
>
>> This seems to straddle that line, in that you can also view this as a way
>> to represent semi-structured data in a manner that allows for more
>> efficient querying and computation by breaking out some of its components
>> into a more structured form.
>>
>> (I also happen to want a canonical Arrow representation for variant data,
>> as this type occurs in many databases but doesn't have a great
>> representation today in ADBC results. That's why I filed [Format]
>> Consider adding an official variant type to Arrow · Issue #42069 ·
>> apache/arrow (github.com) <https://github.com/apache/arrow/issues/42069>.
>> Of course, there's no specific reason why a canonical Arrow
>> representation for variants must align with Spark and/or Iceberg.)
>>
>> -Curt
>>
>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org>
>> wrote:
>>
>>>
>>> Ah, thanks. I've tried to find a rationale and ended up on
>>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
>>> a good description of what you're after?
>>>
>>> If so, then I don't think Arrow is a good match. This seems mostly to be
>>> a marshalling format for semi-structured data (like Avro?). Arrow data
>>> types are meant to be in a representation ideal for querying and
>>> computation, rather than transport and storage.
>>>
>>> This could be developed separately and then be represented in Arrow
>>> using an extension type (perhaps a canonical one as in
>>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>>
>>> What do other Arrow developers think?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 22/08/2024 at 10:45, Gang Wu wrote:
>>> > Sorry for the inconvenience.
>>> >
>>> > This is the permalink for the discussion:
>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>>> >
>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org> wrote:
>>> >
>>> >>
>>> >> Hi Gang,
>>> >>
>>> >> Sorry, but can you give a pointer to the start of this discussion thread
>>> >> in a readable format (for example a mailing-list archive)? It appears
>>> >> that dev@arrow wasn't cc'ed from the start and that can make it
>>> >> difficult to understand what this is about.
>>> >>
>>> >> Regards
>>> >>
>>> >> Antoine.
>>> >>
>>> >>
>>> >> On 22/08/2024 at 08:32, Gang Wu wrote:
>>> >>> It seems that we have reached a consensus to some extent that there
>>> >>> should be a new home for the variant spec. The pending question
>>> >>> is whether Parquet or Arrow is a better choice. As a committer from Arrow,
>>> >>> Parquet and ORC communities, I am neutral to choose any and happy to
>>> >>> help with the movement once a decision has been made.
>>> >>>
>>> >>> Should we start a vote to move forward?
>>> >>>
>>> >>> Best,
>>> >>> Gang
>>> >>>
>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <emkornfi...@gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>>>>
>>> >>>>> That being said, I think the most important consideration for now is where
>>> >>>>> are the current maintainers / contributors to the variant type. If most of
>>> >>>>> them are already PMC members / committers on a project, it becomes a bit
>>> >>>>> easier. Otherwise if there isn't much overlap with a project's existing
>>> >>>>> governance, I worry there could be a bit of friction. How many active
>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>> >>>>
>>> >>>>
>>> >>>> I think this is the key question.
What are the requirements around >>> >>>> governance? I've seen some tangential messaging here but I'm not >>> clear >>> >> on >>> >>>> what everyone expects. >>> >>>> >>> >>>> I think for a lot of the other concerns my view is that the exact >>> >> project >>> >>>> does not really matter (and choosing a project with mature cross >>> >> language >>> >>>> testing infrastructure or committing to building it is critical). >>> IIUC >>> >> we >>> >>>> are talking about following artifacts: >>> >>>> >>> >>>> 1. A stand alone specification document (this can be hosted >>> anyplace) >>> >>>> 2. A set of language bindings with minimal dependencies can be >>> consumed >>> >>>> downstream (again, as long as dependencies are managed carefully any >>> >>>> project can host these) >>> >>>> 3. Potential integration where appropriate into file format >>> libraries >>> >> to >>> >>>> support shredding (but as of now this is being bypassed by using >>> >>>> conventions anyways). My impression is that at least for Parquet >>> there >>> >> has >>> >>>> been a proliferation of vectorized readers across different >>> projects, so >>> >>>> I'm not clear how much standardization in parquet-java could help >>> here. >>> >>>> >>> >>>> To respond to some other questions: >>> >>>> >>> >>>> Arrow is not used as Spark's in-memory model, nor Trino and others >>> so >>> >> those >>> >>>>> existing relationships aren't there. I also worry that differences >>> in >>> >>>>> approaches would make it difficult later on. >>> >>>> >>> >>>> >>> >>>> While Arrow is not in the core memory model, for Spark I believe it >>> is >>> >>>> still used for IPC for things like Java<->Python. Trino also >>> consumes >>> >> Arrow >>> >>>> libraries today to support things like Snowflake/Bigquery >>> federation. >>> >> But I >>> >>>> think this is minor because as mentioned above I think the >>> functional >>> >>>> libraries would be relatively stand-alone. 
>>> >>>>
>>> >>>> Do we think it could be introduced as a canonical extension arrow type?
>>> >>>>
>>> >>>>
>>> >>>> I believe it can be, I think there are probably different layouts that can
>>> >>>> be supported:
>>> >>>>
>>> >>>> 1. A struct with two variable width bytes columns (metadata and value data
>>> >>>> are stored separately and each entry has a 1:1 relationship).
>>> >>>> 2. Shredded (shredded according to the same convention as parquet), I
>>> >>>> would need to double check but I don't think Arrow would have problems here
>>> >>>> but REE would likely be required to make this efficient (i.e. sparse value
>>> >>>> support is important).
>>> >>>>
>>> >>>> In both cases the main complexity is providing the necessary functions for
>>> >>>> manipulation.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Micah
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com> wrote:
>>> >>>>
>>> >>>>> In being more engine and format agnostic, I agree the Arrow project might
>>> >>>>> be a good host for such a specification. It seems like we want to move away
>>> >>>>> from hosting in Spark to make it engine agnostic. But moving into Iceberg
>>> >>>>> might make it less format agnostic, as I understand multiple formats might
>>> >>>>> want to implement this. I'm not intimately familiar with the state of this,
>>> >>>>> but I believe Delta Lake would like to be aligned with the same format as
>>> >>>>> Iceberg. In addition, the Lance format (which I work on), will eventually
>>> >>>>> be interesting as well. It seems equally bad to me to attach this
>>> >>>>> specification to a particular table format as it does a particular query
>>> >>>>> engine.
>>> >>>>>
>>> >>>>> That being said, I think the most important consideration for now is where
>>> >>>>> are the current maintainers / contributors to the variant type. If most of
>>> >>>>> them are already PMC members / committers on a project, it becomes a bit
>>> >>>>> easier. Otherwise if there isn't much overlap with a project's existing
>>> >>>>> governance, I worry there could be a bit of friction. How many active
>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>> >>>>>
>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension type for
>>> >>>>> the binary variant type. I've been experimenting with a DataFusion
>>> >>>>> extension that operates on this [1], and already have some ideas on how
>>> >>>>> such an extension type might be defined. I'm not yet caught up on the
>>> >>>>> shredded specification, but I think having just the binary format would be
>>> >>>>> beneficial for in-memory analytics, which are most relevant to Arrow. I'll
>>> >>>>> be creating a separate thread on the Arrow ML about this soon.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>>
>>> >>>>> Will Jones
>>> >>>>>
>>> >>>>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>>> >>>>>
>>> >>>>>
>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> + dev@arrow
>>> >>>>>>
>>> >>>>>> Thanks for all the valuable suggestions! I am inclined to Micah's idea that
>>> >>>>>> Arrow might be a better host compared to Parquet.
>>> >>>>>>
>>> >>>>>> To give more context, I am taking the initiative to add the geometry type
>>> >>>>>> to both Parquet and ORC. I'd like to do the same thing for variant type in
>>> >>>>>> that variant type is engine and file format agnostic.
This does >>> mean >>> >>>> that >>> >>>>>> Parquet might not be the neutral place to hold the variant spec. >>> >>>>>> >>> >>>>>> Best, >>> >>>>>> Gang >>> >>>>>> >>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li < >>> jingsongl...@gmail.com> >>> >>>>>> wrote: >>> >>>>>> >>> >>>>>>> Thanks all for your discussion. >>> >>>>>>> >>> >>>>>>> The Apache Paimon community is also considering support for this >>> >>>>>>> Variant type, without a doubt, we hope to maintain consistency >>> with >>> >>>>>>> Iceberg. >>> >>>>>>> >>> >>>>>>> Not only the Paimon community, but also various computing engines >>> >>>> need >>> >>>>>>> to adapt to this type, such as Flink and StarRocks. We also hope >>> to >>> >>>>>>> promote them to adapt to this type. >>> >>>>>>> >>> >>>>>>> It is worth noting that we also need to standardize many >>> functions >>> >>>>>>> related to it. >>> >>>>>>> >>> >>>>>>> A neutral place to maintain it is a great choice. >>> >>>>>>> >>> >>>>>>> - As Gang Wu said, a standalone project is good, just like >>> >>>>> RoaringBitmap >>> >>>>>>> [1]. >>> >>>>>>> - As Ryan said, Parquet community is a neutral option too. >>> >>>>>>> - As Micah said, Arrow is also an option too. >>> >>>>>>> >>> >>>>>>> [1] https://github.com/RoaringBitmap >>> >>>>>>> >>> >>>>>>> Best, >>> >>>>>>> Jingsong >>> >>>>>>> >>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield < >>> >>>> emkornfi...@gmail.com >>> >>>>>> >>> >>>>>>> wrote: >>> >>>>>>>>> >>> >>>>>>>>> Thats fair @Micah, so far all the discussions have been direct >>> and >>> >>>>> off >>> >>>>>>> the dev list. Would you like to make the request on the public >>> Spark >>> >>>>> Dev >>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick >>> email >>> >>>> if >>> >>>>>> you >>> >>>>>>> don't have time. 
>>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I >>> think >>> >>>> the >>> >>>>>>> message might be better coming from you, as you have more >>> context on >>> >>>>> some >>> >>>>>>> of the non-public conversations, the requirements from an Iceberg >>> >>>>>>> perspective on governance and the blockers that were >>> encountered. If >>> >>>>>>> details on the conversations can't be shared, (i.e. we are >>> starting >>> >>>>> from >>> >>>>>>> scratch) it seems like suggesting a new project via SPIP might >>> be the >>> >>>>> way >>> >>>>>>> forward. I'm happy to help with that if it is useful but I would >>> >>>> guess >>> >>>>>>> Aihua or Tyler might be in a better place to start as it seems >>> they >>> >>>>> have >>> >>>>>>> done more serious thinking here. >>> >>>>>>>> >>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm >>> happy to >>> >>>>>> help >>> >>>>>>> support the effort in those communities. >>> >>>>>>>> >>> >>>>>>>> Thanks, >>> >>>>>>>> Micah >>> >>>>>>>> >>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer < >>> >>>>>>> russell.spit...@gmail.com> wrote: >>> >>>>>>>>> >>> >>>>>>>>> Thats fair @Micah, so far all the discussions have been direct >>> and >>> >>>>> off >>> >>>>>>> the dev list. Would you like to make the request on the public >>> Spark >>> >>>>> Dev >>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick >>> email >>> >>>> if >>> >>>>>> you >>> >>>>>>> don't have time. >>> >>>>>>>>> >>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield < >>> >>>>>> emkornfi...@gmail.com> >>> >>>>>>> wrote: >>> >>>>>>>>>>> >>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, >>> the >>> >>>>> main >>> >>>>>>> problem is political and not logistic. I've been asking for >>> movement >>> >>>>> from >>> >>>>>>> other relative projects for a month and we simply haven't gotten >>> >>>>>> anywhere. 
>>> >>>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>> I just wanted to double check that these issues were brought >>> >>>>> directly >>> >>>>>>> to the spark community (i.e. a discussion thread on the Spark >>> >>>> developer >>> >>>>>>> mailing list) and not via backchannels. >>> >>>>>>>>>> >>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think >>> >>>> this >>> >>>>>>> should block forking the spec, but we should make sure that the >>> >>>>> decision >>> >>>>>> is >>> >>>>>>> publicly documented within both communities. >>> >>>>>>>>>> >>> >>>>>>>>>> Thanks, >>> >>>>>>>>>> Micah >>> >>>>>>>>>> >>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer < >>> >>>>>>> russell.spit...@gmail.com> wrote: >>> >>>>>>>>>>> >>> >>>>>>>>>>> @Gang Wu >>> >>>>>>>>>>> >>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, >>> the >>> >>>>> main >>> >>>>>>> problem is political and not logistic. I've been asking for >>> movement >>> >>>>> from >>> >>>>>>> other relative projects for a month and we simply haven't gotten >>> >>>>>> anywhere. >>> >>>>>>> I don't think there is anything that would stop us from moving >>> to a >>> >>>>> joint >>> >>>>>>> project in the future and if you know of some way of encouraging >>> that >>> >>>>>>> movement from other relevant parties I would be glad to >>> collaborate >>> >>>> in >>> >>>>>>> doing that. One thing that I don't want to do is have the Iceberg >>> >>>>> project >>> >>>>>>> stay in a holding pattern without any clear roadmap as to how to >>> >>>>> proceed. >>> >>>>>>>>>>> >>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu < >>> flyrain...@gmail.com >>> >>>>> >>> >>>>>>> wrote: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> I’m on board with copying the spec into our repository. >>> >>>> However, >>> >>>>> as >>> >>>>>>> we’ve talked about, it’s not just a straightforward copy—there >>> are >>> >>>>>> already >>> >>>>>>> some divergences. Some of them are under discussion. 
>>> >>>>>>>>>>>> Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Yufei
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> Sorry for chiming in late.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant type spec change and will likely result in deviation if some changes do not reach agreement from both parties.
>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo does lose the opportunity for better native support from file formats like Parquet and ORC.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow.
In this fashion, different >>> engines, >>> >>>>> table >>> >>>>>>> formats and file formats can follow the same spec and are free to >>> >>>>> depend >>> >>>>>> on >>> >>>>>>> the reference implementations from apache/variant-type or >>> implement >>> >>>>> their >>> >>>>>>> own. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Best, >>> >>>>>>>>>>>>> Gang >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye < >>> yezhao...@gmail.com >>> >>>>> >>> >>>>>>> wrote: >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we >>> need >>> >>>> to >>> >>>>>>> own it fully as a part of the table spec, and we can build >>> >>>>> compatibility >>> >>>>>>> through tests. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> -Jack >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer < >>> >>>>>>> russell.spit...@gmail.com> wrote: >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that >>> >>>> just >>> >>>>>>> makes things more complicated and still is essentially forking >>> just >>> >>>>> with >>> >>>>>>> more steps. If we just track our annotations / modifications to >>> a >>> >>>>> single >>> >>>>>>> commit/version then we have the same issue again but now you >>> have to >>> >>>> go >>> >>>>>> to >>> >>>>>>> multiple sources to get the actual Spec. In addition, our very >>> copy >>> >>>> of >>> >>>>>> the >>> >>>>>>> Spec is going to require new types which don't exist in the Spark >>> >>>> Spec >>> >>>>>>> which necessarily means diverging. 
We will need to take up new >>> >>>>> primitive >>> >>>>>>> id's (as noted in my first email) >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is >>> >>>>> really >>> >>>>>>> going through a thorough review process from all members of the >>> Spark >>> >>>>>>> community, I believe it probably should have gone through the >>> SPIP >>> >>>> but >>> >>>>>>> instead seems to have been merged without broad community >>> >>>> involvement. >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a >>> >>>> single >>> >>>>>>> copy of the spec, in our previous discussions the vast majority >>> of >>> >>>>> Apache >>> >>>>>>> Iceberg community want it to exist here. >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks < >>> >>>>> dwe...@apache.org >>> >>>>>>> >>> >>>>>>> wrote: >>> >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant >>> type >>> >>>> to >>> >>>>>>> Iceberg, but I want to raise concerns about forking the spec. >>> >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the >>> situation >>> >>>>>>> where we end up diverging because there's little reason to work >>> with >>> >>>>> both >>> >>>>>>> communities to evolve in a way that benefits everyone. >>> >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the >>> spec >>> >>>>> and >>> >>>>>>> annotate any variance in Iceberg's handling. This would allow >>> us to >>> >>>>>>> continue without dividing the communities. >>> >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I >>> >>>> would >>> >>>>>>> support forking, but I don't feel like that should be the initial >>> >>>> step. 
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> -Dan
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Kind regards,
>>> >>>>>>>>>>>>>>>>> Fokko
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>> Manu
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>>> Peter
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the Variant support in Iceberg and hopefully we can have a consensus. To me, I also feel it makes more sense to move the spec into Iceberg rather than having the Spark engine own it while we try to keep it compatible with the Spark spec.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>>>> Aihua
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were hoping to move the Variant and Shredding specifications from Spark into Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and I believe we need to copy the specifications into our repository.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already.
The >>> >>>> Spark >>> >>>>>>> Specification already includes types which Iceberg has no >>> definition >>> >>>>> for >>> >>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which >>> is not >>> >>>>>>> included within the Spark Specification (Time) and will soon have >>> >>>> more >>> >>>>>> with >>> >>>>>>> TimestampNS, and Geo. >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>> Second, We would like to make sure that Spark is >>> not a >>> >>>>>> hard >>> >>>>>>> dependency for other engines. We are working with several >>> >>>> implementers >>> >>>>> of >>> >>>>>>> the Iceberg spec and it has previously been agreed that it would >>> be >>> >>>>> best >>> >>>>>> if >>> >>>>>>> the source of truth for Variant existed in an engine and file >>> format >>> >>>>>>> neutral location. The Iceberg project has a good open model of >>> >>>>> governance >>> >>>>>>> and, as we have seen so far discussing Variant, open and active >>> >>>>>>> collaboration. This would also help as we can strictly version >>> our >>> >>>>>> changes >>> >>>>>>> in-line with the rest of the Iceberg spec. >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>> Third, The Shredding spec is not quite finished and >>> >>>>>>> requires some group analysis and discussion before we commit it. >>> I >>> >>>>> think >>> >>>>>>> again the Iceberg community is probably the right place for this >>> to >>> >>>>>> happen >>> >>>>>>> as we have already started discussions here on these topics. >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a >>> direct >>> >>>>> copy >>> >>>>>>> of the existing specification from the Spark Project and move >>> ahead >>> >>>>> with >>> >>>>>>> our discussions and modifications within Iceberg. That said, I >>> do not >>> >>>>>> want >>> >>>>>>> to diverge if possible from the Spark proposal. 
For example, >>> although >>> >>>>> we >>> >>>>>> do >>> >>>>>>> not use the Interval types above, I think we should not reuse >>> those >>> >>>>> type >>> >>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would >>> >>>>> remain >>> >>>>>>> unused along with any other types we think are not applicable. We >>> >>>>> should >>> >>>>>>> strive whenever possible to allow for compatibility. >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this >>> proposal I >>> >>>>> am >>> >>>>>>> hoping to see if anyone in the community objects to this plan >>> going >>> >>>>>> forward >>> >>>>>>> or has a better alternative. >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager >>> to >>> >>>>> hear >>> >>>>>>> back from everyone, >>> >>>>>>>>>>>>>>>>>>>>> Russ >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>> >>> >>>>>>> >>> >>>>>> >>> >>>>> >>> >>>> >>> >>> >>> >> >>> > >>> >>