Since you mentioned Hive, I'll chime in with what we do there. You might find it useful:
- metastore module - only small differences; DynConstructors solves them for us
- mr module - some bigger differences, but still manageable for Hive 2-3. It needs some new classes, but most of the code is reused
- an extra module for Hive 3
For Hive 4 we use a different repo, as we moved to the Hive codebase.
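For anyone who hasn't seen the pattern, here is a minimal sketch of the DynConstructors idea mentioned above: at runtime, pick whichever constructor actually exists in the version on the classpath. This is only an illustration (written in Kotlin for brevity); the class name and signatures are placeholders, not real Hive or Iceberg APIs.

import java.lang.reflect.Constructor

// Return the first constructor of `className` that matches one of the given
// parameter lists, or null if the class (or every signature) is missing.
fun firstAvailableCtor(className: String, vararg signatures: List<Class<*>>): Constructor<*>? {
    val clazz = try {
        Class.forName(className)
    } catch (e: ClassNotFoundException) {
        return null  // class not present in this version
    }
    for (sig in signatures) {
        try {
            return clazz.getConstructor(*sig.toTypedArray())
        } catch (e: NoSuchMethodException) {
            // this signature does not exist in the version on the classpath; try the next one
        }
    }
    return null
}

fun main() {
    // The class name and signatures below are placeholders, not real Hive or Iceberg APIs.
    val ctor = firstAvailableCtor(
        "org.example.hive.SomeShimmedClass",
        listOf(String::class.java, java.util.Properties::class.java),  // e.g. the newer signature
        listOf(String::class.java)                                     // e.g. the older fallback
    )
    println(ctor ?: "no compatible constructor found on this classpath")
}

The same trick extends to methods and fields, which is how a single module can paper over small API differences between versions.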
My thoughts based on the above experience:
- Keeping the Hive 4 and Hive 2-3 code in sync is a pain. We constantly have problems backporting changes between the repos, and we are lagging behind, which hurts both projects.
- The Hive 2-3 model works better by forcing us to keep things in sync, but with serious differences in the Hive project it still doesn't seem like a viable option.

So I think the question is: how stable is the Spark code we are integrating with? If it is fairly stable, then we are better off with a "one repo, multiple modules" approach, and we should consider a multi-repo setup only if the differences become prohibitive.

Thanks,
Peter

On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <aokolnyc...@apple.com.invalid> wrote:

> Okay, looks like there is consensus around supporting multiple Spark versions at the same time. There are folks who mentioned this on this thread and there were folks who brought this up during the sync.
>
> Let’s think through Option 2 and 3 in more detail then.
>
> Option 2
>
> In Option 2, there will be a separate repo. I believe the master branch will soon point to Spark 3.2 (the most recent supported version). The main development will happen there and the artifact version will be 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where we will cherry-pick applicable changes. Once we are ready to release the 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for cherry-picks.
>
> I guess we will continue to shade everything in the new repo and will have to release every time the core is released. We will do a maintenance release for each supported Spark version whenever we cut a new maintenance Iceberg release or need to fix any bugs in the Spark integration. Under this model, we will probably need nightly snapshots (or on each commit) for the core format, and the Spark integration will depend on snapshots until we are ready to release.
>
> Overall, I think this option gives us very simple builds and provides the best separation. It will keep the main repo clean. The main downside is that we will have to split a Spark feature into two PRs: one against the core and one against the Spark integration. Certain changes in core can also break the Spark integration and will require adaptations.
>
> Ryan, I am not sure I fully understood the testing part. How will we be able to test the Spark integration in the main repo if certain changes in core may break the Spark integration and require changes there? Will we try to prohibit such changes?
>
> Option 3 (modified)
>
> If I understand correctly, the modified Option 3 sounds very close to the approach initially suggested by Imran, but with code duplication instead of extra refactoring and introducing new common modules.
>
> Jack, are you suggesting we test only a single Spark version at a time? Or do we expect to test all versions? Will there be any difference compared to just having a module per version? I did not fully understand.
>
> My worry with this approach is that our build will be very complicated and we will still have a lot of Spark-related modules in the main repo. Once people start using Flink and Hive more, will we have to do the same?
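To make the snapshot coupling Anton describes under Option 2 a bit more concrete, the build of a hypothetical separate Spark-integration repo might declare something like the following. This is a Gradle Kotlin DSL sketch only; the version and layout are made up, and the repository URL shown is the usual ASF snapshots repository.

// build.gradle.kts of a hypothetical iceberg-spark repository.
plugins {
    `java-library`
}

repositories {
    mavenCentral()
    // Nightly / per-commit snapshots of the core format would be published here.
    maven { url = uri("https://repository.apache.org/content/repositories/snapshots/") }
}

dependencies {
    // Track unreleased core changes between Iceberg releases; switch to the released
    // version once the corresponding core release is cut.
    implementation("org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT")
}

Each core release (or nightly publish) would then flow into the Spark repo simply by bumping or re-resolving this version.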
> > - Anton > > > > On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote: > > I'd support the option that Jack suggests if we can set a few expectations > for keeping it clean. > > First, I'd like to avoid refactoring code to share it across Spark > versions -- that introduces risk because we're relying on compiling against > one version and running in another and both Spark and Scala change rapidly. > A big benefit of options 1 and 2 is that we mostly focus on only one Spark > version. I think we should duplicate code rather than spend time > refactoring to rely on binary compatibility. I propose we start each new > Spark version by copying the last one and updating it. And we should build > just the latest supported version by default. > > The drawback to having everything in a single repo is that we wouldn't be > able to cherry-pick changes across Spark versions/branches, but I think > Jack is right that having a single build is better. > > Second, we should make CI faster by running the Spark builds in parallel. > It sounds like this is what would happen anyway, with a property that > selects the Spark version that you want to build against. > > Overall, this new suggestion sounds like a promising way forward. > > Ryan > > On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> wrote: > >> I think in Ryan's proposal we will create a ton of modules anyway, as >> Wing listed we are just using git branch as an additional dimension, but my >> understanding is that you will still have 1 core, 1 extension, 1 runtime >> artifact published for each Spark version in either approach. >> >> In that case, this is just brainstorming, I wonder if we can explore a >> modified option 3 that flattens all the versions in each Spark branch in >> option 2 into master. The repository structure would look something like: >> >> iceberg/api/... >> /bundled-guava/... >> /core/... >> ... >> /spark/2.4/core/... >> /extension/... >> /runtime/... >> /3.1/core/... >> /extension/... >> /runtime/... >> >> The gradle build script in the root is configured to build against the >> latest version of Spark by default, unless otherwise specified by the user. >> >> Intellij can also be configured to only index files of specific versions >> based on the same config used in build. >> >> In this way, I imagine the CI setup to be much easier to do things like >> testing version compatibility for a feature or running only a >> specific subset of Spark version builds based on the Spark version >> directories touched. >> >> And the biggest benefit is that we don't have the same difficulty as >> option 2 of developing a feature when it's both in core and Spark. >> >> We can then develop a mechanism to vote to stop support of certain >> versions, and archive the corresponding directory to avoid accumulating too >> many versions in the long term. >> >> -Jack Ye >> >> >> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote: >> >>> Sorry, I was thinking about CI integration between Iceberg Java and >>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to >>> leave out! >>> >>> I would definitely want to test the projects together. One thing we >>> could do is have a nightly build like Russell suggests. I'm also wondering >>> if we could have some tighter integration where the Iceberg Spark build can >>> be included in the Iceberg Java build using properties. 
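To sketch how the "latest Spark by default, overridable by a property" part of Jack's layout could work in a single repo, the root settings script might wire in only one version's Spark modules per build. This is a Gradle Kotlin DSL sketch and purely illustrative; the module names, directory layout and the sparkVersion property are assumptions, not the project's actual build.

// settings.gradle.kts (sketch). Select one Spark version per build, defaulting to the
// latest; e.g. `./gradlew build -PsparkVersion=3.1` switches the whole build to 3.1.
import java.io.File

rootProject.name = "iceberg"

// Core modules are always part of the build (list trimmed for brevity).
include(":iceberg-api", ":iceberg-core")

val sparkVersion: String = startParameter.projectProperties["sparkVersion"] ?: "3.2"

// Map spark/<version>/{core,extension,runtime} onto Gradle projects for that version only.
listOf("core", "extension", "runtime").forEach { module ->
    val path = ":iceberg-spark-$module-$sparkVersion"
    include(path)
    project(path).projectDir = File(rootDir, "spark/$sparkVersion/$module")
}

Ryan's "tighter integration" idea (continued below) could hang off a similar property, for example by conditionally calling Gradle's includeBuild on a checked-out Spark integration repo, so a scheduled CI job can still build and test both together.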
Maybe the github >>> action could checkout Iceberg, then checkout the Spark integration's latest >>> branch, and then run the gradle build with a property that makes Spark a >>> subproject in the build. That way we can continue to have Spark CI run >>> regularly. >>> >>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> I agree that Option 2 is considerably more difficult for development >>>> when core API changes need to be picked up by the external Spark module. I >>>> also think a monthly release would probably still be prohibitive to >>>> actually implementing new features that appear in the API, I would hope we >>>> have a much faster process or maybe just have snapshot artifacts published >>>> nightly? >>>> >>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID> >>>> wrote: >>>> >>>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate >>>> repo (subproject of Iceberg). Would we have branches such as 0.13-2.4, >>>> 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all >>>> versions or all Spark 3 versions, then we would need to commit the changes >>>> to all applicable branches. Basically we are trading more work to commit to >>>> multiple branches for simplified build and CI time per branch, which might >>>> be an acceptable trade-off. However, the biggest downside is that changes >>>> may need to be made in core Iceberg as well as in the engine (in this case >>>> Spark) support, and we need to wait for a release of core Iceberg to >>>> consume the changes in the subproject. In this case, maybe we should have a >>>> monthly release of core Iceberg (no matter how many changes go in, as long >>>> as it is non-zero) so that the subproject can consume changes fairly >>>> quickly? >>>> >>>> >>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote: >>>> >>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of >>>>> potential solutions well defined. >>>>> >>>>> Looks like the next step is to decide whether we want to require >>>>> people to update Spark versions to pick up newer versions of Iceberg. If >>>>> we >>>>> choose to make people upgrade, then option 1 is clearly the best choice. >>>>> >>>>> I don’t think that we should make updating Spark a requirement. Many >>>>> of the things that we’re working on are orthogonal to Spark versions, like >>>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC >>>>> delete files, new storage implementations, etc. Upgrading Spark is time >>>>> consuming and untrusted in my experience, so I think we would be setting >>>>> up >>>>> an unnecessary trade-off between spending lots of time to upgrade Spark >>>>> and >>>>> picking up new Iceberg features. >>>>> >>>>> Another way of thinking about this is that if we went with option 1, >>>>> then we could port bug fixes into 0.12.x. But there are many things that >>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So >>>>> some people in the community would have to maintain branches of newer >>>>> Iceberg versions with older versions of Spark outside of the main Iceberg >>>>> project — that defeats the purpose of simplifying things with option 1 >>>>> because we would then have more people maintaining the same 0.13.x with >>>>> Spark 3.1 branch. 
(This reminds me of the Spark community, where we wanted >>>>> to release a 2.5 line with DSv2 backported, but the community decided not >>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.) >>>>> >>>>> If the community is going to do the work anyway — and I think some of >>>>> us would — we should make it possible to share that work. That’s why I >>>>> don’t think that we should go with option 1. >>>>> >>>>> If we don’t go with option 1, then the choice is how to maintain >>>>> multiple Spark versions. I think that the way we’re doing it right now is >>>>> not something we want to continue. >>>>> >>>>> Using multiple modules (option 3) is concerning to me because of the >>>>> changes in Spark. We currently structure the library to share as much code >>>>> as possible. But that means compiling against different Spark versions and >>>>> relying on binary compatibility and reflection in some cases. To me, this >>>>> seems unmaintainable in the long run because it requires refactoring >>>>> common >>>>> classes and spending a lot of time deduplicating code. It also creates a >>>>> ton of modules, at least one common module, then a module per version, >>>>> then >>>>> an extensions module per version, and finally a runtime module per >>>>> version. >>>>> That’s 3 modules per Spark version, plus any new common modules. And each >>>>> module needs to be tested, which is making our CI take a really long time. >>>>> We also don’t support multiple Scala versions, which is another gap that >>>>> will require even more modules and tests. >>>>> >>>>> I like option 2 because it would allow us to compile against a single >>>>> version of Spark (which will be much more reliable). It would give us an >>>>> opportunity to support different Scala versions. It avoids the need to >>>>> refactor to share code and allows people to focus on a single version of >>>>> Spark, while also creating a way for people to maintain and update the >>>>> older versions with newer Iceberg releases. I don’t think that this would >>>>> slow down development. I think it would actually speed it up because we’d >>>>> be spending less time trying to make multiple versions work in the same >>>>> build. And anyone in favor of option 1 would basically get option 1: you >>>>> don’t have to care about branches for older Spark versions. >>>>> >>>>> Jack makes a good point about wanting to keep code in a single >>>>> repository, but I think that the need to manage more version combinations >>>>> overrides this concern. It’s easier to make this decision in python >>>>> because >>>>> we’re not trying to depend on two projects that change relatively quickly. >>>>> We’re just trying to build a library. >>>>> >>>>> Ryan >>>>> >>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote: >>>>> >>>>>> Thanks for bringing this up, Anton. >>>>>> >>>>>> Everyone has great pros/cons to support their preferences. Before >>>>>> giving my preference, let me raise one question: what's the top >>>>>> priority >>>>>> thing for apache iceberg project at this point in time ? This question >>>>>> will help us to answer the following question: Should we support more >>>>>> engine versions more robustly or be a bit more aggressive and concentrate >>>>>> on getting the new features that users need most in order to keep the >>>>>> project more competitive ? 
>>>>>> >>>>>> If people watch the apache iceberg project and check the issues & >>>>>> PR frequently, I guess more than 90% people will answer the priority >>>>>> question: There is no doubt for making the whole v2 story to be >>>>>> production-ready. The current roadmap discussion also proofs the thing >>>>>> : >>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E >>>>>> . >>>>>> >>>>>> In order to ensure the highest priority at this point in time, I will >>>>>> prefer option-1 to reduce the cost of engine maintenance, so as to free >>>>>> up >>>>>> resources to make v2 production-ready. >>>>>> >>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> From Dev's point, it has less burden to always support the latest >>>>>>> version of Spark (for example). But from user's point, especially for us >>>>>>> who maintain Spark internally, it is not easy to upgrade the Spark >>>>>>> version >>>>>>> for the first time (since we have many customizations internally), and >>>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the >>>>>>> support of old version of Spark3, users have to maintain it themselves >>>>>>> unavoidably. >>>>>>> >>>>>>> So I'm inclined to make this support in community, not by users >>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve >>>>>>> the >>>>>>> burden, we could support limited versions of Spark (for example 2 >>>>>>> versions). >>>>>>> >>>>>>> Just my two cents. >>>>>>> >>>>>>> -Saisai >>>>>>> >>>>>>> >>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道: >>>>>>> >>>>>>>> Hi Wing Yew, >>>>>>>> >>>>>>>> I think 2.4 is a different story, we will continue to support Spark >>>>>>>> 2.4, but as you can see it will continue to have very limited >>>>>>>> functionalities comparing to Spark 3. I believe we discussed about >>>>>>>> option 3 >>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the >>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a >>>>>>>> consistent strategy around this, let's take this chance to make a good >>>>>>>> community guideline for all future engine versions, especially for >>>>>>>> Spark, >>>>>>>> Flink and Hive that are in the same repository. >>>>>>>> >>>>>>>> I can totally understand your point of view Wing, in fact, speaking >>>>>>>> from the perspective of AWS EMR, we have to support over 40 versions >>>>>>>> of the >>>>>>>> software because there are people who are still using Spark 1.4, >>>>>>>> believe it >>>>>>>> or not. After all, keep backporting changes will become a liability not >>>>>>>> only on the user side, but also on the service provider side, so I >>>>>>>> believe >>>>>>>> it's not a bad practice to push for user upgrade, as it will make the >>>>>>>> life >>>>>>>> of both parties easier in the end. New feature is definitely one of the >>>>>>>> best incentives to promote an upgrade on user side. >>>>>>>> >>>>>>>> I think the biggest issue of option 3 is about its scalability, >>>>>>>> because we will have an unbounded list of packages to add and compile >>>>>>>> in >>>>>>>> the future, and we probably cannot drop support of that package once >>>>>>>> created. If we go with option 1, I think we can still publish a few >>>>>>>> patch >>>>>>>> versions for old Iceberg releases, and committers can control the >>>>>>>> amount of >>>>>>>> patch versions to guard people from abusing the power of patching. 
I >>>>>>>> see >>>>>>>> this as a consistent strategy also for Flink and Hive. With this >>>>>>>> strategy, >>>>>>>> we can truly have a compatibility matrix for engine versions against >>>>>>>> Iceberg versions. >>>>>>>> >>>>>>>> -Jack >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon < >>>>>>>> wyp...@cloudera.com.invalid> wrote: >>>>>>>> >>>>>>>>> I understand and sympathize with the desire to use new DSv2 >>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for >>>>>>>>> developers, >>>>>>>>> but I don't think it considers the interests of users. I do not think >>>>>>>>> that >>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is >>>>>>>>> a >>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think >>>>>>>>> we all >>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from >>>>>>>>> 3.0 to >>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running >>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop >>>>>>>>> supporting >>>>>>>>> Spark 2.4? >>>>>>>>> >>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken >>>>>>>>> out in favor of Option 1 all work for the same organization, don't >>>>>>>>> they? >>>>>>>>> And they don't have a problem with making their users, all internal, >>>>>>>>> simply >>>>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an >>>>>>>>> internal >>>>>>>>> fork that is close to 3.2.) >>>>>>>>> >>>>>>>>> I work for an organization with customers running different >>>>>>>>> versions of Spark. It is true that we can backport new features to >>>>>>>>> older >>>>>>>>> versions if we wanted to. I suppose the people contributing to >>>>>>>>> Iceberg work >>>>>>>>> for some organization or other that either use Iceberg in-house, or >>>>>>>>> provide >>>>>>>>> software (possibly in the form of a service) to customers, and either >>>>>>>>> way, >>>>>>>>> the organizations have the ability to backport features and fixes to >>>>>>>>> internal versions. Are there any users out there who simply use Apache >>>>>>>>> Iceberg and depend on the community version? >>>>>>>>> >>>>>>>>> There may be features that are broadly useful that do not depend >>>>>>>>> on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even >>>>>>>>> 2.4)? >>>>>>>>> >>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I >>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; >>>>>>>>> what >>>>>>>>> are the modules you're thinking of? >>>>>>>>> >>>>>>>>> - Wing Yew >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Option 1 sounds good to me. Here are my reasons: >>>>>>>>>> >>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the >>>>>>>>>> limited resources in the open source community, the upsides of >>>>>>>>>> option 2 and >>>>>>>>>> 3 are probably not worthy. >>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to >>>>>>>>>> predict anything, but even if these use cases are legit, users can >>>>>>>>>> still >>>>>>>>>> get the new feature by backporting it to an older version in case of >>>>>>>>>> upgrading to a newer version isn't an option. 
>>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> >>>>>>>>>> Yufei >>>>>>>>>> >>>>>>>>>> `This is not a contribution` >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi < >>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>> >>>>>>>>>>> To sum up what we have so far: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)* >>>>>>>>>>> >>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to >>>>>>>>>>> the most recent minor Spark version to consume any new Iceberg >>>>>>>>>>> features. >>>>>>>>>>> >>>>>>>>>>> *Option 2 (a separate project under Iceberg)* >>>>>>>>>>> >>>>>>>>>>> Can support as many Spark versions as needed and the codebase is >>>>>>>>>>> still separate as we can use separate branches. >>>>>>>>>>> Impossible to consume any unreleased changes in core, may slow >>>>>>>>>>> down the development. >>>>>>>>>>> >>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)* >>>>>>>>>>> >>>>>>>>>>> Introduce more modules in the same project. >>>>>>>>>>> Can consume unreleased changes but it will required at least 5 >>>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing >>>>>>>>>>> complicated. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Are there any users for whom upgrading the minor Spark version >>>>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker? >>>>>>>>>>> We follow Option 1 internally at the moment but I would like to >>>>>>>>>>> hear what other people think/need. >>>>>>>>>>> >>>>>>>>>>> - Anton >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer < >>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>> I think we should go for option 1. I already am not a big fan of >>>>>>>>>>> having runtime errors for unsupported things based on versions and >>>>>>>>>>> I don't >>>>>>>>>>> think minor version upgrades are a large issue for users. I'm >>>>>>>>>>> especially >>>>>>>>>>> not looking forward to supporting interfaces that only exist in >>>>>>>>>>> Spark 3.2 >>>>>>>>>>> in a multiple Spark version support future. >>>>>>>>>>> >>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi < >>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote: >>>>>>>>>>> >>>>>>>>>>> First of all, is option 2 a viable option? We discussed >>>>>>>>>>> separating the python module outside of the project a few weeks >>>>>>>>>>> ago, and >>>>>>>>>>> decided to not do that because it's beneficial for code cross >>>>>>>>>>> reference and >>>>>>>>>>> more intuitive for new developers to see everything in the same >>>>>>>>>>> repository. >>>>>>>>>>> I would expect the same argument to also hold here. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment. >>>>>>>>>>> >>>>>>>>>>> Overall I would personally prefer us to not support all the >>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions >>>>>>>>>>> in a >>>>>>>>>>> major version. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> This is when it gets a bit complicated. If we want to support >>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have >>>>>>>>>>> to >>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is >>>>>>>>>>> being >>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On >>>>>>>>>>> top of >>>>>>>>>>> that, we have our extensions that are extremely low-level and may >>>>>>>>>>> break not >>>>>>>>>>> only between minor versions but also between patch releases. 
>>>>>>>>>>> >>>>>>>>>>> f there are some features requiring a newer version, it makes >>>>>>>>>>> sense to move that newer version in master. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Internally, we don’t deliver new features to older Spark >>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, >>>>>>>>>>> I don’t >>>>>>>>>>> think it is too bad to require users to upgrade if they want new >>>>>>>>>>> features. >>>>>>>>>>> At the same time, there are valid concerns with this approach too >>>>>>>>>>> that we >>>>>>>>>>> mentioned during the sync. For example, certain new features would >>>>>>>>>>> also >>>>>>>>>>> work fine with older Spark versions. I generally agree with that >>>>>>>>>>> and that >>>>>>>>>>> not supporting recent versions is not ideal. However, I want to >>>>>>>>>>> find a >>>>>>>>>>> balance between the complexity on our side and ease of use for the >>>>>>>>>>> users. >>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but >>>>>>>>>>> our Spark >>>>>>>>>>> integration is too low-level to do that with a single module. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>> First of all, is option 2 a viable option? We discussed >>>>>>>>>>> separating the python module outside of the project a few weeks >>>>>>>>>>> ago, and >>>>>>>>>>> decided to not do that because it's beneficial for code cross >>>>>>>>>>> reference and >>>>>>>>>>> more intuitive for new developers to see everything in the same >>>>>>>>>>> repository. >>>>>>>>>>> I would expect the same argument to also hold here. >>>>>>>>>>> >>>>>>>>>>> Overall I would personally prefer us to not support all the >>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions >>>>>>>>>>> in a >>>>>>>>>>> major version. This avoids the problem that some users are >>>>>>>>>>> unwilling to >>>>>>>>>>> move to a newer version and keep patching old Spark version >>>>>>>>>>> branches. If >>>>>>>>>>> there are some features requiring a newer version, it makes sense >>>>>>>>>>> to move >>>>>>>>>>> that newer version in master. >>>>>>>>>>> >>>>>>>>>>> In addition, because currently Spark is considered the most >>>>>>>>>>> feature-complete reference implementation compared to all other >>>>>>>>>>> engines, I >>>>>>>>>>> think we should not add artificial barriers that would slow down its >>>>>>>>>>> development speed. >>>>>>>>>>> >>>>>>>>>>> So my thinking is closer to option 1. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Jack Ye >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi < >>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hey folks, >>>>>>>>>>>> >>>>>>>>>>>> I want to discuss our Spark version support strategy. >>>>>>>>>>>> >>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great >>>>>>>>>>>> to support older versions but because we compile against 3.0, we >>>>>>>>>>>> cannot use >>>>>>>>>>>> any Spark features that are offered in newer versions. >>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of >>>>>>>>>>>> important features such dynamic filtering for v2 tables, required >>>>>>>>>>>> distribution and ordering for writes, etc. These features are too >>>>>>>>>>>> important >>>>>>>>>>>> to ignore them. >>>>>>>>>>>> >>>>>>>>>>>> Apart from that, I have an end-to-end prototype for >>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 >>>>>>>>>>>> features. 
>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and >>>>>>>>>>>> would >>>>>>>>>>>> love to share that with the rest of the community. >>>>>>>>>>>> >>>>>>>>>>>> I see two options to move forward: >>>>>>>>>>>> >>>>>>>>>>>> Option 1 >>>>>>>>>>>> >>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by >>>>>>>>>>>> releasing minor versions with bug fixes. >>>>>>>>>>>> >>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra >>>>>>>>>>>> work on our side as just a single Spark version is actively >>>>>>>>>>>> maintained. >>>>>>>>>>>> Cons: some new features that we will be adding to master could >>>>>>>>>>>> also work with older Spark versions but all 0.12 releases will >>>>>>>>>>>> only contain >>>>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 >>>>>>>>>>>> to >>>>>>>>>>>> consume any new Spark or format features. >>>>>>>>>>>> >>>>>>>>>>>> Option 2 >>>>>>>>>>>> >>>>>>>>>>>> Move our Spark integration into a separate project and >>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2. >>>>>>>>>>>> >>>>>>>>>>>> Pros: decouples the format version from Spark, we can support >>>>>>>>>>>> as many Spark versions as needed. >>>>>>>>>>>> Cons: more work initially to set everything up, more work to >>>>>>>>>>>> release, will need a new release of the core format to consume any >>>>>>>>>>>> changes >>>>>>>>>>>> in the Spark integration. >>>>>>>>>>>> >>>>>>>>>>>> Overall, I think option 2 seems better for the user but my main >>>>>>>>>>>> worry is that we will have to release the format more frequently >>>>>>>>>>>> (which is >>>>>>>>>>>> a good thing but requires more work and time) and the overall Spark >>>>>>>>>>>> development may be slower. >>>>>>>>>>>> >>>>>>>>>>>> I’d love to hear what everybody thinks about this matter. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Anton >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Tabular >>>>> >>>> >>>> >>> >>> -- >>> Ryan Blue >>> Tabular >>> >> > > -- > Ryan Blue > Tabular > > >