Hi OpenInx, I'm sorry I misunderstood the thinking of the Flink community. Thanks for the clarification. - Wing Yew
On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote: > Hi Wing > > As we discussed above, we as a community prefer to choose option 2 or > option 3. So in fact, as we plan to upgrade the Flink version from > 1.12 to 1.13, we are doing our best to guarantee that the master Iceberg repo > works fine for both Flink 1.12 & Flink 1.13. For more context, please see > [1], [2], [3] > > [1] https://github.com/apache/iceberg/pull/3116 > [2] https://github.com/apache/iceberg/issues/3183 > [3] > https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E > > > On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wyp...@cloudera.com.invalid> > wrote: > >> In the last community sync, we spent a little time on this topic. For >> Spark support, there are currently two options under consideration: >> >> Option 2: Separate repo for the Spark support. Use branches for >> supporting different Spark versions. Main branch for the latest Spark >> version (3.2 to begin with). >> Tooling needs to be built for producing regular snapshots of core Iceberg >> in a consumable way for this repo. Unclear if commits to core Iceberg will >> be tested pre-commit against Spark support; my impression is that they will >> not be, and the Spark support build can be broken by changes to core. >> >> A variant of option 3 (which we will simply call Option 3 going forward): >> Single repo, separate module (subdirectory) for each Spark version to be >> supported. Code duplication in each Spark module (no attempt to refactor >> out common code). Each module built against the specific version of Spark >> to be supported, producing a runtime jar built against that version. CI >> will test all modules. Support can be provided for only building the >> modules a developer cares about. >> >> More input was sought and people are encouraged to voice their preference. >> I lean towards Option 3. >> >> - Wing Yew >> >> ps. In the sync, as Steven Wu wrote, the question was raised whether the same >> multi-version support strategy can be adopted across engines. Based on what >> Steven wrote, currently the Flink developer community's bandwidth makes >> supporting only a single Flink version (and focusing resources on >> developing new features on that version) the preferred choice. If so, then >> no multi-version support strategy for Flink is needed at this time. >> >> >> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com> wrote: >> >>> During the sync meeting, people talked about if and how we can have the >>> same version support model across engines like Flink and Spark. I can >>> provide some input from the Flink side. >>> >>> Flink only supports two minor versions. E.g., right now Flink 1.13 is >>> the latest released version. That means only Flink 1.12 and 1.13 are >>> supported. Feature changes or bug fixes will only be backported to 1.12 and >>> 1.13, unless it is a serious bug (like security). With that context, >>> personally I like option 1 (with one actively supported Flink version in >>> master branch) for the iceberg-flink module. >>> >>> We discussed the idea of supporting multiple Flink versions via a shim >>> layer and multiple modules. While it may be a little better to support >>> multiple Flink versions, I don't know if there is enough support and >>> resources from the community to pull it off. There is also the ongoing maintenance >>> burden for each minor version release from Flink, which happens roughly >>> every 4 months.
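As a concrete illustration of the Option 3 variant summarized above (one module per Spark version, each compiled against its own pinned Spark dependency and producing its own runtime jar), a per-version module's build file could look roughly like the sketch below. This is only a sketch in the Gradle Kotlin DSL; the module path, project names, and dependency versions are illustrative assumptions, not the actual Iceberg build.

```kotlin
// spark/3.2/build.gradle.kts -- hypothetical per-version module (sketch)
plugins {
    `java-library`
}

repositories {
    mavenCentral()
}

dependencies {
    // Assumed project names for the core modules living in the same repo.
    implementation(project(":iceberg-api"))
    implementation(project(":iceberg-core"))

    // Each Spark module is compiled against exactly one Spark release,
    // so the runtime artifact produced here targets only that version.
    compileOnly("org.apache.spark:spark-sql_2.12:3.2.0")

    testImplementation("org.apache.spark:spark-sql_2.12:3.2.0")
    testImplementation("junit:junit:4.13.2")
}
```

A Spark 3.1 module would be a copy of this with the Spark coordinate changed to 3.1.2, accepting the code duplication described above instead of sharing refactored common code.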
>>> >>> >>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid> >>> wrote: >>> >>>> Since you mentioned Hive, I'll chime in with what we do there. You might >>>> find it useful: >>>> - metastore module - only small differences - DynConstructor solves it for >>>> us >>>> - mr module - some bigger differences, but still manageable for Hive >>>> 2-3. Need some new classes, but most of the code is reused - extra module >>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive >>>> codebase. >>>> >>>> My thoughts based on the above experience: >>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly >>>> have problems with backporting changes between repos and we are lagging >>>> behind, which hurts both projects. >>>> - The Hive 2-3 model is working better by forcing us to keep things in >>>> sync, but with serious differences in the Hive project it still doesn't >>>> seem like a viable option. >>>> >>>> So I think the question is: how stable is the Spark code we are >>>> integrating with? If it is fairly stable then we are better off with a "one >>>> repo, multiple modules" approach and we should consider the multirepo only >>>> if the differences become prohibitive. >>>> >>>> Thanks, Peter >>>> >>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, >>>> <aokolnyc...@apple.com.invalid> wrote: >>>> >>>>> Okay, looks like there is consensus around supporting multiple Spark >>>>> versions at the same time. There are folks who mentioned this on this >>>>> thread and there were folks who brought this up during the sync. >>>>> >>>>> Let’s think through Option 2 and 3 in more detail then. >>>>> >>>>> Option 2 >>>>> >>>>> In Option 2, there will be a separate repo. I believe the master >>>>> branch will soon point to Spark 3.2 (the most recent supported version). >>>>> The main development will happen there and the artifact version will be >>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 >>>>> branches where we will cherry-pick applicable changes. Once we are ready >>>>> to >>>>> release the 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3 >>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the >>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and >>>>> 0.2.x-spark-3.1 >>>>> branches for cherry-picks. >>>>> >>>>> I guess we will continue to shade everything in the new repo and will >>>>> have to release every time the core is released. We will do a maintenance >>>>> release for each supported Spark version whenever we cut a new >>>>> maintenance Iceberg >>>>> release or need to fix any bugs in the Spark integration. >>>>> Under this model, we will probably need nightly snapshots (or on each >>>>> commit) for the core format and the Spark integration will depend on >>>>> snapshots until we are ready to release. >>>>> >>>>> Overall, I think this option gives us very simple builds and provides >>>>> the best separation. It will keep the main repo clean. The main downside is >>>>> that we will have to split a Spark feature into two PRs: one against the >>>>> core and one against the Spark integration. Certain changes in core can >>>>> also break the Spark integration and will require adaptations. >>>>> >>>>> Ryan, I am not sure I fully understood the testing part. How will we >>>>> be able to test the Spark integration in the main repo if certain changes >>>>> in core may break the Spark integration and require changes there? Will we >>>>> try to prohibit such changes?
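For the snapshot piece of Option 2, the separate Spark repository would pull the unreleased core format from a snapshot repository between Iceberg releases. Below is a minimal sketch in the Gradle Kotlin DSL, assuming hypothetical artifact coordinates, a 0.13.0-SNAPSHOT version, and the ASF snapshots repository as the publishing target; none of this is settled in the thread.

```kotlin
// build.gradle.kts of the hypothetical standalone Spark integration repo (sketch)
plugins {
    `java-library`
}

repositories {
    mavenCentral()
    // Nightly (or per-commit) snapshots of the core format would be published here.
    maven("https://repository.apache.org/content/repositories/snapshots")
}

dependencies {
    // Track unreleased core changes until the next Iceberg format release.
    api("org.apache.iceberg:iceberg-api:0.13.0-SNAPSHOT")
    implementation("org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT")

    // The branch of this repo determines which Spark version is pinned here.
    compileOnly("org.apache.spark:spark-sql_2.12:3.2.0")
}
```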
>>>>> >>>>> Option 3 (modified) >>>>> >>>>> If I understand correctly, the modified Option 3 sounds very close to >>>>> the initially suggested approach by Imran but with code duplication >>>>> instead >>>>> of extra refactoring and introducing new common modules. >>>>> >>>>> Jack, are you suggesting we test only a single Spark version at a >>>>> time? Or do we expect to test all versions? Will there be any difference >>>>> compared to just having a module per version? I did not fully >>>>> understand. >>>>> >>>>> My worry with this approach is that our build will be very complicated >>>>> and we will still have a lot of Spark-related modules in the main repo. >>>>> Once people start using Flink and Hive more, will we have to do the same? >>>>> >>>>> - Anton >>>>> >>>>> >>>>> >>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote: >>>>> >>>>> I'd support the option that Jack suggests if we can set a few >>>>> expectations for keeping it clean. >>>>> >>>>> First, I'd like to avoid refactoring code to share it across Spark >>>>> versions -- that introduces risk because we're relying on compiling >>>>> against >>>>> one version and running in another, and both Spark and Scala change >>>>> rapidly. >>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark >>>>> version. I think we should duplicate code rather than spend time >>>>> refactoring to rely on binary compatibility. I propose we start each new >>>>> Spark version by copying the last one and updating it. And we should build >>>>> just the latest supported version by default. >>>>> >>>>> The drawback to having everything in a single repo is that we wouldn't >>>>> be able to cherry-pick changes across Spark versions/branches, but I think >>>>> Jack is right that having a single build is better. >>>>> >>>>> Second, we should make CI faster by running the Spark builds in >>>>> parallel. It sounds like this is what would happen anyway, with a property >>>>> that selects the Spark version that you want to build against. >>>>> >>>>> Overall, this new suggestion sounds like a promising way forward. >>>>> >>>>> Ryan >>>>> >>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>> >>>>>> I think in Ryan's proposal we will create a ton of modules anyway; as >>>>>> Wing listed, we are just using git branches as an additional dimension, but >>>>>> my >>>>>> understanding is that you will still have 1 core, 1 extension, 1 runtime >>>>>> artifact published for each Spark version in either approach. >>>>>> >>>>>> In that case (this is just brainstorming), I wonder if we can explore >>>>>> a modified option 3 that flattens all the versions in each Spark branch >>>>>> in >>>>>> option 2 into master. The repository structure would look something like: >>>>>> >>>>>> iceberg/api/... >>>>>> /bundled-guava/... >>>>>> /core/... >>>>>> ... >>>>>> /spark/2.4/core/... >>>>>> /extension/... >>>>>> /runtime/... >>>>>> /3.1/core/... >>>>>> /extension/... >>>>>> /runtime/... >>>>>> >>>>>> The gradle build script in the root is configured to build against >>>>>> the latest version of Spark by default, unless otherwise specified by the >>>>>> user. >>>>>> >>>>>> Intellij can also be configured to only index files of specific >>>>>> versions based on the same config used in the build.
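A property-driven settings file is one way the root Gradle build could default to the latest Spark version while letting a developer opt into others. Below is a minimal sketch in the Gradle Kotlin DSL against the directory layout above; the `sparkVersions` property name and the derived project names are assumptions for illustration, not the actual Iceberg build.

```kotlin
// settings.gradle.kts (sketch)
rootProject.name = "iceberg"

// Core modules are always part of the build.
include(":iceberg-api", ":iceberg-core")

// Build only the latest Spark version by default; override with e.g.
//   ./gradlew build -DsparkVersions=2.4,3.1,3.2
val sparkVersions = (System.getProperty("sparkVersions") ?: "3.2").split(",")

for (version in sparkVersions) {
    for (module in listOf("core", "extension", "runtime")) {
        val name = ":iceberg-spark-$version-$module"
        include(name)
        // Map the project onto the flattened spark/<version>/<module> directory.
        project(name).projectDir = rootDir.resolve("spark/$version/$module")
    }
}
```

Omitting the property builds only the Spark 3.2 modules, matching the "latest by default" behavior described above.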
>>>>>> >>>>>> In this way, I imagine the CI setup to be much easier to do things >>>>>> like testing version compatibility for a feature or running only a >>>>>> specific subset of Spark version builds based on the Spark version >>>>>> directories touched. >>>>>> >>>>>> And the biggest benefit is that we don't have the same difficulty as >>>>>> option 2 of developing a feature when it's both in core and Spark. >>>>>> >>>>>> We can then develop a mechanism to vote to stop support of certain >>>>>> versions, and archive the corresponding directory to avoid accumulating >>>>>> too >>>>>> many versions in the long term. >>>>>> >>>>>> -Jack Ye >>>>>> >>>>>> >>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote: >>>>>> >>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and >>>>>>> Iceberg Spark, I just didn't mention it and I see how that's a big >>>>>>> thing to >>>>>>> leave out! >>>>>>> >>>>>>> I would definitely want to test the projects together. One thing we >>>>>>> could do is have a nightly build like Russell suggests. I'm also >>>>>>> wondering >>>>>>> if we could have some tighter integration where the Iceberg Spark build >>>>>>> can >>>>>>> be included in the Iceberg Java build using properties. Maybe the github >>>>>>> action could checkout Iceberg, then checkout the Spark integration's >>>>>>> latest >>>>>>> branch, and then run the gradle build with a property that makes Spark a >>>>>>> subproject in the build. That way we can continue to have Spark CI run >>>>>>> regularly. >>>>>>> >>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer < >>>>>>> russell.spit...@gmail.com> wrote: >>>>>>> >>>>>>>> I agree that Option 2 is considerably more difficult for >>>>>>>> development when core API changes need to be picked up by the external >>>>>>>> Spark module. I also think a monthly release would probably still be >>>>>>>> prohibitive to actually implementing new features that appear in the >>>>>>>> API, I >>>>>>>> would hope we have a much faster process or maybe just have snapshot >>>>>>>> artifacts published nightly? >>>>>>>> >>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon < >>>>>>>> wyp...@cloudera.com.INVALID> wrote: >>>>>>>> >>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a >>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as >>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be >>>>>>>> supported in all versions or all Spark 3 versions, then we would need >>>>>>>> to >>>>>>>> commit the changes to all applicable branches. Basically we are trading >>>>>>>> more work to commit to multiple branches for simplified build and CI >>>>>>>> time per branch, which might be an acceptable trade-off. However, the >>>>>>>> biggest downside is that changes may need to be made in core Iceberg as >>>>>>>> well as in the engine (in this case Spark) support, and we need to >>>>>>>> wait for >>>>>>>> a release of core Iceberg to consume the changes in the subproject. In >>>>>>>> this >>>>>>>> case, maybe we should have a monthly release of core Iceberg (no >>>>>>>> matter how >>>>>>>> many changes go in, as long as it is non-zero) so that the subproject >>>>>>>> can >>>>>>>> consume changes fairly quickly? >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>>> >>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set >>>>>>>>> of potential solutions well defined. 
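Ryan's idea above of tying the two builds together with a property for CI maps naturally onto a Gradle composite build. The sketch below (Kotlin DSL) wires it from the Spark integration side, where dependency substitution is straightforward: if a property points at a local checkout of core, binary dependencies on the core coordinates resolve to the checked-out projects. The property name and relative path are assumptions, and this is only one possible wiring, not what the thread settled on.

```kotlin
// settings.gradle.kts of the hypothetical standalone Spark integration repo (sketch)
rootProject.name = "iceberg-spark"

// CI (or a developer) checks out core next to this repo and runs
//   ./gradlew build -DicebergCoreDir=../iceberg
// to test the Spark integration against unreleased core changes.
val coreDir: String? = System.getProperty("icebergCoreDir")
if (coreDir != null) {
    // Composite build: dependencies on the included build's published
    // group:artifact coordinates are substituted with its local projects.
    includeBuild(coreDir)
}
```

Without the property, the build falls back to whatever released or snapshot core artifacts it declares, so the regular standalone workflow is unchanged.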
>>>>>>>>> >>>>>>>>> Looks like the next step is to decide whether we want to require >>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. >>>>>>>>> If we >>>>>>>>> choose to make people upgrade, then option 1 is clearly the best >>>>>>>>> choice. >>>>>>>>> >>>>>>>>> I don’t think that we should make updating Spark a requirement. >>>>>>>>> Many of the things that we’re working on are orthogonal to Spark >>>>>>>>> versions, >>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, >>>>>>>>> views, ORC >>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is >>>>>>>>> time >>>>>>>>> consuming and untrusted in my experience, so I think we would be >>>>>>>>> setting up >>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade >>>>>>>>> Spark and >>>>>>>>> picking up new Iceberg features. >>>>>>>>> >>>>>>>>> Another way of thinking about this is that if we went with option >>>>>>>>> 1, then we could port bug fixes into 0.12.x. But there are many >>>>>>>>> things that >>>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for >>>>>>>>> ADLS. So >>>>>>>>> some people in the community would have to maintain branches of newer >>>>>>>>> Iceberg versions with older versions of Spark outside of the main >>>>>>>>> Iceberg >>>>>>>>> project — that defeats the purpose of simplifying things with option 1 >>>>>>>>> because we would then have more people maintaining the same 0.13.x >>>>>>>>> with >>>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we >>>>>>>>> wanted >>>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided >>>>>>>>> not >>>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, >>>>>>>>> etc.) >>>>>>>>> >>>>>>>>> If the community is going to do the work anyway — and I think some >>>>>>>>> of us would — we should make it possible to share that work. That’s >>>>>>>>> why I >>>>>>>>> don’t think that we should go with option 1. >>>>>>>>> >>>>>>>>> If we don’t go with option 1, then the choice is how to maintain >>>>>>>>> multiple Spark versions. I think that the way we’re doing it right >>>>>>>>> now is >>>>>>>>> not something we want to continue. >>>>>>>>> >>>>>>>>> Using multiple modules (option 3) is concerning to me because of >>>>>>>>> the changes in Spark. We currently structure the library to share as >>>>>>>>> much >>>>>>>>> code as possible. But that means compiling against different Spark >>>>>>>>> versions >>>>>>>>> and relying on binary compatibility and reflection in some cases. To >>>>>>>>> me, >>>>>>>>> this seems unmaintainable in the long run because it requires >>>>>>>>> refactoring >>>>>>>>> common classes and spending a lot of time deduplicating code. It also >>>>>>>>> creates a ton of modules, at least one common module, then a module >>>>>>>>> per >>>>>>>>> version, then an extensions module per version, and finally a runtime >>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new >>>>>>>>> common >>>>>>>>> modules. And each module needs to be tested, which is making our CI >>>>>>>>> take a >>>>>>>>> really long time. We also don’t support multiple Scala versions, >>>>>>>>> which is >>>>>>>>> another gap that will require even more modules and tests. >>>>>>>>> >>>>>>>>> I like option 2 because it would allow us to compile against a >>>>>>>>> single version of Spark (which will be much more reliable). It would >>>>>>>>> give >>>>>>>>> us an opportunity to support different Scala versions. 
It avoids the >>>>>>>>> need >>>>>>>>> to refactor to share code and allows people to focus on a single >>>>>>>>> version of >>>>>>>>> Spark, while also creating a way for people to maintain and update the >>>>>>>>> older versions with newer Iceberg releases. I don’t think that this >>>>>>>>> would >>>>>>>>> slow down development. I think it would actually speed it up because >>>>>>>>> we’d >>>>>>>>> be spending less time trying to make multiple versions work in the >>>>>>>>> same >>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: >>>>>>>>> you >>>>>>>>> don’t have to care about branches for older Spark versions. >>>>>>>>> >>>>>>>>> Jack makes a good point about wanting to keep code in a single >>>>>>>>> repository, but I think that the need to manage more version >>>>>>>>> combinations >>>>>>>>> overrides this concern. It’s easier to make this decision in python >>>>>>>>> because >>>>>>>>> we’re not trying to depend on two projects that change relatively >>>>>>>>> quickly. >>>>>>>>> We’re just trying to build a library. >>>>>>>>> >>>>>>>>> Ryan >>>>>>>>> >>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Thanks for bringing this up, Anton. >>>>>>>>>> >>>>>>>>>> Everyone has raised great pros/cons to support their preferences. >>>>>>>>>> Before giving my preference, let me raise one question: what's >>>>>>>>>> the top >>>>>>>>>> priority for the Apache Iceberg project at this point in time? >>>>>>>>>> This >>>>>>>>>> question will help us to answer the following question: Should we >>>>>>>>>> support >>>>>>>>>> more engine versions more robustly, or be a bit more aggressive and >>>>>>>>>> concentrate on getting the new features that users need most in >>>>>>>>>> order to >>>>>>>>>> keep the project more competitive? >>>>>>>>>> >>>>>>>>>> If people watch the Apache Iceberg project and check the issues & >>>>>>>>>> PRs frequently, I guess more than 90% of people will answer the priority >>>>>>>>>> question the same way: there is no doubt it is making the whole v2 story >>>>>>>>>> production-ready. The current roadmap discussion also proves this >>>>>>>>>> point: >>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E >>>>>>>>>> . >>>>>>>>>> >>>>>>>>>> In order to ensure the highest priority at this point in time, I >>>>>>>>>> prefer option-1 to reduce the cost of engine maintenance, so as >>>>>>>>>> to >>>>>>>>>> free up resources to make v2 production-ready. >>>>>>>>>> >>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao < >>>>>>>>>> sai.sai.s...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> From the dev's point of view, it is less of a burden to always support the >>>>>>>>>>> latest version of Spark (for example). But from the user's point of view, >>>>>>>>>>> especially for us who maintain Spark internally, it is not easy to >>>>>>>>>>> upgrade >>>>>>>>>>> the Spark version for the first time (since we have many >>>>>>>>>>> customizations >>>>>>>>>>> internally), and we're still promoting the upgrade to 3.1.2. If the >>>>>>>>>>> community ditches support for old versions of Spark 3, users have >>>>>>>>>>> to >>>>>>>>>>> maintain it themselves unavoidably. >>>>>>>>>>> >>>>>>>>>>> So I'm inclined to keep this support in the community, not with users >>>>>>>>>>> themselves; as for Option 2 or 3, I'm fine with either. And to >>>>>>>>>>> relieve the >>>>>>>>>>> burden, we could support a limited number of Spark versions (for example 2 >>>>>>>>>>> versions).
>>>>>>>>>>> >>>>>>>>>>> Just my two cents. >>>>>>>>>>> >>>>>>>>>>> -Saisai >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Jack Ye <yezhao...@gmail.com> wrote on Wed, Sep 15, 2021 at 1:35 PM: >>>>>>>>>>> >>>>>>>>>>>> Hi Wing Yew, >>>>>>>>>>>> >>>>>>>>>>>> I think 2.4 is a different story; we will continue to support >>>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited >>>>>>>>>>>> functionality compared to Spark 3. I believe we discussed >>>>>>>>>>>> option 3 >>>>>>>>>>>> when we were doing the Spark 3.0 to 3.1 upgrade. Recently we have been >>>>>>>>>>>> seeing the >>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a >>>>>>>>>>>> consistent strategy around this; let's take this chance to make a >>>>>>>>>>>> good >>>>>>>>>>>> community guideline for all future engine versions, especially for >>>>>>>>>>>> Spark, >>>>>>>>>>>> Flink and Hive that are in the same repository. >>>>>>>>>>>> >>>>>>>>>>>> I can totally understand your point of view Wing; in fact, >>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over >>>>>>>>>>>> 40 >>>>>>>>>>>> versions of the software because there are people who are still >>>>>>>>>>>> using Spark >>>>>>>>>>>> 1.4, believe it or not. After all, continuing to backport changes will >>>>>>>>>>>> become a >>>>>>>>>>>> liability not only on the user side, but also on the service >>>>>>>>>>>> provider side, >>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrades, as >>>>>>>>>>>> it will >>>>>>>>>>>> make the life of both parties easier in the end. New features are >>>>>>>>>>>> definitely >>>>>>>>>>>> one of the best incentives to promote an upgrade on the user side. >>>>>>>>>>>> >>>>>>>>>>>> I think the biggest issue with option 3 is its scalability, >>>>>>>>>>>> because we will have an unbounded list of packages to add and >>>>>>>>>>>> compile in >>>>>>>>>>>> the future, and we probably cannot drop support for a package >>>>>>>>>>>> once >>>>>>>>>>>> created. If we go with option 1, I think we can still publish a >>>>>>>>>>>> few patch >>>>>>>>>>>> versions for old Iceberg releases, and committers can control the >>>>>>>>>>>> amount of >>>>>>>>>>>> patch versions to guard people from abusing the power of patching. >>>>>>>>>>>> I see >>>>>>>>>>>> this as a consistent strategy also for Flink and Hive. With this >>>>>>>>>>>> strategy, >>>>>>>>>>>> we can truly have a compatibility matrix of engine versions >>>>>>>>>>>> against >>>>>>>>>>>> Iceberg versions. >>>>>>>>>>>> >>>>>>>>>>>> -Jack >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon < >>>>>>>>>>>> wyp...@cloudera.com.invalid> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2 >>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for >>>>>>>>>>>>> developers, >>>>>>>>>>>>> but I don't think it considers the interests of users. I do not >>>>>>>>>>>>> think that >>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. >>>>>>>>>>>>> It is a >>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I >>>>>>>>>>>>> think we all >>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes >>>>>>>>>>>>> from 3.0 to >>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users >>>>>>>>>>>>> running >>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop >>>>>>>>>>>>> supporting >>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>> >>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have >>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same >>>>>>>>>>>>> organization, don't >>>>>>>>>>>>> they? And they don't have a problem with making their users, all >>>>>>>>>>>>> internal, >>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already >>>>>>>>>>>>> running an >>>>>>>>>>>>> internal fork that is close to 3.2.) >>>>>>>>>>>>> >>>>>>>>>>>>> I work for an organization with customers running different >>>>>>>>>>>>> versions of Spark. It is true that we can backport new features >>>>>>>>>>>>> to older >>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to >>>>>>>>>>>>> Iceberg work >>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, >>>>>>>>>>>>> or provide >>>>>>>>>>>>> software (possibly in the form of a service) to customers, and >>>>>>>>>>>>> either way, >>>>>>>>>>>>> the organizations have the ability to backport features and fixes >>>>>>>>>>>>> to >>>>>>>>>>>>> internal versions. Are there any users out there who simply use >>>>>>>>>>>>> Apache >>>>>>>>>>>>> Iceberg and depend on the community version? >>>>>>>>>>>>> >>>>>>>>>>>>> There may be features that are broadly useful that do not >>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 >>>>>>>>>>>>> (and even >>>>>>>>>>>>> 2.4)? >>>>>>>>>>>>> >>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I >>>>>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are >>>>>>>>>>>>> required; what >>>>>>>>>>>>> are the modules you're thinking of? >>>>>>>>>>>>> >>>>>>>>>>>>> - Wing Yew >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering >>>>>>>>>>>>>> the limited resources in the open source community, the upsides >>>>>>>>>>>>>> of option 2 >>>>>>>>>>>>>> and 3 are probably not worthy. >>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard >>>>>>>>>>>>>> to predict anything, but even if these use cases are legit, >>>>>>>>>>>>>> users can still >>>>>>>>>>>>>> get the new feature by backporting it to an older version in >>>>>>>>>>>>>> case of >>>>>>>>>>>>>> upgrading to a newer version isn't an option. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yufei >>>>>>>>>>>>>> >>>>>>>>>>>>>> `This is not a contribution` >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi < >>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> To sum up what we have so far: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 >>>>>>>>>>>>>>> version)* >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade >>>>>>>>>>>>>>> to the most recent minor Spark version to consume any new >>>>>>>>>>>>>>> Iceberg features. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)* >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Can support as many Spark versions as needed and the >>>>>>>>>>>>>>> codebase is still separate as we can use separate branches. >>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may >>>>>>>>>>>>>>> slow down the development. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)* >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Introduce more modules in the same project. >>>>>>>>>>>>>>> Can consume unreleased changes but it will require at least >>>>>>>>>>>>>>> 5 modules to support 2.4, 3.1 and 3.2, making the build and >>>>>>>>>>>>>>> testing >>>>>>>>>>>>>>> complicated. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark >>>>>>>>>>>>>>> version (e.g. 3.1 to 3.2) to consume new features is a blocker? >>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would like >>>>>>>>>>>>>>> to hear what other people think/need. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Anton >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer < >>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big >>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on >>>>>>>>>>>>>>> versions and I >>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users. >>>>>>>>>>>>>>> I'm >>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that >>>>>>>>>>>>>>> only exist in >>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi < >>>>>>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed >>>>>>>>>>>>>>> separating the python module outside of the project a few weeks >>>>>>>>>>>>>>> ago, and >>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross >>>>>>>>>>>>>>> reference and >>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same >>>>>>>>>>>>>>> repository. >>>>>>>>>>>>>>> I would expect the same argument to also hold here. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this >>>>>>>>>>>>>>> moment. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the >>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest >>>>>>>>>>>>>>> versions in a >>>>>>>>>>>>>>> major version. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to >>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it >>>>>>>>>>>>>>> means we have >>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 >>>>>>>>>>>>>>> that is being >>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. >>>>>>>>>>>>>>> On top of >>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and >>>>>>>>>>>>>>> may break not >>>>>>>>>>>>>>> only between minor versions but also between patch releases. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> If there are some features requiring a newer version, it >>>>>>>>>>>>>>> makes sense to move that newer version in master. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark >>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. >>>>>>>>>>>>>>> Personally, I don’t >>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want >>>>>>>>>>>>>>> new features. >>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach >>>>>>>>>>>>>>> too that we >>>>>>>>>>>>>>> mentioned during the sync.
For example, certain new features >>>>>>>>>>>>>>> would also >>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with >>>>>>>>>>>>>>> that and that >>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to >>>>>>>>>>>>>>> find a >>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for >>>>>>>>>>>>>>> the users. >>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient >>>>>>>>>>>>>>> but our Spark >>>>>>>>>>>>>>> integration is too low-level to do that with a single module. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed >>>>>>>>>>>>>>> separating the python module outside of the project a few weeks >>>>>>>>>>>>>>> ago, and >>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross >>>>>>>>>>>>>>> reference and >>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same >>>>>>>>>>>>>>> repository. >>>>>>>>>>>>>>> I would expect the same argument to also hold here. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the >>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest >>>>>>>>>>>>>>> versions in a >>>>>>>>>>>>>>> major version. This avoids the problem that some users are >>>>>>>>>>>>>>> unwilling to >>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version >>>>>>>>>>>>>>> branches. If >>>>>>>>>>>>>>> there are some features requiring a newer version, it makes >>>>>>>>>>>>>>> sense to move >>>>>>>>>>>>>>> that newer version in master. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> In addition, because currently Spark is considered the most >>>>>>>>>>>>>>> feature-complete reference implementation compared to all other >>>>>>>>>>>>>>> engines, I >>>>>>>>>>>>>>> think we should not add artificial barriers that would slow >>>>>>>>>>>>>>> down its >>>>>>>>>>>>>>> development speed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So my thinking is closer to option 1. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi < >>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is >>>>>>>>>>>>>>>> great to support older versions but because we compile against >>>>>>>>>>>>>>>> 3.0, we >>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer >>>>>>>>>>>>>>>> versions. >>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of >>>>>>>>>>>>>>>> important features such as dynamic filtering for v2 tables, >>>>>>>>>>>>>>>> required >>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are >>>>>>>>>>>>>>>> too important >>>>>>>>>>>>>>>> to ignore. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for >>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the >>>>>>>>>>>>>>>> 3.2 features. >>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally >>>>>>>>>>>>>>>> and would >>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I see two options to move forward: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Option 1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while >>>>>>>>>>>>>>>> by releasing minor versions with bug fixes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no >>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is >>>>>>>>>>>>>>>> actively >>>>>>>>>>>>>>>> maintained. >>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master >>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 >>>>>>>>>>>>>>>> releases will only >>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate >>>>>>>>>>>>>>>> to Spark 3.2 >>>>>>>>>>>>>>>> to consume any new Spark or format features. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Option 2 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Move our Spark integration into a separate project and >>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can >>>>>>>>>>>>>>>> support as many Spark versions as needed. >>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work >>>>>>>>>>>>>>>> to release, will need a new release of the core format to >>>>>>>>>>>>>>>> consume any >>>>>>>>>>>>>>>> changes in the Spark integration. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my >>>>>>>>>>>>>>>> main worry is that we will have to release the format more >>>>>>>>>>>>>>>> frequently >>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and >>>>>>>>>>>>>>>> the overall >>>>>>>>>>>>>>>> Spark development may be slower. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Anton >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ryan Blue >>>>>>>>> Tabular >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ryan Blue >>>>>>> Tabular >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Tabular >>>>> >>>>> >>>>>