Wing, sorry, my earlier message probably misled you. I was only giving my personal opinion on Flink version support.

On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Hi OpenInx,
> I'm sorry I misunderstood the thinking of the Flink community. Thanks for the clarification.
> - Wing Yew
>
> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote:
>
>> Hi Wing,
>>
>> As we discussed above, we as a community prefer option 2 or option 3. So in fact, when we planned to upgrade the Flink version from 1.12 to 1.13, we did our best to guarantee that the master Iceberg repo could work fine with both Flink 1.12 and Flink 1.13. For more context, please see [1], [2], [3].
>>
>> [1] https://github.com/apache/iceberg/pull/3116
>> [2] https://github.com/apache/iceberg/issues/3183
>> [3] https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>
>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>
>>> In the last community sync, we spent a little time on this topic. For Spark support, there are currently two options under consideration:
>>>
>>> Option 2: Separate repo for the Spark support. Use branches for supporting different Spark versions. Main branch for the latest Spark version (3.2 to begin with). Tooling needs to be built for producing regular snapshots of core Iceberg in a consumable way for this repo. Unclear if commits to core Iceberg will be tested pre-commit against Spark support; my impression is that they will not be, and the Spark support build can be broken by changes to core.
>>>
>>> A variant of option 3 (which we will simply call Option 3 going forward): Single repo, separate module (subdirectory) for each Spark version to be supported. Code duplication in each Spark module (no attempt to refactor out common code). Each module built against the specific version of Spark to be supported, producing a runtime jar built against that version. CI will test all modules. Support can be provided for only building the modules a developer cares about.
>>>
>>> More input was sought and people are encouraged to voice their preference. I lean towards Option 3.
>>>
>>> - Wing Yew
>>>
>>> ps. In the sync, as Steven Wu wrote, the question was raised whether the same multi-version support strategy can be adopted across engines. Based on what Steven wrote, the Flink developer community's bandwidth currently makes supporting only a single Flink version (and focusing resources on developing new features on that version) the preferred choice. If so, then no multi-version support strategy for Flink is needed at this time.
>>>
>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> During the sync meeting, people talked about if and how we can have the same version support model across engines like Flink and Spark. I can provide some input from the Flink side.
>>>>
>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is the latest released version. That means only Flink 1.12 and 1.13 are supported. Feature changes or bug fixes will only be backported to 1.12 and 1.13, unless it is a serious bug (like security). With that context, personally I like option 1 (with one actively supported Flink version in the master branch) for the iceberg-flink module.
>>>>
>>>> We discussed the idea of supporting multiple Flink versions via a shim layer and multiple modules.
>>>> While it may be a little better to support multiple Flink versions, I don't know if there is enough support and resources from the community to pull it off. There is also the ongoing maintenance burden for each minor version release from Flink, which happens roughly every 4 months.
>>>>
>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>
>>>>> Since you mentioned Hive, I chime in with what we do there. You might find it useful:
>>>>> - metastore module - only small differences; DynConstructor solves it for us
>>>>> - mr module - some bigger differences, but still manageable for Hive 2-3. Need some new classes, but most of the code is reused - extra module for Hive 3. For Hive 4 we use a different repo as we moved to the Hive codebase.
>>>>>
>>>>> My thoughts based on the above experience:
>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have problems with backporting changes between repos and we are lagging behind, which hurts both projects.
>>>>> - The Hive 2-3 model works better by forcing us to keep things in sync, but with serious differences in the Hive project it still doesn't seem like a viable option.
>>>>>
>>>>> So I think the question is: how stable is the Spark code we are integrating with? If it is fairly stable, then we are better off with a "one repo, multiple modules" approach, and we should consider the multi-repo only if the differences become prohibitive.
>>>>>
>>>>> Thanks, Peter
>>>>>
>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <aokolnyc...@apple.com.invalid> wrote:
>>>>>
>>>>>> Okay, looks like there is consensus around supporting multiple Spark versions at the same time. There are folks who mentioned this on this thread and there were folks who brought this up during the sync.
>>>>>>
>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>
>>>>>> Option 2
>>>>>>
>>>>>> In Option 2, there will be a separate repo. I believe the master branch will soon point to Spark 3.2 (the most recent supported version). The main development will happen there and the artifact version will be 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where we will cherry-pick applicable changes. Once we are ready to release the 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for cherry-picks.
>>>>>>
>>>>>> I guess we will continue to shade everything in the new repo and will have to release every time the core is released. We will do a maintenance release for each supported Spark version whenever we cut a new maintenance Iceberg release or need to fix any bugs in the Spark integration. Under this model, we will probably need nightly snapshots (or one per commit) of the core format, and the Spark integration will depend on snapshots until we are ready to release.
>>>>>>
>>>>>> Overall, I think this option gives us very simple builds and provides the best separation. It will keep the main repo clean. The main downside is that we will have to split a Spark feature into two PRs: one against the core and one against the Spark integration.
>>>>>> Certain changes in core can also break the Spark integration and will require adaptations.
>>>>>>
>>>>>> Ryan, I am not sure I fully understood the testing part. How will we be able to test the Spark integration in the main repo if certain changes in core may break the Spark integration and require changes there? Will we try to prohibit such changes?
>>>>>>
>>>>>> Option 3 (modified)
>>>>>>
>>>>>> If I understand correctly, the modified Option 3 sounds very close to the approach initially suggested by Imran, but with code duplication instead of extra refactoring and introducing new common modules.
>>>>>>
>>>>>> Jack, are you suggesting we test only a single Spark version at a time? Or do we expect to test all versions? Will there be any difference compared to just having a module per version? I did not fully understand.
>>>>>>
>>>>>> My worry with this approach is that our build will be very complicated and we will still have a lot of Spark-related modules in the main repo. Once people start using Flink and Hive more, will we have to do the same?
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> I'd support the option that Jack suggests if we can set a few expectations for keeping it clean.
>>>>>>
>>>>>> First, I'd like to avoid refactoring code to share it across Spark versions -- that introduces risk because we're relying on compiling against one version and running in another, and both Spark and Scala change rapidly. A big benefit of options 1 and 2 is that we mostly focus on only one Spark version. I think we should duplicate code rather than spend time refactoring to rely on binary compatibility. I propose we start each new Spark version by copying the last one and updating it. And we should build just the latest supported version by default.
>>>>>>
>>>>>> The drawback to having everything in a single repo is that we wouldn't be able to cherry-pick changes across Spark versions/branches, but I think Jack is right that having a single build is better.
>>>>>>
>>>>>> Second, we should make CI faster by running the Spark builds in parallel. It sounds like this is what would happen anyway, with a property that selects the Spark version that you want to build against.
>>>>>>
>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway; as Wing listed, we are just using the git branch as an additional dimension, but my understanding is that you will still have 1 core, 1 extension, and 1 runtime artifact published for each Spark version in either approach.
>>>>>>>
>>>>>>> In that case (this is just brainstorming), I wonder if we can explore a modified option 3 that flattens all the versions in each Spark branch in option 2 into master. The repository structure would look something like:
>>>>>>>
>>>>>>> iceberg/api/...
>>>>>>>        /bundled-guava/...
>>>>>>>        /core/...
>>>>>>>        ...
>>>>>>>        /spark/2.4/core/...
>>>>>>>                  /extension/...
>>>>>>>                  /runtime/...
>>>>>>>              /3.1/core/...
>>>>>>>                  /extension/...
>>>>>>>                  /runtime/...
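To make the "modified option 3" layout above concrete, here is a minimal sketch of how a root Gradle settings script could include only the Spark versions a developer asks for, defaulting to the newest one. The property name, module names, and directory paths are assumptions for illustration only, not the actual Iceberg build.

```kotlin
// settings.gradle.kts -- hypothetical sketch; property, module, and path names are assumptions.
import java.io.File

rootProject.name = "iceberg"

// Always build the engine-independent modules.
include("iceberg-api", "iceberg-core", "iceberg-bundled-guava")

// Build only the Spark versions requested via -PsparkVersions=2.4,3.1,3.2,
// defaulting to the newest supported version so a plain build stays small.
val sparkVersions: List<String> =
    (startParameter.projectProperties["sparkVersions"] ?: "3.2")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotEmpty() }

sparkVersions.forEach { v ->
    // One core, one extensions, and one runtime module per Spark version,
    // mapped onto the flattened spark/<version>/... directory layout.
    include("iceberg-spark-$v", "iceberg-spark-extensions-$v", "iceberg-spark-runtime-$v")
    project(":iceberg-spark-$v").projectDir = File(settingsDir, "spark/$v/core")
    project(":iceberg-spark-extensions-$v").projectDir = File(settingsDir, "spark/$v/extension")
    project(":iceberg-spark-runtime-$v").projectDir = File(settingsDir, "spark/$v/runtime")
}
```

With something like this, a plain `./gradlew build` would compile and test only the newest Spark modules, while CI could pass `-PsparkVersions` explicitly to fan the remaining versions out into parallel jobs, which matches the property-driven selection discussed elsewhere in this thread.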
>>>>>>> The Gradle build script in the root is configured to build against the latest version of Spark by default, unless otherwise specified by the user.
>>>>>>>
>>>>>>> IntelliJ can also be configured to only index files of specific versions based on the same config used in the build.
>>>>>>>
>>>>>>> In this way, I imagine the CI setup would be much easier for things like testing version compatibility for a feature, or running only a specific subset of Spark version builds based on the Spark version directories touched.
>>>>>>>
>>>>>>> And the biggest benefit is that we don't have the same difficulty as option 2 of developing a feature when it touches both core and Spark.
>>>>>>>
>>>>>>> We can then develop a mechanism to vote to stop support of certain versions, and archive the corresponding directory to avoid accumulating too many versions in the long term.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and Iceberg Spark, I just didn't mention it, and I see how that's a big thing to leave out!
>>>>>>>>
>>>>>>>> I would definitely want to test the projects together. One thing we could do is have a nightly build like Russell suggests. I'm also wondering if we could have some tighter integration where the Iceberg Spark build can be included in the Iceberg Java build using properties. Maybe the GitHub action could check out Iceberg, then check out the Spark integration's latest branch, and then run the Gradle build with a property that makes Spark a subproject in the build. That way we can continue to have Spark CI run regularly.
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I agree that Option 2 is considerably more difficult for development when core API changes need to be picked up by the external Spark module. I also think a monthly release would probably still be prohibitive to actually implementing new features that appear in the API. I would hope we have a much faster process, or maybe just have snapshot artifacts published nightly?
>>>>>>>>>
>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate repo (subproject of Iceberg). Would we have branches such as 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all versions or all Spark 3 versions, we would need to commit the changes to all applicable branches. Basically we are trading more work to commit to multiple branches for a simplified build and CI time per branch, which might be an acceptable trade-off. However, the biggest downside is that changes may need to be made in core Iceberg as well as in the engine (in this case Spark) support, and we need to wait for a release of core Iceberg to consume the changes in the subproject.
>>>>>>>>> In this case, maybe we should have a monthly release of core Iceberg (no matter how many changes go in, as long as it is non-zero) so that the subproject can consume changes fairly quickly?
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of potential solutions well defined.
>>>>>>>>>>
>>>>>>>>>> Looks like the next step is to decide whether we want to require people to update Spark versions to pick up newer versions of Iceberg. If we choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>>>>
>>>>>>>>>> I don’t think that we should make updating Spark a requirement. Many of the things that we’re working on are orthogonal to Spark versions, like table maintenance actions, secondary indexes, the 1.0 API, views, ORC delete files, new storage implementations, etc. Upgrading Spark is time consuming and untrusted in my experience, so I think we would be setting up an unnecessary trade-off between spending lots of time to upgrade Spark and picking up new Iceberg features.
>>>>>>>>>>
>>>>>>>>>> Another way of thinking about this is that if we went with option 1, then we could port bug fixes into 0.12.x. But there are many things that wouldn’t fit this model, like adding a FileIO implementation for ADLS. So some people in the community would have to maintain branches of newer Iceberg versions with older versions of Spark outside of the main Iceberg project — that defeats the purpose of simplifying things with option 1, because we would then have more people maintaining the same 0.13.x with Spark 3.1 branch. (This reminds me of the Spark community, where we wanted to release a 2.5 line with DSv2 backported, but the community decided not to, so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>>>>>
>>>>>>>>>> If the community is going to do the work anyway — and I think some of us would — we should make it possible to share that work. That’s why I don’t think that we should go with option 1.
>>>>>>>>>>
>>>>>>>>>> If we don’t go with option 1, then the choice is how to maintain multiple Spark versions. I think that the way we’re doing it right now is not something we want to continue.
>>>>>>>>>>
>>>>>>>>>> Using multiple modules (option 3) is concerning to me because of the changes in Spark. We currently structure the library to share as much code as possible. But that means compiling against different Spark versions and relying on binary compatibility and reflection in some cases. To me, this seems unmaintainable in the long run because it requires refactoring common classes and spending a lot of time deduplicating code. It also creates a ton of modules: at least one common module, then a module per version, then an extensions module per version, and finally a runtime module per version.
>>>>>>>>>> That’s 3 modules per Spark version, plus any new common modules. And each module needs to be tested, which is making our CI take a really long time. We also don’t support multiple Scala versions, which is another gap that will require even more modules and tests.
>>>>>>>>>>
>>>>>>>>>> I like option 2 because it would allow us to compile against a single version of Spark (which will be much more reliable). It would give us an opportunity to support different Scala versions. It avoids the need to refactor to share code and allows people to focus on a single version of Spark, while also creating a way for people to maintain and update the older versions with newer Iceberg releases. I don’t think that this would slow down development. I think it would actually speed it up because we’d be spending less time trying to make multiple versions work in the same build. And anyone in favor of option 1 would basically get option 1: you don’t have to care about branches for older Spark versions.
>>>>>>>>>>
>>>>>>>>>> Jack makes a good point about wanting to keep code in a single repository, but I think that the need to manage more version combinations overrides this concern. It’s easier to make this decision in Python because we’re not trying to depend on two projects that change relatively quickly. We’re just trying to build a library.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up, Anton.
>>>>>>>>>>>
>>>>>>>>>>> Everyone has great pros/cons to support their preferences. Before giving my preference, let me raise one question: what is the top priority for the Apache Iceberg project at this point in time? This question will help us answer the following question: should we support more engine versions more robustly, or be a bit more aggressive and concentrate on getting the new features that users need most in order to keep the project competitive?
>>>>>>>>>>>
>>>>>>>>>>> If people watch the Apache Iceberg project and check the issues & PRs frequently, I guess more than 90% of people will answer the priority question the same way: there is no doubt it is making the whole v2 story production-ready. The current roadmap discussion also proves this: https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>
>>>>>>>>>>> In order to ensure the highest priority at this point in time, I prefer option 1 to reduce the cost of engine maintenance, so as to free up resources to make v2 production-ready.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From the dev's point of view, it is less of a burden to always support the latest version of Spark (for example).
>>>>>>>>>>>> But from the user's point of view, especially for those of us who maintain Spark internally, it is not easy to upgrade the Spark version in the first place (since we have many customizations internally), and we're still promoting the upgrade to 3.1.2. If the community ditches support for old versions of Spark 3, users unavoidably have to maintain it themselves.
>>>>>>>>>>>>
>>>>>>>>>>>> So I'm inclined to keep this support in the community rather than leave it to users themselves; as for Option 2 or 3, I'm fine with either. And to relieve the burden, we could support a limited number of Spark versions (for example, 2 versions).
>>>>>>>>>>>>
>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>
>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 1:35 PM, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think 2.4 is a different story; we will continue to support Spark 2.4, but as you can see it will continue to have very limited functionality compared to Spark 3. I believe we discussed option 3 when we were doing the Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy around this; let's take this chance to make a good community guideline for all future engine versions, especially for Spark, Flink and Hive, which are in the same repository.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can totally understand your point of view, Wing. In fact, speaking from the perspective of AWS EMR, we have to support over 40 versions of the software because there are people who are still using Spark 1.4, believe it or not. After all, continuing to backport changes becomes a liability not only on the user side but also on the service provider side, so I believe it's not a bad practice to push for user upgrades, as it will make the lives of both parties easier in the end. A new feature is definitely one of the best incentives to promote an upgrade on the user side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the biggest issue with option 3 is its scalability, because we will have an unbounded list of packages to add and compile in the future, and we probably cannot drop support for a package once it is created. If we go with option 1, I think we can still publish a few patch versions for old Iceberg releases, and committers can control the number of patch versions to guard people from abusing the power of patching. I see this as a consistent strategy for Flink and Hive as well. With this strategy, we can truly have a compatibility matrix of engine versions against Iceberg versions.
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users. I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all know that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users running Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark 2.4?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken out in favor of Option 1 all work for the same organization, don't they? And they don't have a problem with making their users, all internal, simply upgrade to Spark 3.2, do they? (Or they are already running an internal fork that is close to 3.2.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I work for an organization with customers running different versions of Spark. It is true that we can backport new features to older versions if we want to. I suppose the people contributing to Iceberg work for some organization or other that either uses Iceberg in-house or provides software (possibly in the form of a service) to customers, and either way, the organizations have the ability to backport features and fixes to internal versions. Are there any users out there who simply use Apache Iceberg and depend on the community version?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There may be features that are broadly useful that do not depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would consider Option 3 too. Anton, you said 5 modules are required; what are the modules you're thinking of?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down development. Considering the limited resources in the open source community, the upsides of options 2 and 3 are probably not worth it.
>>>>>>>>>>>>>>> 2. Both 2 and 3 assume use cases that may not exist.
>>>>>>>>>>>>>>> It's hard to predict anything, but even if these use cases are legit, users can still get a new feature by backporting it to an older version in case upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>>>>>>> The easiest option for us devs; forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>> Can support as many Spark versions as needed, and the codebase is still separate as we can use separate branches. Impossible to consume any unreleased changes in core; may slow down development.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>> Introduces more modules in the same project. Can consume unreleased changes, but it will require at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version (e.g., 3.1 to 3.2) to consume new features is a blocker? We follow Option 1 internally at the moment, but I would like to hear what other people think/need.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions, and I don't think minor version upgrades are a large issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple-Spark-version-support future.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2, which is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that, and not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient, but our Spark integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>>>>>>>>>>>>>> In addition, because Spark is currently considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions, but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions. Spark 3.2 is just around the corner and it brings a lot of important features such as dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master; maintain 0.12 for a while by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master could also work with older Spark versions, but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark; we can support as many Spark versions as needed.
>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work to release, and we will need a new release of the core format to consume any changes in the Spark integration.
>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user, but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
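To make the Option 2 trade-off described above more concrete, the following is a minimal sketch of how a separate Spark-integration repo might consume nightly snapshots of core between Iceberg releases, as several messages in the thread suggest. The coordinates, version numbers, and snapshot repository URL are assumptions for illustration, not a published setup.

```kotlin
// build.gradle.kts of a hypothetical separate iceberg-spark repo -- sketch only.
// Coordinates, versions, and the snapshot repository are illustrative assumptions.
plugins {
    `java-library`
}

repositories {
    mavenCentral()
    // Nightly (or per-commit) snapshots of core would be published somewhere like this.
    maven("https://repository.apache.org/content/repositories/snapshots/")
}

dependencies {
    // Track unreleased core changes through SNAPSHOT artifacts between Iceberg releases.
    api("org.apache.iceberg:iceberg-api:0.13.0-SNAPSHOT")
    api("org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT")

    // Compile against exactly one Spark version per branch (master tracking the newest).
    compileOnly("org.apache.spark:spark-sql_2.12:3.2.0")
}
```

Until a core change is published at least as a snapshot, a feature that spans core and the Spark integration cannot be finished in the separate repo, which is why a monthly release or a nightly/per-commit snapshot cadence comes up repeatedly in the thread.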