Hi OpenInx,
I'm sorry I misunderstood the thinking of the Flink community. Thanks for
the clarification.
- Wing Yew


On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote:

> Hi Wing
>
> As we discussed above, the community prefers option 2 or option 3. So in
> fact, when we planned to upgrade the Flink version from 1.12 to 1.13, we
> did our best to guarantee that the master Iceberg repo works fine for both
> Flink 1.12 and Flink 1.13. For more context, please see [1], [2], [3].
>
> [1] https://github.com/apache/iceberg/pull/3116
> [2] https://github.com/apache/iceberg/issues/3183
> [3]
> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>
>
> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> In the last community sync, we spent a little time on this topic. For
>> Spark support, there are currently two options under consideration:
>>
>> Option 2: Separate repo for the Spark support. Use branches for
>> supporting different Spark versions. Main branch for the latest Spark
>> version (3.2 to begin with).
>> Tooling needs to be built for producing regular snapshots of core Iceberg
>> in a consumable way for this repo. Unclear if commits to core Iceberg will
>> be tested pre-commit against Spark support; my impression is that they will
>> not be, and the Spark support build can be broken by changes to core.
>>
>> A variant of option 3 (which we will simply call Option 3 going forward):
>> Single repo, separate module (subdirectory) for each Spark version to be
>> supported. Code duplication in each Spark module (no attempt to refactor
>> out common code). Each module built against the specific version of Spark
>> to be supported, producing a runtime jar built against that version. CI
>> will test all modules. Support can be provided for only building the
>> modules a developer cares about.
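>>
>> For illustration, the per-version module selection could be a small sketch
>> in settings.gradle along these lines (the property name and module names
>> here are hypothetical, not an agreed layout):
>>
>>     // Build only the Spark modules that were asked for; default to the latest.
>>     def sparkVersions = (System.getProperty("sparkVersions") ?: "3.2").split(",")
>>     sparkVersions.each { v ->
>>       include ":iceberg-spark-${v}"
>>       include ":iceberg-spark-extensions-${v}"
>>       include ":iceberg-spark-runtime-${v}"
>>     }
>>
>> A developer would then run something like ./gradlew -DsparkVersions=3.2
>> build, while CI passes the full list.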
>>
>> More input was sought and people are encouraged to voice their preference.
>> I lean towards Option 3.
>>
>> - Wing Yew
>>
>> ps. In the sync, as Steven Wu wrote, the question was raised if the same
>> multi-version support strategy can be adopted across engines. Based on what
>> Steven wrote, currently the Flink developer community's bandwidth makes
>> supporting only a single Flink version (and focusing resources on
>> developing new features on that version) the preferred choice. If so, then
>> no multi-version support strategy for Flink is needed at this time.
>>
>>
>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> During the sync meeting, people talked about if and how we can have the
>>> same version support model across engines like Flink and Spark. I can
>>> provide some input from the Flink side.
>>>
>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>> supported. Feature changes and bug fixes are only backported to 1.12 and
>>> 1.13; older versions get nothing unless it is a serious bug (like a
>>> security issue). With that context, I personally like option 1 (with one
>>> actively supported Flink version in the master branch) for the
>>> iceberg-flink module.
>>>
>>> We discussed the idea of supporting multiple Flink versions via a shim
>>> layer and multiple modules. While it may be a little better to support
>>> multiple Flink versions, I don't know if there is enough support and
>>> resources from the community to pull it off. There is also the ongoing
>>> maintenance burden for each minor version release from Flink, which
>>> happens roughly every 4 months.
>>>
>>>
>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Since you mentioned Hive, I chime in with what we do there. You might
>>>> find it useful:
>>>> - metastore module - only small differences; DynConstructors solves them
>>>> for us
>>>> - mr module - some bigger differences, but still manageable for Hive 2-3.
>>>> Need some new classes, but most of the code is reused
>>>> - extra module for Hive 3
>>>> - For Hive 4 we use a different repo, as we moved to the Hive codebase.
>>>>
>>>> My thoughts based on the above experience:
>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>> have problems with backporting changes between repos and we are lagging
>>>> behind, which hurts both projects.
>>>> - The Hive 2-3 model works better by forcing us to keep things in sync,
>>>> but with the serious differences in the Hive project it still doesn't
>>>> seem like a viable option for Hive 4.
>>>>
>>>> So I think the question is: how stable is the Spark code we are
>>>> integrating with? If it is fairly stable, then we are better off with a
>>>> "one repo, multiple modules" approach, and we should consider a
>>>> multi-repo setup only if the differences become prohibitive.
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>>> Okay, looks like there is consensus around supporting multiple Spark
>>>>> versions at the same time. There are folks who mentioned this on this
>>>>> thread and there were folks who brought this up during the sync.
>>>>>
>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>
>>>>> Option 2
>>>>>
>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>> branch will soon point to Spark 3.2 (the most recent supported version).
>>>>> The main development will happen there and the artifact version will be
>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>> branches where we will cherry-pick applicable changes. Once we are ready 
>>>>> to
>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3
>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 
>>>>> 0.2.x-spark-3.1
>>>>> branches for cherry-picks.
>>>>>
>>>>> I guess we will continue to shade everything in the new repo and will
>>>>> have to release every time the core is released. We will do a maintenance
>>>>> release for each supported Spark version whenever we cut a new 
>>>>> maintenance Iceberg
>>>>> release or need to fix any bugs in the Spark integration.
>>>>> Under this model, we will probably need nightly snapshots (or on each
>>>>> commit) for the core format and the Spark integration will depend on
>>>>> snapshots until we are ready to release.
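>>>>>
>>>>> For illustration only, the Spark repo consuming such snapshots could look
>>>>> roughly like this in its build.gradle (the repository URL, coordinates and
>>>>> snapshot version here are placeholders, not a decided setup):
>>>>>
>>>>>     repositories {
>>>>>       mavenCentral()
>>>>>       maven { url "https://repository.apache.org/content/repositories/snapshots" }
>>>>>     }
>>>>>     dependencies {
>>>>>       // resolved from the nightly/per-commit snapshots of the core format
>>>>>       implementation "org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT"
>>>>>     }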
>>>>>
>>>>> Overall, I think this option gives us very simple builds and provides
>>>>> the best separation. It will keep the main repo clean. The main downside
>>>>> is that we will have to split a Spark feature into two PRs: one against
>>>>> core and one against the Spark integration. Certain changes in core can
>>>>> also break the Spark integration and will require adaptations there.
>>>>>
>>>>> Ryan, I am not sure I fully understood the testing part. How will we
>>>>> be able to test the Spark integration in the main repo if certain changes
>>>>> in core may break the Spark integration and require changes there? Will we
>>>>> try to prohibit such changes?
>>>>>
>>>>> Option 3 (modified)
>>>>>
>>>>> If I understand correctly, the modified Option 3 sounds very close to
>>>>> the approach initially suggested by Imran, but with code duplication
>>>>> instead of extra refactoring and new common modules.
>>>>>
>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>> time? Or do we expect to test all versions? Will there be any difference
>>>>> compared to just having a module per version? I did not fully
>>>>> understand.
>>>>>
>>>>> My worry with this approach is that our build will be very complicated
>>>>> and we will still have a lot of Spark-related modules in the main repo.
>>>>> Once people start using Flink and Hive more, will we have to do the same?
>>>>>
>>>>> - Anton
>>>>>
>>>>>
>>>>>
>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>> I'd support the option that Jack suggests if we can set a few
>>>>> expectations for keeping it clean.
>>>>>
>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>> versions -- that introduces risk because we're relying on compiling 
>>>>> against
>>>>> one version and running in another and both Spark and Scala change 
>>>>> rapidly.
>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>>> version. I think we should duplicate code rather than spend time
>>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>>> Spark version by copying the last one and updating it. And we should build
>>>>> just the latest supported version by default.
>>>>>
>>>>> The drawback to having everything in a single repo is that we wouldn't
>>>>> be able to cherry-pick changes across Spark versions/branches, but I think
>>>>> Jack is right that having a single build is better.
>>>>>
>>>>> Second, we should make CI faster by running the Spark builds in
>>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>>> that selects the Spark version that you want to build against.
>>>>>
>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> I think in Ryan's proposal we will create a ton of modules anyway; as
>>>>>> Wing listed, we are just using the git branch as an additional dimension,
>>>>>> but my understanding is that you will still have 1 core, 1 extension, and
>>>>>> 1 runtime artifact published for each Spark version in either approach.
>>>>>>
>>>>>> In that case (this is just brainstorming), I wonder if we can explore
>>>>>> a modified option 3 that flattens all the versions from the Spark
>>>>>> branches in option 2 into master. The repository structure would look
>>>>>> something like:
>>>>>>
>>>>>> iceberg/api/...
>>>>>>        /bundled-guava/...
>>>>>>        /core/...
>>>>>>        ...
>>>>>>        /spark/2.4/core/...
>>>>>>                  /extension/...
>>>>>>                  /runtime/...
>>>>>>              /3.1/core/...
>>>>>>                  /extension/...
>>>>>>                  /runtime/...
>>>>>>
>>>>>> The gradle build script in the root is configured to build against
>>>>>> the latest version of Spark by default, unless otherwise specified by the
>>>>>> user.
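>>>>>>
>>>>>> As a rough sketch (the property name and project paths below are made up,
>>>>>> just to illustrate the idea), the root settings.gradle could map the
>>>>>> flattened directories to Gradle projects and default to the latest Spark:
>>>>>>
>>>>>>     // Pick the Spark version to build; default to the newest supported one.
>>>>>>     def sparkVersion = System.getProperty("sparkVersion") ?: "3.1"
>>>>>>     ["core", "extension", "runtime"].each { m ->
>>>>>>       String path = ":iceberg-spark-${sparkVersion}-${m}"
>>>>>>       include path
>>>>>>       project(path).projectDir = new File(rootDir, "spark/${sparkVersion}/${m}")
>>>>>>     }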
>>>>>>
>>>>>> Intellij can also be configured to only index files of specific
>>>>>> versions based on the same config used in build.
>>>>>>
>>>>>> In this way, I imagine the CI setup makes it much easier to do things
>>>>>> like testing version compatibility for a feature, or running only a
>>>>>> specific subset of the Spark version builds based on which Spark version
>>>>>> directories were touched.
>>>>>>
>>>>>> And the biggest benefit is that we don't have the same difficulty as
>>>>>> option 2 when developing a feature that spans both core and Spark.
>>>>>>
>>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>>> versions, and archive the corresponding directory to avoid accumulating 
>>>>>> too
>>>>>> many versions in the long term.
>>>>>>
>>>>>> -Jack Ye
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>>>>> Iceberg Spark; I just didn't mention it, and I see how that's a big
>>>>>>> thing to leave out!
>>>>>>>
>>>>>>> I would definitely want to test the projects together. One thing we
>>>>>>> could do is have a nightly build like Russell suggests. I'm also 
>>>>>>> wondering
>>>>>>> if we could have some tighter integration where the Iceberg Spark build 
>>>>>>> can
>>>>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>>>>> action could checkout Iceberg, then checkout the Spark integration's 
>>>>>>> latest
>>>>>>> branch, and then run the gradle build with a property that makes Spark a
>>>>>>> subproject in the build. That way we can continue to have Spark CI run
>>>>>>> regularly.
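>>>>>>>
>>>>>>> As a sketch of the Gradle side of that idea (the property and directory
>>>>>>> names below are invented), the core settings.gradle could conditionally
>>>>>>> pull in a checked-out Spark integration:
>>>>>>>
>>>>>>>     // If CI checked out the Spark integration next to this repo and asked
>>>>>>>     // for it, include it as a subproject of this build.
>>>>>>>     def sparkDir = System.getProperty("sparkProjectDir")
>>>>>>>     if (sparkDir != null) {
>>>>>>>       include ":iceberg-spark"
>>>>>>>       project(":iceberg-spark").projectDir = new File(sparkDir)
>>>>>>>     }
>>>>>>>
>>>>>>> The action would check out both repos and run the build with
>>>>>>> -DsparkProjectDir pointing at the second checkout.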
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I agree that Option 2 is considerably more difficult for
>>>>>>>> development when core API changes need to be picked up by the external
>>>>>>>> Spark module. I also think a monthly release would probably still be
>>>>>>>> prohibitive to actually implementing new features that appear in the
>>>>>>>> API. I would hope we have a much faster process, or maybe just have
>>>>>>>> snapshot artifacts published nightly?
>>>>>>>>
>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>> wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>> supported in all versions or all Spark 3 versions, then we would need 
>>>>>>>> to
>>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>>> well as in the engine (in this case Spark) support, and we need to 
>>>>>>>> wait for
>>>>>>>> a release of core Iceberg to consume the changes in the subproject. In 
>>>>>>>> this
>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no 
>>>>>>>> matter how
>>>>>>>> many changes go in, as long as it is non-zero) so that the subproject 
>>>>>>>> can
>>>>>>>> consume changes fairly quickly?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set
>>>>>>>>> of potential solutions well defined.
>>>>>>>>>
>>>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. 
>>>>>>>>> If we
>>>>>>>>> choose to make people upgrade, then option 1 is clearly the best 
>>>>>>>>> choice.
>>>>>>>>>
>>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark 
>>>>>>>>> versions,
>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, 
>>>>>>>>> views, ORC
>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is 
>>>>>>>>> time
>>>>>>>>> consuming and untrusted in my experience, so I think we would be 
>>>>>>>>> setting up
>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade 
>>>>>>>>> Spark and
>>>>>>>>> picking up new Iceberg features.
>>>>>>>>>
>>>>>>>>> Another way of thinking about this is that if we went with option
>>>>>>>>> 1, then we could port bug fixes into 0.12.x. But there are many 
>>>>>>>>> things that
>>>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for 
>>>>>>>>> ADLS. So
>>>>>>>>> some people in the community would have to maintain branches of newer
>>>>>>>>> Iceberg versions with older versions of Spark outside of the main 
>>>>>>>>> Iceberg
>>>>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>>>>> because we would then have more people maintaining the same 0.13.x 
>>>>>>>>> with
>>>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we 
>>>>>>>>> wanted
>>>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided 
>>>>>>>>> not
>>>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, 
>>>>>>>>> etc.)
>>>>>>>>>
>>>>>>>>> If the community is going to do the work anyway — and I think some
>>>>>>>>> of us would — we should make it possible to share that work. That’s 
>>>>>>>>> why I
>>>>>>>>> don’t think that we should go with option 1.
>>>>>>>>>
>>>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>>>> multiple Spark versions. I think that the way we’re doing it right 
>>>>>>>>> now is
>>>>>>>>> not something we want to continue.
>>>>>>>>>
>>>>>>>>> Using multiple modules (option 3) is concerning to me because of
>>>>>>>>> the changes in Spark. We currently structure the library to share as 
>>>>>>>>> much
>>>>>>>>> code as possible. But that means compiling against different Spark 
>>>>>>>>> versions
>>>>>>>>> and relying on binary compatibility and reflection in some cases. To 
>>>>>>>>> me,
>>>>>>>>> this seems unmaintainable in the long run because it requires 
>>>>>>>>> refactoring
>>>>>>>>> common classes and spending a lot of time deduplicating code. It also
>>>>>>>>> creates a ton of modules, at least one common module, then a module 
>>>>>>>>> per
>>>>>>>>> version, then an extensions module per version, and finally a runtime
>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new 
>>>>>>>>> common
>>>>>>>>> modules. And each module needs to be tested, which is making our CI 
>>>>>>>>> take a
>>>>>>>>> really long time. We also don’t support multiple Scala versions, 
>>>>>>>>> which is
>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>
>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>> single version of Spark (which will be much more reliable). It would 
>>>>>>>>> give
>>>>>>>>> us an opportunity to support different Scala versions. It avoids the 
>>>>>>>>> need
>>>>>>>>> to refactor to share code and allows people to focus on a single 
>>>>>>>>> version of
>>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>>> older versions with newer Iceberg releases. I don’t think that this 
>>>>>>>>> would
>>>>>>>>> slow down development. I think it would actually speed it up because 
>>>>>>>>> we’d
>>>>>>>>> be spending less time trying to make multiple versions work in the 
>>>>>>>>> same
>>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: 
>>>>>>>>> you
>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>
>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>> repository, but I think that the need to manage more version 
>>>>>>>>> combinations
>>>>>>>>> overrides this concern. It’s easier to make this decision in python 
>>>>>>>>> because
>>>>>>>>> we’re not trying to depend on two projects that change relatively 
>>>>>>>>> quickly.
>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>
>>>>>>>>>> Everyone has great pros/cons to support their preferences. Before
>>>>>>>>>> giving my preference, let me raise one question: what's the top
>>>>>>>>>> priority for the Apache Iceberg project at this point in time? This
>>>>>>>>>> question will help us answer the following one: should we support
>>>>>>>>>> more engine versions more robustly, or be a bit more aggressive and
>>>>>>>>>> concentrate on the new features that users need most in order to
>>>>>>>>>> keep the project competitive?
>>>>>>>>>>
>>>>>>>>>> If people watch the Apache Iceberg project and check the issues &
>>>>>>>>>> PRs frequently, I guess more than 90% of people will answer the
>>>>>>>>>> priority question the same way: there is no doubt it is making the
>>>>>>>>>> whole v2 story production-ready. The current roadmap discussion also
>>>>>>>>>> proves the point:
>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>
>>>>>>>>>> To keep the focus on that highest priority at this point in time, I
>>>>>>>>>> prefer option 1 to reduce the cost of engine maintenance, so as to
>>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>> sai.sai.s...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> From the dev's point of view, it is less of a burden to always
>>>>>>>>>>> support only the latest version of Spark (for example). But from the
>>>>>>>>>>> user's point of view, especially for those of us who maintain Spark
>>>>>>>>>>> internally, it is not easy to upgrade the Spark version in the first
>>>>>>>>>>> place (since we have many customizations internally), and we're still
>>>>>>>>>>> pushing the upgrade to 3.1.2. If the community ditches support for old
>>>>>>>>>>> versions of Spark 3, users unavoidably have to maintain it themselves.
>>>>>>>>>>>
>>>>>>>>>>> So I'm inclined to keep this support in the community rather than
>>>>>>>>>>> leave it to users themselves; as for Option 2 or 3, I'm fine with
>>>>>>>>>>> either. And to relieve the burden, we could support a limited number
>>>>>>>>>>> of Spark versions (for example, 2 versions).
>>>>>>>>>>>
>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>
>>>>>>>>>>> -Saisai
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>
>>>>>>>>>>>> I think 2.4 is a different story; we will continue to support
>>>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>> functionality compared to Spark 3. I believe we discussed option 3
>>>>>>>>>>>> when we were doing the Spark 3.0 to 3.1 upgrade. Recently we are
>>>>>>>>>>>> seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel
>>>>>>>>>>>> we need a consistent strategy around this, so let's take this chance
>>>>>>>>>>>> to make a good community guideline for all future engine versions,
>>>>>>>>>>>> especially for Spark, Flink and Hive, which are in the same
>>>>>>>>>>>> repository.
>>>>>>>>>>>>
>>>>>>>>>>>> I can totally understand your point of view, Wing. In fact,
>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>>> versions of the software because there are people who are still
>>>>>>>>>>>> using Spark 1.4, believe it or not. After all, continually
>>>>>>>>>>>> backporting changes becomes a liability not only on the user side
>>>>>>>>>>>> but also on the service provider side, so I believe it's not a bad
>>>>>>>>>>>> practice to push for user upgrades, as it will make life easier for
>>>>>>>>>>>> both parties in the end. New features are definitely one of the best
>>>>>>>>>>>> incentives to promote an upgrade on the user side.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the biggest issue with option 3 is its scalability,
>>>>>>>>>>>> because we will have an unbounded list of packages to add and
>>>>>>>>>>>> compile in the future, and we probably cannot drop support for a
>>>>>>>>>>>> package once it is created. If we go with option 1, I think we can
>>>>>>>>>>>> still publish a few patch versions for old Iceberg releases, and
>>>>>>>>>>>> committers can control the number of patch versions to keep people
>>>>>>>>>>>> from abusing the power of patching. I see this as a consistent
>>>>>>>>>>>> strategy for Flink and Hive as well. With this strategy, we can
>>>>>>>>>>>> truly have a compatibility matrix of engine versions against
>>>>>>>>>>>> Iceberg versions.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>> wyp...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for 
>>>>>>>>>>>>> developers,
>>>>>>>>>>>>> but I don't think it considers the interests of users. I do not 
>>>>>>>>>>>>> think that
>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. 
>>>>>>>>>>>>> It is a
>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I 
>>>>>>>>>>>>> think we all
>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes 
>>>>>>>>>>>>> from 3.0 to
>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users 
>>>>>>>>>>>>> running
>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop 
>>>>>>>>>>>>> supporting
>>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same 
>>>>>>>>>>>>> organization, don't
>>>>>>>>>>>>> they? And they don't have a problem with making their users, all 
>>>>>>>>>>>>> internal,
>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already 
>>>>>>>>>>>>> running an
>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>>> versions of Spark. It is true that we can backport new features 
>>>>>>>>>>>>> to older
>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to 
>>>>>>>>>>>>> Iceberg work
>>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, 
>>>>>>>>>>>>> or provide
>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and 
>>>>>>>>>>>>> either way,
>>>>>>>>>>>>> the organizations have the ability to backport features and fixes 
>>>>>>>>>>>>> to
>>>>>>>>>>>>> internal versions. Are there any users out there who simply use 
>>>>>>>>>>>>> Apache
>>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>>
>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 
>>>>>>>>>>>>> (and even
>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are 
>>>>>>>>>>>>> required; what
>>>>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down development. Considering the
>>>>>>>>>>>>>> limited resources in the open source community, the upsides of
>>>>>>>>>>>>>> options 2 and 3 are probably not worth it.
>>>>>>>>>>>>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to
>>>>>>>>>>>>>> predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>> still get a new feature by backporting it to an older version in
>>>>>>>>>>>>>> case upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The easiest option for us devs; it forces the user to upgrade
>>>>>>>>>>>>>>> to the most recent minor Spark version to consume any new
>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, which may
>>>>>>>>>>>>>>> slow down development.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>> Can consume unreleased changes, but it will require at least
>>>>>>>>>>>>>>> 5 modules to support 2.4, 3.1 and 3.2, making the build and
>>>>>>>>>>>>>>> testing complicated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>> version (e.g., 3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would like
>>>>>>>>>>>>>>> to hear what other people think/need.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we should go for option 1. I'm already not a big fan of
>>>>>>>>>>>>>>> having runtime errors for unsupported things based on versions,
>>>>>>>>>>>>>>> and I don't think minor version upgrades are a large issue for
>>>>>>>>>>>>>>> users. I'm especially not looking forward to supporting interfaces
>>>>>>>>>>>>>>> that only exist in Spark 3.2 in a future where we support multiple
>>>>>>>>>>>>>>> Spark versions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks 
>>>>>>>>>>>>>>> ago, and
>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross 
>>>>>>>>>>>>>>> reference and
>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same 
>>>>>>>>>>>>>>> repository.
>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest 
>>>>>>>>>>>>>>> versions in a
>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it 
>>>>>>>>>>>>>>> means we have
>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 
>>>>>>>>>>>>>>> that is being
>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. 
>>>>>>>>>>>>>>> On top of
>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and 
>>>>>>>>>>>>>>> may break not
>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there are some features requiring a newer version, it
>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. 
>>>>>>>>>>>>>>> Personally, I don’t
>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want 
>>>>>>>>>>>>>>> new features.
>>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach 
>>>>>>>>>>>>>>> too that we
>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features 
>>>>>>>>>>>>>>> would also
>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with 
>>>>>>>>>>>>>>> that and that
>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to 
>>>>>>>>>>>>>>> find a
>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for 
>>>>>>>>>>>>>>> the users.
>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient 
>>>>>>>>>>>>>>> but our Spark
>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks 
>>>>>>>>>>>>>>> ago, and
>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross 
>>>>>>>>>>>>>>> reference and
>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same 
>>>>>>>>>>>>>>> repository.
>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest 
>>>>>>>>>>>>>>> versions in a
>>>>>>>>>>>>>>> major version. This avoids the problem that some users are 
>>>>>>>>>>>>>>> unwilling to
>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version 
>>>>>>>>>>>>>>> branches. If
>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes 
>>>>>>>>>>>>>>> sense to move
>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>>>>>> feature-complete reference implementation compared to all other 
>>>>>>>>>>>>>>> engines, I
>>>>>>>>>>>>>>> think we should not add artificial barriers that would slow 
>>>>>>>>>>>>>>> down its
>>>>>>>>>>>>>>> development speed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>>> great to support older versions, but because we compile against
>>>>>>>>>>>>>>>> 3.0, we cannot use any Spark features that are offered in newer
>>>>>>>>>>>>>>>> versions. Spark 3.2 is just around the corner and it brings a lot
>>>>>>>>>>>>>>>> of important features, such as dynamic filtering for v2 tables,
>>>>>>>>>>>>>>>> required distribution and ordering for writes, etc. These features
>>>>>>>>>>>>>>>> are too important to ignore.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 
>>>>>>>>>>>>>>>> 3.2 features.
>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally 
>>>>>>>>>>>>>>>> and would
>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while
>>>>>>>>>>>>>>>> by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is 
>>>>>>>>>>>>>>>> actively
>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 
>>>>>>>>>>>>>>>> releases will only
>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate 
>>>>>>>>>>>>>>>> to Spark 3.2
>>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work
>>>>>>>>>>>>>>>> to release, will need a new release of the core format to 
>>>>>>>>>>>>>>>> consume any
>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my
>>>>>>>>>>>>>>>> main worry is that we will have to release the format more 
>>>>>>>>>>>>>>>> frequently
>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and 
>>>>>>>>>>>>>>>> the overall
>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>>
>>>>>
