Wing, sorry, my earlier message probably misled you. I was only giving my personal opinion on Flink version support.

On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Hi OpenInx,
> I'm sorry I misunderstood the thinking of the Flink community. Thanks for the clarification.
> - Wing Yew
>
> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote:
>
>> Hi Wing,
>>
>> As we discussed above, we as a community prefer option 2 or option 3. So in fact, when we planned to upgrade the Flink version from 1.12 to 1.13, we did our best to guarantee that the master Iceberg repo could work fine with both Flink 1.12 and Flink 1.13. For more context, please see [1], [2], [3].
>>
>> [1] https://github.com/apache/iceberg/pull/3116
>> [2] https://github.com/apache/iceberg/issues/3183
>> [3] https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>
>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>
>>> In the last community sync, we spent a little time on this topic. For Spark support, there are currently two options under consideration:
>>>
>>> Option 2: Separate repo for the Spark support. Use branches for supporting different Spark versions. Main branch for the latest Spark version (3.2 to begin with). Tooling needs to be built for producing regular snapshots of core Iceberg in a consumable way for this repo. Unclear if commits to core Iceberg will be tested pre-commit against Spark support; my impression is that they will not be, and the Spark support build can be broken by changes to core.
>>>
>>> A variant of option 3 (which we will simply call Option 3 going forward): Single repo, separate module (subdirectory) for each Spark version to be supported. Code duplication in each Spark module (no attempt to refactor out common code). Each module built against the specific version of Spark to be supported, producing a runtime jar built against that version. CI will test all modules. Support can be provided for only building the modules a developer cares about.
>>>
>>> More input was sought and people are encouraged to voice their preference. I lean towards Option 3.
>>>
>>> - Wing Yew
>>>
>>> ps. In the sync, as Steven Wu wrote, the question was raised whether the same multi-version support strategy can be adopted across engines. Based on what Steven wrote, the Flink developer community's bandwidth currently makes supporting only a single Flink version (and focusing resources on developing new features on that version) the preferred choice. If so, then no multi-version support strategy for Flink is needed at this time.
>>>
>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> During the sync meeting, people talked about if and how we can have the same version support model across engines like Flink and Spark. I can provide some input from the Flink side.
>>>>
>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is the latest released version. That means only Flink 1.12 and 1.13 are supported. Feature changes or bug fixes will only be backported to 1.12 and 1.13, unless it is a serious bug (like security). With that context, personally I like option 1 (with one actively supported Flink version in the master branch) for the iceberg-flink module.
>>>>
>>>> We discussed the idea of supporting multiple Flink versions via a shim layer and multiple modules.
>>>> While it may be a little better to support multiple Flink versions, I don't know if there is enough support and resources from the community to pull it off. There is also the ongoing maintenance burden for each minor version release from Flink, which happens roughly every 4 months.
>>>>
>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>
>>>>> Since you mentioned Hive, I chime in with what we do there. You might find it useful:
>>>>> - metastore module - only small differences; DynConstructor solves it for us
>>>>> - mr module - some bigger differences, but still manageable for Hive 2-3. Need some new classes, but most of the code is reused - extra module for Hive 3. For Hive 4 we use a different repo as we moved to the Hive codebase.
>>>>>
>>>>> My thoughts based on the above experience:
>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have problems with backporting changes between repos and we are lagging behind, which hurts both projects.
>>>>> - The Hive 2-3 model works better by forcing us to keep things in sync, but with serious differences in the Hive project it still doesn't seem like a viable option.
>>>>>
>>>>> So I think the question is: how stable is the Spark code we are integrating with? If it is fairly stable, then we are better off with a "one repo, multiple modules" approach, and we should consider the multi-repo only if the differences become prohibitive.
>>>>>
>>>>> Thanks, Peter
>>>>>
>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <aokolnyc...@apple.com.invalid> wrote:
>>>>>
>>>>>> Okay, looks like there is consensus around supporting multiple Spark versions at the same time. There are folks who mentioned this on this thread and there were folks who brought this up during the sync.
>>>>>>
>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>
>>>>>> Option 2
>>>>>>
>>>>>> In Option 2, there will be a separate repo. I believe the master branch will soon point to Spark 3.2 (the most recent supported version). The main development will happen there and the artifact version will be 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where we will cherry-pick applicable changes. Once we are ready to release the 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for cherry-picks.
>>>>>>
>>>>>> I guess we will continue to shade everything in the new repo and will have to release every time the core is released. We will do a maintenance release for each supported Spark version whenever we cut a new maintenance Iceberg release or need to fix any bugs in the Spark integration. Under this model, we will probably need nightly snapshots (or one per commit) of the core format, and the Spark integration will depend on snapshots until we are ready to release.
>>>>>>
>>>>>> Overall, I think this option gives us very simple builds and provides the best separation. It will keep the main repo clean. The main downside is that we will have to split a Spark feature into two PRs: one against the core and one against the Spark integration.
>>>>>> Certain changes in core can also break the Spark integration and will require adaptations.
>>>>>>
>>>>>> Ryan, I am not sure I fully understood the testing part. How will we be able to test the Spark integration in the main repo if certain changes in core may break the Spark integration and require changes there? Will we try to prohibit such changes?
>>>>>>
>>>>>> Option 3 (modified)
>>>>>>
>>>>>> If I understand correctly, the modified Option 3 sounds very close to the approach initially suggested by Imran, but with code duplication instead of extra refactoring and introducing new common modules.
>>>>>>
>>>>>> Jack, are you suggesting we test only a single Spark version at a time? Or do we expect to test all versions? Will there be any difference compared to just having a module per version? I did not fully understand.
>>>>>>
>>>>>> My worry with this approach is that our build will be very complicated and we will still have a lot of Spark-related modules in the main repo. Once people start using Flink and Hive more, will we have to do the same?
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> I'd support the option that Jack suggests if we can set a few expectations for keeping it clean.
>>>>>>
>>>>>> First, I'd like to avoid refactoring code to share it across Spark versions -- that introduces risk because we're relying on compiling against one version and running in another, and both Spark and Scala change rapidly. A big benefit of options 1 and 2 is that we mostly focus on only one Spark version. I think we should duplicate code rather than spend time refactoring to rely on binary compatibility. I propose we start each new Spark version by copying the last one and updating it. And we should build just the latest supported version by default.
>>>>>>
>>>>>> The drawback to having everything in a single repo is that we wouldn't be able to cherry-pick changes across Spark versions/branches, but I think Jack is right that having a single build is better.
>>>>>>
>>>>>> Second, we should make CI faster by running the Spark builds in parallel. It sounds like this is what would happen anyway, with a property that selects the Spark version that you want to build against.
>>>>>>
>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway; as Wing listed, we are just using the git branch as an additional dimension, but my understanding is that you will still have 1 core, 1 extension, and 1 runtime artifact published for each Spark version in either approach.
>>>>>>>
>>>>>>> In that case (this is just brainstorming), I wonder if we can explore a modified option 3 that flattens all the versions in each Spark branch in option 2 into master. The repository structure would look something like:
>>>>>>>
>>>>>>> iceberg/api/...
>>>>>>>        /bundled-guava/...
>>>>>>>        /core/...
>>>>>>>        ...
>>>>>>>        /spark/2.4/core/...
>>>>>>>                  /extension/...
>>>>>>>                  /runtime/...
>>>>>>>              /3.1/core/...
>>>>>>>                  /extension/...
>>>>>>>                  /runtime/...
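To make the "modified option 3" layout above concrete, here is a minimal sketch of how a root Gradle settings script could include only the Spark versions a developer asks for, defaulting to the newest one. The property name, module names, and directory paths are assumptions for illustration only, not the actual Iceberg build.

```kotlin
// settings.gradle.kts -- hypothetical sketch; property, module, and path names are assumptions.
import java.io.File

rootProject.name = "iceberg"

// Always build the engine-independent modules.
include("iceberg-api", "iceberg-core", "iceberg-bundled-guava")

// Build only the Spark versions requested via -PsparkVersions=2.4,3.1,3.2,
// defaulting to the newest supported version so a plain build stays small.
val sparkVersions: List<String> =
    (startParameter.projectProperties["sparkVersions"] ?: "3.2")
        .split(",")
        .map { it.trim() }
        .filter { it.isNotEmpty() }

sparkVersions.forEach { v ->
    // One core, one extensions, and one runtime module per Spark version,
    // mapped onto the flattened spark/<version>/... directory layout.
    include("iceberg-spark-$v", "iceberg-spark-extensions-$v", "iceberg-spark-runtime-$v")
    project(":iceberg-spark-$v").projectDir = File(settingsDir, "spark/$v/core")
    project(":iceberg-spark-extensions-$v").projectDir = File(settingsDir, "spark/$v/extension")
    project(":iceberg-spark-runtime-$v").projectDir = File(settingsDir, "spark/$v/runtime")
}
```

With something like this, a plain `./gradlew build` would compile and test only the newest Spark modules, while CI could pass `-PsparkVersions` explicitly to fan the remaining versions out into parallel jobs, which matches the property-driven selection discussed elsewhere in this thread.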
>>>>>>> The Gradle build script in the root is configured to build against the latest version of Spark by default, unless otherwise specified by the user.
>>>>>>>
>>>>>>> IntelliJ can also be configured to only index files of specific versions based on the same config used in the build.
>>>>>>>
>>>>>>> In this way, I imagine the CI setup would be much easier for things like testing version compatibility for a feature, or running only a specific subset of Spark version builds based on the Spark version directories touched.
>>>>>>>
>>>>>>> And the biggest benefit is that we don't have the same difficulty as option 2 of developing a feature when it touches both core and Spark.
>>>>>>>
>>>>>>> We can then develop a mechanism to vote to stop support of certain versions, and archive the corresponding directory to avoid accumulating too many versions in the long term.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and Iceberg Spark, I just didn't mention it, and I see how that's a big thing to leave out!
>>>>>>>>
>>>>>>>> I would definitely want to test the projects together. One thing we could do is have a nightly build like Russell suggests. I'm also wondering if we could have some tighter integration where the Iceberg Spark build can be included in the Iceberg Java build using properties. Maybe the GitHub action could check out Iceberg, then check out the Spark integration's latest branch, and then run the Gradle build with a property that makes Spark a subproject in the build. That way we can continue to have Spark CI run regularly.
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I agree that Option 2 is considerably more difficult for development when core API changes need to be picked up by the external Spark module. I also think a monthly release would probably still be prohibitive to actually implementing new features that appear in the API. I would hope we have a much faster process, or maybe just have snapshot artifacts published nightly?
>>>>>>>>>
>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate repo (subproject of Iceberg). Would we have branches such as 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all versions or all Spark 3 versions, we would need to commit the changes to all applicable branches. Basically we are trading more work to commit to multiple branches for a simplified build and CI time per branch, which might be an acceptable trade-off. However, the biggest downside is that changes may need to be made in core Iceberg as well as in the engine (in this case Spark) support, and we need to wait for a release of core Iceberg to consume the changes in the subproject.
>>>>>>>>> In this case, maybe we should have a monthly release of core Iceberg (no matter how many changes go in, as long as it is non-zero) so that the subproject can consume changes fairly quickly?
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of potential solutions well defined.
>>>>>>>>>>
>>>>>>>>>> Looks like the next step is to decide whether we want to require people to update Spark versions to pick up newer versions of Iceberg. If we choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>>>>
>>>>>>>>>> I don’t think that we should make updating Spark a requirement. Many of the things that we’re working on are orthogonal to Spark versions, like table maintenance actions, secondary indexes, the 1.0 API, views, ORC delete files, new storage implementations, etc. Upgrading Spark is time consuming and untrusted in my experience, so I think we would be setting up an unnecessary trade-off between spending lots of time to upgrade Spark and picking up new Iceberg features.
>>>>>>>>>>
>>>>>>>>>> Another way of thinking about this is that if we went with option 1, then we could port bug fixes into 0.12.x. But there are many things that wouldn’t fit this model, like adding a FileIO implementation for ADLS. So some people in the community would have to maintain branches of newer Iceberg versions with older versions of Spark outside of the main Iceberg project — that defeats the purpose of simplifying things with option 1, because we would then have more people maintaining the same 0.13.x with Spark 3.1 branch. (This reminds me of the Spark community, where we wanted to release a 2.5 line with DSv2 backported, but the community decided not to, so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>>>>>
>>>>>>>>>> If the community is going to do the work anyway — and I think some of us would — we should make it possible to share that work. That’s why I don’t think that we should go with option 1.
>>>>>>>>>>
>>>>>>>>>> If we don’t go with option 1, then the choice is how to maintain multiple Spark versions. I think that the way we’re doing it right now is not something we want to continue.
>>>>>>>>>>
>>>>>>>>>> Using multiple modules (option 3) is concerning to me because of the changes in Spark. We currently structure the library to share as much code as possible. But that means compiling against different Spark versions and relying on binary compatibility and reflection in some cases. To me, this seems unmaintainable in the long run because it requires refactoring common classes and spending a lot of time deduplicating code. It also creates a ton of modules: at least one common module, then a module per version, then an extensions module per version, and finally a runtime module per version.
>>>>>>>>>> That’s 3 modules per Spark version, plus any new common modules. And each module needs to be tested, which is making our CI take a really long time. We also don’t support multiple Scala versions, which is another gap that will require even more modules and tests.
>>>>>>>>>>
>>>>>>>>>> I like option 2 because it would allow us to compile against a single version of Spark (which will be much more reliable). It would give us an opportunity to support different Scala versions. It avoids the need to refactor to share code and allows people to focus on a single version of Spark, while also creating a way for people to maintain and update the older versions with newer Iceberg releases. I don’t think that this would slow down development. I think it would actually speed it up because we’d be spending less time trying to make multiple versions work in the same build. And anyone in favor of option 1 would basically get option 1: you don’t have to care about branches for older Spark versions.
>>>>>>>>>>
>>>>>>>>>> Jack makes a good point about wanting to keep code in a single repository, but I think that the need to manage more version combinations overrides this concern. It’s easier to make this decision in Python because we’re not trying to depend on two projects that change relatively quickly. We’re just trying to build a library.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up, Anton.
>>>>>>>>>>>
>>>>>>>>>>> Everyone has great pros/cons to support their preferences. Before giving my preference, let me raise one question: what is the top priority for the Apache Iceberg project at this point in time? This question will help us answer the following question: should we support more engine versions more robustly, or be a bit more aggressive and concentrate on getting the new features that users need most in order to keep the project competitive?
>>>>>>>>>>>
>>>>>>>>>>> If people watch the Apache Iceberg project and check the issues & PRs frequently, I guess more than 90% of people will answer the priority question the same way: there is no doubt it is making the whole v2 story production-ready. The current roadmap discussion also proves this: https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>
>>>>>>>>>>> In order to ensure the highest priority at this point in time, I prefer option 1 to reduce the cost of engine maintenance, so as to free up resources to make v2 production-ready.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From the dev's point of view, it is less of a burden to always support the latest version of Spark (for example).
>>>>>>>>>>>> But from the user's point of view, especially for those of us who maintain Spark internally, it is not easy to upgrade the Spark version in the first place (since we have many customizations internally), and we're still promoting the upgrade to 3.1.2. If the community ditches support for old versions of Spark 3, users unavoidably have to maintain it themselves.
>>>>>>>>>>>>
>>>>>>>>>>>> So I'm inclined to keep this support in the community rather than leave it to users themselves; as for Option 2 or 3, I'm fine with either. And to relieve the burden, we could support a limited number of Spark versions (for example, 2 versions).
>>>>>>>>>>>>
>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>
>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 1:35 PM, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think 2.4 is a different story; we will continue to support Spark 2.4, but as you can see it will continue to have very limited functionality compared to Spark 3. I believe we discussed option 3 when we were doing the Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy around this; let's take this chance to make a good community guideline for all future engine versions, especially for Spark, Flink and Hive, which are in the same repository.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can totally understand your point of view, Wing. In fact, speaking from the perspective of AWS EMR, we have to support over 40 versions of the software because there are people who are still using Spark 1.4, believe it or not. After all, continuing to backport changes becomes a liability not only on the user side but also on the service provider side, so I believe it's not a bad practice to push for user upgrades, as it will make the lives of both parties easier in the end. A new feature is definitely one of the best incentives to promote an upgrade on the user side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the biggest issue with option 3 is its scalability, because we will have an unbounded list of packages to add and compile in the future, and we probably cannot drop support for a package once it is created. If we go with option 1, I think we can still publish a few patch versions for old Iceberg releases, and committers can control the number of patch versions to guard people from abusing the power of patching. I see this as a consistent strategy for Flink and Hive as well. With this strategy, we can truly have a compatibility matrix of engine versions against Iceberg versions.
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users. I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all know that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users running Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark 2.4?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken out in favor of Option 1 all work for the same organization, don't they? And they don't have a problem with making their users, all internal, simply upgrade to Spark 3.2, do they? (Or they are already running an internal fork that is close to 3.2.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I work for an organization with customers running different versions of Spark. It is true that we can backport new features to older versions if we want to. I suppose the people contributing to Iceberg work for some organization or other that either uses Iceberg in-house or provides software (possibly in the form of a service) to customers, and either way, the organizations have the ability to backport features and fixes to internal versions. Are there any users out there who simply use Apache Iceberg and depend on the community version?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There may be features that are broadly useful that do not depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would consider Option 3 too. Anton, you said 5 modules are required; what are the modules you're thinking of?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down development. Considering the limited resources in the open source community, the upsides of options 2 and 3 are probably not worth it.
>>>>>>>>>>>>>>> 2. Both 2 and 3 assume use cases that may not exist.
>>>>>>>>>>>>>>> It's hard to predict anything, but even if these use cases are legit, users can still get a new feature by backporting it to an older version in case upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>>>>>>> The easiest option for us devs; forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>> Can support as many Spark versions as needed, and the codebase is still separate as we can use separate branches. Impossible to consume any unreleased changes in core; may slow down development.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>> Introduces more modules in the same project. Can consume unreleased changes, but it will require at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version (e.g., 3.1 to 3.2) to consume new features is a blocker? We follow Option 1 internally at the moment, but I would like to hear what other people think/need.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions, and I don't think minor version upgrades are a large issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple-Spark-version-support future.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2, which is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that, and not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient, but our Spark integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>>>>>>>>>>>>>> In addition, because Spark is currently considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions, but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions. Spark 3.2 is just around the corner and it brings a lot of important features such as dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master; maintain 0.12 for a while by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master could also work with older Spark versions, but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark; we can support as many Spark versions as needed.
>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work to release, and we will need a new release of the core format to consume any changes in the Spark integration.
>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user, but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
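To make the Option 2 trade-off described above more concrete, the following is a minimal sketch of how a separate Spark-integration repo might consume nightly snapshots of core between Iceberg releases, as several messages in the thread suggest. The coordinates, version numbers, and snapshot repository URL are assumptions for illustration, not a published setup.

```kotlin
// build.gradle.kts of a hypothetical separate iceberg-spark repo -- sketch only.
// Coordinates, versions, and the snapshot repository are illustrative assumptions.
plugins {
    `java-library`
}

repositories {
    mavenCentral()
    // Nightly (or per-commit) snapshots of core would be published somewhere like this.
    maven("https://repository.apache.org/content/repositories/snapshots/")
}

dependencies {
    // Track unreleased core changes through SNAPSHOT artifacts between Iceberg releases.
    api("org.apache.iceberg:iceberg-api:0.13.0-SNAPSHOT")
    api("org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT")

    // Compile against exactly one Spark version per branch (master tracking the newest).
    compileOnly("org.apache.spark:spark-sql_2.12:3.2.0")
}
```

Until a core change is published at least as a snapshot, a feature that spans core and the Spark integration cannot be finished in the separate repo, which is why a monthly release or a nightly/per-commit snapshot cadence comes up repeatedly in the thread.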