> We should probably add a section to our Flink docs that explains and
links to Flink’s support policy and has a table of Iceberg versions that
work with Flink versions. (We should probably have the same table for
Spark, too!)

Thanks Ryan for the suggestion. I created a separate issue earlier to track
this: https://github.com/apache/iceberg/issues/3115 . I will move this
forward.

On Thu, Oct 7, 2021 at 1:55 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> I tried to prototype option 3, here is the PR:
> https://github.com/apache/iceberg/pull/3237
>
> Sorry, I did not see that Anton is planning to do it, but anyway it's just
> a draft, so feel free to use it as a reference.
>
> Best,
> Jack Ye
>
> On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks for the context on the Flink side! I think it sounds reasonable to
>> keep up to date with the latest supported Flink version. If we want, we
>> could later go with something similar to what we do for Spark but we’ll see
>> how it goes and what the Flink community needs. We should probably add a
>> section to our Flink docs that explains and links to Flink’s support policy
>> and has a table of Iceberg versions that work with Flink versions. (We
>> should probably have the same table for Spark, too!)
>>
>> For Spark, I’m also leaning toward the modified option 3 where we keep
>> all of the code in the main repository but only build with one module at a
>> time by default. It makes sense to switch based on modules — rather than
>> selecting src paths within a module — so that it is easy to run a build
>> with all modules if you choose to — for example, when building release
>> binaries.
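>>
>> As a rough sketch of that switch (module and property names below are only
>> placeholders, not the actual build), the selection could live in
>> settings.gradle:
>>
>>   // settings.gradle -- hypothetical sketch of property-based selection
>>   // Build only the latest Spark version by default; override with e.g.
>>   //   ./gradlew -DsparkVersions=2.4,3.1,3.2 build
>>   def sparkVersions = (System.getProperty('sparkVersions') ?: '3.2').tokenize(',')
>>   if (sparkVersions.contains('2.4')) include ':iceberg-spark:iceberg-spark-2.4'
>>   if (sparkVersions.contains('3.1')) include ':iceberg-spark:iceberg-spark-3.1'
>>   if (sparkVersions.contains('3.2')) include ':iceberg-spark:iceberg-spark-3.2'
>>
>> so a developer builds one Spark module by default while CI or a release
>> build passes every supported version.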
>>
>> The reason I think we should go with option 3 is for testing. If we have
>> a single repo with api, core, etc. and spark then changes to the common
>> modules can be tested by CI actions. Updates to individual Spark modules
>> would be completely independent. There is a slight inconvenience that when
>> an API used by Spark changes, the author would still need to fix multiple
>> Spark versions. But the trade-off is that with a separate repository like
>> option 2, changes that break Spark versions are not caught and then the
>> Spark repository’s CI ends up failing on completely unrelated changes. That
>> would be a major pain, felt by everyone contributing to the Spark
>> integration, so I think option 3 is the best path forward.
>>
>> It sounds like we probably have some agreement now, but please speak up
>> if you think another option would be better.
>>
>> The next step is to prototype the build changes to test out option 3. Or
>> if you prefer option 2, then prototype those changes as well. I think that
>> Anton is planning to do this, but if you have time and the desire to do it
>> please reach out and coordinate with us!
>>
>> Ryan
>>
>> On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> Wing, sorry, my earlier message probably misled you. I was speaking my
>>> personal opinion on Flink version support.
>>>
>>> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi OpenInx,
>>>> I'm sorry I misunderstood the thinking of the Flink community. Thanks
>>>> for the clarification.
>>>> - Wing Yew
>>>>
>>>>
>>>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote:
>>>>
>>>>> Hi Wing
>>>>>
>>>>> As we discussed above, the community prefers option 2 or option 3. So in
>>>>> fact, when we planned to upgrade the Flink version from 1.12 to 1.13, we
>>>>> did our best to guarantee that the master Iceberg repo works fine with
>>>>> both Flink 1.12 and Flink 1.13. For more context, please see [1], [2], [3]
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/3116
>>>>> [2] https://github.com/apache/iceberg/issues/3183
>>>>> [3]
>>>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>>>>
>>>>>
>>>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon
>>>>> <wyp...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> In the last community sync, we spent a little time on this topic. For
>>>>>> Spark support, there are currently two options under consideration:
>>>>>>
>>>>>> Option 2: Separate repo for the Spark support. Use branches for
>>>>>> supporting different Spark versions. Main branch for the latest Spark
>>>>>> version (3.2 to begin with).
>>>>>> Tooling needs to be built for producing regular snapshots of core
>>>>>> Iceberg in a consumable way for this repo. Unclear if commits to core
>>>>>> Iceberg will be tested pre-commit against Spark support; my impression is
>>>>>> that they will not be, and the Spark support build can be broken by 
>>>>>> changes
>>>>>> to core.
>>>>>>
>>>>>> A variant of option 3 (which we will simply call Option 3 going
>>>>>> forward): Single repo, separate module (subdirectory) for each Spark
>>>>>> version to be supported. Code duplication in each Spark module (no 
>>>>>> attempt
>>>>>> to refactor out common code). Each module built against the specific
>>>>>> version of Spark to be supported, producing a runtime jar built against
>>>>>> that version. CI will test all modules. Support can be provided for only
>>>>>> building the modules a developer cares about.
>>>>>>
>>>>>> More input was sought and people are encouraged to voice their
>>>>>> preference.
>>>>>> I lean towards Option 3.
>>>>>>
>>>>>> - Wing Yew
>>>>>>
>>>>>> ps. In the sync, as Steven Wu wrote, the question was raised if the
>>>>>> same multi-version support strategy can be adopted across engines. Based 
>>>>>> on
>>>>>> what Steven wrote, currently the Flink developer community's bandwidth
>>>>>> makes supporting only a single Flink version (and focusing resources on
>>>>>> developing new features on that version) the preferred choice. If so, 
>>>>>> then
>>>>>> no multi-version support strategy for Flink is needed at this time.
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> During the sync meeting, people talked about if and how we can have
>>>>>>> the same version support model across engines like Flink and Spark. I 
>>>>>>> can
>>>>>>> provide some input from the Flink side.
>>>>>>>
>>>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13
>>>>>>> is the latest released version. That means only Flink 1.12 and 1.13 are
>>>>>>> supported. Feature changes or bug fixes will only be backported to 1.12 
>>>>>>> and
>>>>>>> 1.13, unless it is a serious bug (like security). With that context,
>>>>>>> personally I like option 1 (with one actively supported Flink version in
>>>>>>> master branch) for the iceberg-flink module.
>>>>>>>
>>>>>>> We discussed the idea of supporting multiple Flink versions via a shim
>>>>>>> layer and multiple modules. While it may be a little better to support
>>>>>>> multiple Flink versions, I don't know if there is enough support and
>>>>>>> resources from the community to pull it off. There is also the ongoing
>>>>>>> maintenance burden for each minor version release from Flink, which
>>>>>>> happens roughly every 4 months.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary
>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Since you mentioned Hive, I'll chime in with what we do there. You
>>>>>>>> might find it useful:
>>>>>>>> - metastore module: only small differences; DynConstructors solves them
>>>>>>>> for us
>>>>>>>> - mr module: some bigger differences, but still manageable for Hive 2-3.
>>>>>>>> Some new classes are needed, but most of the code is reused, plus an
>>>>>>>> extra module for Hive 3.
>>>>>>>> - For Hive 4 we use a different repo, as we moved to the Hive codebase.
>>>>>>>>
>>>>>>>> My thoughts based on the above experience:
>>>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>>>>>> have problems with backporting changes between repos, and we are lagging
>>>>>>>> behind, which hurts both projects.
>>>>>>>> - The Hive 2-3 model works better by forcing us to keep things in sync,
>>>>>>>> but with the serious differences in the Hive project it still doesn't
>>>>>>>> seem like a viable option.
>>>>>>>>
>>>>>>>> So I think the question is: how stable is the Spark code we are
>>>>>>>> integrating with? If it is fairly stable, then we are better off with a
>>>>>>>> "one repo, multiple modules" approach and we should consider a multi-repo
>>>>>>>> repo multiple modules" approach and we should consider the multirepo 
>>>>>>>> only
>>>>>>>> if the differences become prohibitive.
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>>>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Okay, looks like there is consensus around supporting multiple
>>>>>>>>> Spark versions at the same time. There are folks who mentioned this 
>>>>>>>>> on this
>>>>>>>>> thread and there were folks who brought this up during the sync.
>>>>>>>>>
>>>>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>>>>
>>>>>>>>> Option 2
>>>>>>>>>
>>>>>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>>>>>> branch will soon point to Spark 3.2 (the most recent supported 
>>>>>>>>> version).
>>>>>>>>> The main development will happen there and the artifact version will 
>>>>>>>>> be
>>>>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>>>>>> branches where we will cherry-pick applicable changes. Once we are 
>>>>>>>>> ready to
>>>>>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and 
>>>>>>>>> cut 3
>>>>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump 
>>>>>>>>> the
>>>>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 
>>>>>>>>> 0.2.x-spark-3.1
>>>>>>>>> branches for cherry-picks.
>>>>>>>>>
>>>>>>>>> I guess we will continue to shade everything in the new repo and
>>>>>>>>> will have to release every time the core is released. We will do a
>>>>>>>>> maintenance release for each supported Spark version whenever we cut 
>>>>>>>>> a new
>>>>>>>>> maintenance Iceberg release or need to fix any bugs in the Spark
>>>>>>>>> integration.
>>>>>>>>> Under this model, we will probably need nightly snapshots (or on
>>>>>>>>> each commit) for the core format and the Spark integration will 
>>>>>>>>> depend on
>>>>>>>>> snapshots until we are ready to release.
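>>>>>>>>>
>>>>>>>>> For illustration only (the snapshot version is a placeholder), the
>>>>>>>>> Spark repo's build.gradle would then pull core from the snapshot
>>>>>>>>> repository, roughly:
>>>>>>>>>
>>>>>>>>>   // build.gradle in the separate Spark repo -- hypothetical sketch
>>>>>>>>>   plugins { id 'java-library' }
>>>>>>>>>   repositories {
>>>>>>>>>     mavenCentral()
>>>>>>>>>     // Apache snapshots repo for the nightly/per-commit core builds
>>>>>>>>>     maven { url 'https://repository.apache.org/content/repositories/snapshots/' }
>>>>>>>>>   }
>>>>>>>>>   dependencies {
>>>>>>>>>     api 'org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT'
>>>>>>>>>   }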
>>>>>>>>>
>>>>>>>>> Overall, I think this option gives us very simple builds and provides
>>>>>>>>> the best separation. It will keep the main repo clean. The main downside
>>>>>>>>> is that we will have to split a Spark feature into two PRs: one against
>>>>>>>>> the core and one against the Spark integration. Certain changes in core
>>>>>>>>> can also break the Spark integration and will require adaptations.
>>>>>>>>>
>>>>>>>>> Ryan, I am not sure I fully understood the testing part. How will
>>>>>>>>> we be able to test the Spark integration in the main repo if certain
>>>>>>>>> changes in core may break the Spark integration and require changes 
>>>>>>>>> there?
>>>>>>>>> Will we try to prohibit such changes?
>>>>>>>>>
>>>>>>>>> Option 3 (modified)
>>>>>>>>>
>>>>>>>>> If I understand correctly, the modified Option 3 sounds very close to
>>>>>>>>> the approach initially suggested by Imran, but with code duplication
>>>>>>>>> instead of extra refactoring and introducing new common modules.
>>>>>>>>>
>>>>>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>>>>>> time? Or do we expect to test all versions? Will there be any 
>>>>>>>>> difference
>>>>>>>>> compared to just having a module per version? I did not fully
>>>>>>>>> understand.
>>>>>>>>>
>>>>>>>>> My worry with this approach is that our build will be very
>>>>>>>>> complicated and we will still have a lot of Spark-related modules in 
>>>>>>>>> the
>>>>>>>>> main repo. Once people start using Flink and Hive more, will we have 
>>>>>>>>> to do
>>>>>>>>> the same?
>>>>>>>>>
>>>>>>>>> - Anton
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>> I'd support the option that Jack suggests if we can set a few
>>>>>>>>> expectations for keeping it clean.
>>>>>>>>>
>>>>>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>>>>>> versions -- that introduces risk because we're relying on compiling 
>>>>>>>>> against
>>>>>>>>> one version and running in another and both Spark and Scala change 
>>>>>>>>> rapidly.
>>>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one 
>>>>>>>>> Spark
>>>>>>>>> version. I think we should duplicate code rather than spend time
>>>>>>>>> refactoring to rely on binary compatibility. I propose we start each 
>>>>>>>>> new
>>>>>>>>> Spark version by copying the last one and updating it. And we should 
>>>>>>>>> build
>>>>>>>>> just the latest supported version by default.
>>>>>>>>>
>>>>>>>>> The drawback to having everything in a single repo is that we
>>>>>>>>> wouldn't be able to cherry-pick changes across Spark 
>>>>>>>>> versions/branches, but
>>>>>>>>> I think Jack is right that having a single build is better.
>>>>>>>>>
>>>>>>>>> Second, we should make CI faster by running the Spark builds in
>>>>>>>>> parallel. It sounds like this is what would happen anyway, with a 
>>>>>>>>> property
>>>>>>>>> that selects the Spark version that you want to build against.
>>>>>>>>>
>>>>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway; as
>>>>>>>>>> Wing listed, we are just using the git branch as an additional
>>>>>>>>>> dimension, but my understanding is that you will still have 1 core, 1
>>>>>>>>>> extension, and 1 runtime artifact published for each Spark version in
>>>>>>>>>> either approach.
>>>>>>>>>>
>>>>>>>>>> In that case, this is just brainstorming, I wonder if we can
>>>>>>>>>> explore a modified option 3 that flattens all the versions in each 
>>>>>>>>>> Spark
>>>>>>>>>> branch in option 2 into master. The repository structure would look
>>>>>>>>>> something like:
>>>>>>>>>>
>>>>>>>>>> iceberg/api/...
>>>>>>>>>>             /bundled-guava/...
>>>>>>>>>>             /core/...
>>>>>>>>>>             ...
>>>>>>>>>>             /spark/2.4/core/...
>>>>>>>>>>                             /extension/...
>>>>>>>>>>                             /runtime/...
>>>>>>>>>>                       /3.1/core/...
>>>>>>>>>>                             /extension/...
>>>>>>>>>>                             /runtime/...
>>>>>>>>>>
>>>>>>>>>> The gradle build script in the root is configured to build
>>>>>>>>>> against the latest version of Spark by default, unless otherwise 
>>>>>>>>>> specified
>>>>>>>>>> by the user.
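>>>>>>>>>>
>>>>>>>>>> Each versioned directory would then pin its own Spark dependency so
>>>>>>>>>> that every module compiles against exactly one Spark release. A
>>>>>>>>>> sketch (coordinates are only illustrative):
>>>>>>>>>>
>>>>>>>>>>   // spark/3.1/core/build.gradle -- hypothetical sketch
>>>>>>>>>>   plugins { id 'java-library' }
>>>>>>>>>>   dependencies {
>>>>>>>>>>     api project(':iceberg-core')
>>>>>>>>>>     compileOnly 'org.apache.spark:spark-sql_2.12:3.1.2'
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   // spark/3.2/core/build.gradle -- hypothetical sketch
>>>>>>>>>>   plugins { id 'java-library' }
>>>>>>>>>>   dependencies {
>>>>>>>>>>     api project(':iceberg-core')
>>>>>>>>>>     compileOnly 'org.apache.spark:spark-sql_2.12:3.2.0'
>>>>>>>>>>   }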
>>>>>>>>>>
>>>>>>>>>> Intellij can also be configured to only index files of specific
>>>>>>>>>> versions based on the same config used in build.
>>>>>>>>>>
>>>>>>>>>> In this way, I imagine the CI setup would make it much easier to do
>>>>>>>>>> things like testing version compatibility for a feature, or running
>>>>>>>>>> only a specific subset of Spark version builds based on the Spark
>>>>>>>>>> version directories touched.
>>>>>>>>>>
>>>>>>>>>> And the biggest benefit is that we don't have the same difficulty
>>>>>>>>>> as option 2 of developing a feature when it's both in core and Spark.
>>>>>>>>>>
>>>>>>>>>> We can then develop a mechanism to vote to stop support of
>>>>>>>>>> certain versions, and archive the corresponding directory to avoid
>>>>>>>>>> accumulating too many versions in the long term.
>>>>>>>>>>
>>>>>>>>>> -Jack Ye
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java
>>>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a 
>>>>>>>>>>> big
>>>>>>>>>>> thing to leave out!
>>>>>>>>>>>
>>>>>>>>>>> I would definitely want to test the projects together. One thing
>>>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also
>>>>>>>>>>> wondering if we could have some tighter integration where the 
>>>>>>>>>>> Iceberg Spark
>>>>>>>>>>> build can be included in the Iceberg Java build using properties. 
>>>>>>>>>>> Maybe the
>>>>>>>>>>> github action could checkout Iceberg, then checkout the Spark
>>>>>>>>>>> integration's latest branch, and then run the gradle build with a 
>>>>>>>>>>> property
>>>>>>>>>>> that makes Spark a subproject in the build. That way we can 
>>>>>>>>>>> continue to
>>>>>>>>>>> have Spark CI run regularly.
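>>>>>>>>>>>
>>>>>>>>>>> One possible way to wire the two checkouts together (the property
>>>>>>>>>>> name is just a placeholder) is a Gradle composite build declared
>>>>>>>>>>> from the Spark integration side:
>>>>>>>>>>>
>>>>>>>>>>>   // settings.gradle in the Spark integration repo -- hypothetical sketch
>>>>>>>>>>>   // CI checks out core Iceberg next to this repo and runs
>>>>>>>>>>>   //   ./gradlew -DicebergCoreDir=../iceberg build
>>>>>>>>>>>   def coreDir = System.getProperty('icebergCoreDir')
>>>>>>>>>>>   if (coreDir != null) {
>>>>>>>>>>>     // substitutes the published org.apache.iceberg dependencies with
>>>>>>>>>>>     // the local, unreleased core checkout
>>>>>>>>>>>     includeBuild(coreDir)
>>>>>>>>>>>   }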
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree that Option 2 is considerably more difficult for development
>>>>>>>>>>>> when core API changes need to be picked up by the external Spark
>>>>>>>>>>>> module. I also think a monthly release would probably still be
>>>>>>>>>>>> prohibitive for actually implementing new features that appear in the
>>>>>>>>>>>> API; I would hope we have a much faster process, or maybe just have
>>>>>>>>>>>> snapshot artifacts published nightly?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>>>>>> wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such 
>>>>>>>>>>>> as
>>>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can 
>>>>>>>>>>>> be
>>>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would 
>>>>>>>>>>>> need to
>>>>>>>>>>>> commit the changes to all applicable branches. Basically we are 
>>>>>>>>>>>> trading
>>>>>>>>>>>> more work to commit to multiple branches for simplified build and 
>>>>>>>>>>>> CI
>>>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, 
>>>>>>>>>>>> the
>>>>>>>>>>>> biggest downside is that changes may need to be made in core 
>>>>>>>>>>>> Iceberg as
>>>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to 
>>>>>>>>>>>> wait for
>>>>>>>>>>>> a release of core Iceberg to consume the changes in the 
>>>>>>>>>>>> subproject. In this
>>>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no 
>>>>>>>>>>>> matter how
>>>>>>>>>>>> many changes go in, as long as it is non-zero) so that the 
>>>>>>>>>>>> subproject can
>>>>>>>>>>>> consume changes fairly quickly?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the
>>>>>>>>>>>>> set of potential solutions well defined.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like the next step is to decide whether we want to
>>>>>>>>>>>>> require people to update Spark versions to pick up newer versions 
>>>>>>>>>>>>> of
>>>>>>>>>>>>> Iceberg. If we choose to make people upgrade, then option 1 is 
>>>>>>>>>>>>> clearly the
>>>>>>>>>>>>> best choice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t think that we should make updating Spark a
>>>>>>>>>>>>> requirement. Many of the things that we’re working on are 
>>>>>>>>>>>>> orthogonal to
>>>>>>>>>>>>> Spark versions, like table maintenance actions, secondary 
>>>>>>>>>>>>> indexes, the 1.0
>>>>>>>>>>>>> API, views, ORC delete files, new storage implementations, etc. 
>>>>>>>>>>>>> Upgrading
>>>>>>>>>>>>> Spark is time consuming and untrusted in my experience, so I 
>>>>>>>>>>>>> think we would
>>>>>>>>>>>>> be setting up an unnecessary trade-off between spending lots of 
>>>>>>>>>>>>> time to
>>>>>>>>>>>>> upgrade Spark and picking up new Iceberg features.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another way of thinking about this is that if we went with
>>>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are 
>>>>>>>>>>>>> many
>>>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO 
>>>>>>>>>>>>> implementation
>>>>>>>>>>>>> for ADLS. So some people in the community would have to maintain 
>>>>>>>>>>>>> branches
>>>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of 
>>>>>>>>>>>>> the main
>>>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things 
>>>>>>>>>>>>> with
>>>>>>>>>>>>> option 1 because we would then have more people maintaining the 
>>>>>>>>>>>>> same 0.13.x
>>>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, 
>>>>>>>>>>>>> where we
>>>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the 
>>>>>>>>>>>>> community
>>>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, 
>>>>>>>>>>>>> Tencent,
>>>>>>>>>>>>> Apple, etc.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the community is going to do the work anyway — and I think
>>>>>>>>>>>>> some of us would — we should make it possible to share that work. 
>>>>>>>>>>>>> That’s
>>>>>>>>>>>>> why I don’t think that we should go with option 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we don’t go with option 1, then the choice is how to
>>>>>>>>>>>>> maintain multiple Spark versions. I think that the way we’re 
>>>>>>>>>>>>> doing it right
>>>>>>>>>>>>> now is not something we want to continue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because
>>>>>>>>>>>>> of the changes in Spark. We currently structure the library to 
>>>>>>>>>>>>> share as
>>>>>>>>>>>>> much code as possible. But that means compiling against different 
>>>>>>>>>>>>> Spark
>>>>>>>>>>>>> versions and relying on binary compatibility and reflection in 
>>>>>>>>>>>>> some cases.
>>>>>>>>>>>>> To me, this seems unmaintainable in the long run because it 
>>>>>>>>>>>>> requires
>>>>>>>>>>>>> refactoring common classes and spending a lot of time 
>>>>>>>>>>>>> deduplicating code.
>>>>>>>>>>>>> It also creates a ton of modules, at least one common module, 
>>>>>>>>>>>>> then a module
>>>>>>>>>>>>> per version, then an extensions module per version, and finally a 
>>>>>>>>>>>>> runtime
>>>>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any 
>>>>>>>>>>>>> new common
>>>>>>>>>>>>> modules. And each module needs to be tested, which is making our 
>>>>>>>>>>>>> CI take a
>>>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, 
>>>>>>>>>>>>> which is
>>>>>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>>>>>> single version of Spark (which will be much more reliable). It 
>>>>>>>>>>>>> would give
>>>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids 
>>>>>>>>>>>>> the need
>>>>>>>>>>>>> to refactor to share code and allows people to focus on a single 
>>>>>>>>>>>>> version of
>>>>>>>>>>>>> Spark, while also creating a way for people to maintain and 
>>>>>>>>>>>>> update the
>>>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that 
>>>>>>>>>>>>> this would
>>>>>>>>>>>>> slow down development. I think it would actually speed it up 
>>>>>>>>>>>>> because we’d
>>>>>>>>>>>>> be spending less time trying to make multiple versions work in 
>>>>>>>>>>>>> the same
>>>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option 
>>>>>>>>>>>>> 1: you
>>>>>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>>>>>> repository, but I think that the need to manage more version 
>>>>>>>>>>>>> combinations
>>>>>>>>>>>>> overrides this concern. It’s easier to make this decision in 
>>>>>>>>>>>>> python because
>>>>>>>>>>>>> we’re not trying to depend on two projects that change relatively 
>>>>>>>>>>>>> quickly.
>>>>>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Everyone has raised great pros and cons to support their preferences.
>>>>>>>>>>>>>> Before giving my preference, let me raise one question: what is the top
>>>>>>>>>>>>>> priority for the Apache Iceberg project at this point in time? This
>>>>>>>>>>>>>> question will help us answer the following one: should we support more
>>>>>>>>>>>>>> engine versions more robustly, or be a bit more aggressive and
>>>>>>>>>>>>>> concentrate on delivering the new features that users need most in
>>>>>>>>>>>>>> order to keep the project competitive?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If people watch the Apache Iceberg project and check the issues & PRs
>>>>>>>>>>>>>> frequently, I guess more than 90% of people will answer the priority
>>>>>>>>>>>>>> question the same way: there is no doubt it is making the whole v2
>>>>>>>>>>>>>> story production-ready. The current roadmap discussion also proves the
>>>>>>>>>>>>>> point:
>>>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to focus on the highest priority at this point in time, I
>>>>>>>>>>>>>> prefer option 1 to reduce the cost of engine maintenance, so as to
>>>>>>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>>>>>> sai.sai.s...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From a developer's point of view, it is less of a burden to always
>>>>>>>>>>>>>>> support only the latest version of Spark (for example). But from a
>>>>>>>>>>>>>>> user's point of view, especially for those of us who maintain Spark
>>>>>>>>>>>>>>> internally, it is not easy to upgrade the Spark version in the first
>>>>>>>>>>>>>>> place (since we have many customizations internally), and we're still
>>>>>>>>>>>>>>> working on upgrading to 3.1.2. If the community drops support for old
>>>>>>>>>>>>>>> versions of Spark 3, users will unavoidably have to maintain it
>>>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So I'm inclined to keep this support in the community rather than
>>>>>>>>>>>>>>> leaving it to users themselves. As for Option 2 or 3, I'm fine with
>>>>>>>>>>>>>>> either. And to relieve the burden, we could support a limited number
>>>>>>>>>>>>>>> of Spark versions (for example, 2 versions).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think 2.4 is a different story: we will continue to support Spark
>>>>>>>>>>>>>>>> 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>>>>>> functionality compared to Spark 3. I believe we discussed option 3
>>>>>>>>>>>>>>>> when we were doing the Spark 3.0 to 3.1 upgrade. Recently we are
>>>>>>>>>>>>>>>> seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we
>>>>>>>>>>>>>>>> need a consistent strategy around this; let's take this chance to make
>>>>>>>>>>>>>>>> a good community guideline for all future engine versions, especially
>>>>>>>>>>>>>>>> for Spark, Flink and Hive, which are in the same repository.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can totally understand your point of view, Wing. In fact, speaking
>>>>>>>>>>>>>>>> from the perspective of AWS EMR, we have to support over 40 versions
>>>>>>>>>>>>>>>> of the software because there are people who are still using Spark
>>>>>>>>>>>>>>>> 1.4, believe it or not. After all, continually backporting changes
>>>>>>>>>>>>>>>> becomes a liability not only on the user side, but also on the service
>>>>>>>>>>>>>>>> provider side, so I believe it's not a bad practice to push for user
>>>>>>>>>>>>>>>> upgrades, as it will make the lives of both parties easier in the end.
>>>>>>>>>>>>>>>> New features are definitely one of the best incentives to promote an
>>>>>>>>>>>>>>>> upgrade on the user side.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think the biggest issue with option 3 is its scalability, because
>>>>>>>>>>>>>>>> we will have an unbounded list of packages to add and compile in the
>>>>>>>>>>>>>>>> future, and we probably cannot drop support for a package once it is
>>>>>>>>>>>>>>>> created. If we go with option 1, I think we can still publish a few
>>>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can control
>>>>>>>>>>>>>>>> the number of patch versions to keep people from abusing the power of
>>>>>>>>>>>>>>>> patching. I see this as a consistent strategy also for Flink and Hive.
>>>>>>>>>>>>>>>> With this strategy, we can truly have a compatibility matrix of engine
>>>>>>>>>>>>>>>> versions against Iceberg versions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>>>>>> wyp...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new
>>>>>>>>>>>>>>>>> DSv2 features in Spark 3.2. I agree that Option 1 is the 
>>>>>>>>>>>>>>>>> easiest for
>>>>>>>>>>>>>>>>> developers, but I don't think it considers the interests of 
>>>>>>>>>>>>>>>>> users. I do not
>>>>>>>>>>>>>>>>> think that most users will upgrade to Spark 3.2 as soon as it 
>>>>>>>>>>>>>>>>> is released.
>>>>>>>>>>>>>>>>> It is a "minor version" upgrade in name from 3.1 (or from 
>>>>>>>>>>>>>>>>> 3.0), but I think
>>>>>>>>>>>>>>>>> we all know that it is not a minor upgrade. There are a lot 
>>>>>>>>>>>>>>>>> of changes from
>>>>>>>>>>>>>>>>> 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot 
>>>>>>>>>>>>>>>>> of users
>>>>>>>>>>>>>>>>> running Spark 2.4 and not even on Spark 3 yet. Do we also 
>>>>>>>>>>>>>>>>> plan to stop
>>>>>>>>>>>>>>>>> supporting Spark 2.4?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same 
>>>>>>>>>>>>>>>>> organization, don't
>>>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, 
>>>>>>>>>>>>>>>>> all internal,
>>>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already 
>>>>>>>>>>>>>>>>> running an
>>>>>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I work for an organization with customers running different versions
>>>>>>>>>>>>>>>>> of Spark. It is true that we could backport new features to older
>>>>>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg
>>>>>>>>>>>>>>>>> work for some organization or other that either uses Iceberg in-house
>>>>>>>>>>>>>>>>> or provides software (possibly in the form of a service) to customers;
>>>>>>>>>>>>>>>>> either way, those organizations have the ability to backport features
>>>>>>>>>>>>>>>>> and fixes to internal versions. Are there any users out there who
>>>>>>>>>>>>>>>>> simply use Apache Iceberg and depend on the community version?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 
>>>>>>>>>>>>>>>>> 3.0/3.1 (and even
>>>>>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1,
>>>>>>>>>>>>>>>>> but I would consider Option 3 too. Anton, you said 5 modules 
>>>>>>>>>>>>>>>>> are required;
>>>>>>>>>>>>>>>>> what are the modules you're thinking of?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <
>>>>>>>>>>>>>>>>> flyrain...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down development. Considering the limited
>>>>>>>>>>>>>>>>>> resources in the open source community, the upsides of options 2 and
>>>>>>>>>>>>>>>>>> 3 are probably not worth it.
>>>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to
>>>>>>>>>>>>>>>>>> predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>>>>>> still get a new feature by backporting it to an older version in case
>>>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to
>>>>>>>>>>>>>>>>>>> upgrade to the most recent minor Spark version to consume 
>>>>>>>>>>>>>>>>>>> any new
>>>>>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core,
>>>>>>>>>>>>>>>>>>> may slow down the development.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>>>>>> Can consume unreleased changes, but it will require at
>>>>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the 
>>>>>>>>>>>>>>>>>>> build and testing
>>>>>>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>>>>>> version (e.g., 3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would
>>>>>>>>>>>>>>>>>>> like to hear what other people think/need.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a
>>>>>>>>>>>>>>>>>>> big fan of having runtime errors for unsupported things 
>>>>>>>>>>>>>>>>>>> based on versions
>>>>>>>>>>>>>>>>>>> and I don't think minor version upgrades are a large issue 
>>>>>>>>>>>>>>>>>>> for users.  I'm
>>>>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces 
>>>>>>>>>>>>>>>>>>> that only exist in
>>>>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few 
>>>>>>>>>>>>>>>>>>> weeks ago, and
>>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code 
>>>>>>>>>>>>>>>>>>> cross reference and
>>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the 
>>>>>>>>>>>>>>>>>>> same repository.
>>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 
>>>>>>>>>>>>>>>>>>> latest versions in a
>>>>>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, 
>>>>>>>>>>>>>>>>>>> it means we have
>>>>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 
>>>>>>>>>>>>>>>>>>> that is being
>>>>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial 
>>>>>>>>>>>>>>>>>>> differences. On top of
>>>>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level 
>>>>>>>>>>>>>>>>>>> and may break not
>>>>>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If there are some features requiring a newer version, it
>>>>>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. 
>>>>>>>>>>>>>>>>>>> Personally, I don’t
>>>>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they 
>>>>>>>>>>>>>>>>>>> want new features.
>>>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this 
>>>>>>>>>>>>>>>>>>> approach too that we
>>>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new 
>>>>>>>>>>>>>>>>>>> features would also
>>>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with 
>>>>>>>>>>>>>>>>>>> that and that
>>>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I 
>>>>>>>>>>>>>>>>>>> want to find a
>>>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use 
>>>>>>>>>>>>>>>>>>> for the users.
>>>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be 
>>>>>>>>>>>>>>>>>>> sufficient but our Spark
>>>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single 
>>>>>>>>>>>>>>>>>>> module.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few 
>>>>>>>>>>>>>>>>>>> weeks ago, and
>>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code 
>>>>>>>>>>>>>>>>>>> cross reference and
>>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the 
>>>>>>>>>>>>>>>>>>> same repository.
>>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 
>>>>>>>>>>>>>>>>>>> latest versions in a
>>>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are 
>>>>>>>>>>>>>>>>>>> unwilling to
>>>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version 
>>>>>>>>>>>>>>>>>>> branches. If
>>>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes 
>>>>>>>>>>>>>>>>>>> sense to move
>>>>>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the
>>>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to 
>>>>>>>>>>>>>>>>>>> all other
>>>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that 
>>>>>>>>>>>>>>>>>>> would slow down
>>>>>>>>>>>>>>>>>>> its development speed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It
>>>>>>>>>>>>>>>>>>>> is great to support older versions but because we compile 
>>>>>>>>>>>>>>>>>>>> against 3.0, we
>>>>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer 
>>>>>>>>>>>>>>>>>>>> versions.
>>>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of important
>>>>>>>>>>>>>>>>>>>> features such as dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too
>>>>>>>>>>>>>>>>>>>> important to ignore.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of 
>>>>>>>>>>>>>>>>>>>> the 3.2 features.
>>>>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us 
>>>>>>>>>>>>>>>>>>>> internally and would
>>>>>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a
>>>>>>>>>>>>>>>>>>>> while by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is 
>>>>>>>>>>>>>>>>>>>> actively
>>>>>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to
>>>>>>>>>>>>>>>>>>>> master could also work with older Spark versions but all 
>>>>>>>>>>>>>>>>>>>> 0.12 releases will
>>>>>>>>>>>>>>>>>>>> only contain bug fixes. Therefore, users will be forced to 
>>>>>>>>>>>>>>>>>>>> migrate to Spark
>>>>>>>>>>>>>>>>>>>> 3.2 to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more
>>>>>>>>>>>>>>>>>>>> work to release, will need a new release of the core 
>>>>>>>>>>>>>>>>>>>> format to consume any
>>>>>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but
>>>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format 
>>>>>>>>>>>>>>>>>>>> more frequently
>>>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) 
>>>>>>>>>>>>>>>>>>>> and the overall
>>>>>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this
>>>>>>>>>>>>>>>>>>>> matter.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
