Since you mentioned Hive, I'll chime in with what we do there. You might find
it useful:
- metastore module - only small differences; DynConstructor solves them for us
- mr module - some bigger differences, but still manageable for Hive 2-3. We
need some new classes, but most of the code is reused, plus an extra module
for Hive 3. For Hive 4 we use a different repo, as we moved into the Hive
codebase.

My thoughts based on the above experience:
- Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
problems with backporting changes between repos and we are lagging behind,
which hurts both projects.
- The Hive 2-3 model works better by forcing us to keep things in sync, but
with serious differences in the Hive project it still doesn't seem like a
viable option.

So I think the question is: how stable is the Spark code we are integrating
with? If it is fairly stable, then we are better off with a "one repo,
multiple modules" approach, and we should consider the multi-repo setup only
if the differences become prohibitive.

Thanks, Peter

On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <aokolnyc...@apple.com.invalid>
wrote:

> Okay, looks like there is consensus around supporting multiple Spark
> versions at the same time. There are folks who mentioned this on this
> thread and there were folks who brought this up during the sync.
>
> Let’s think through Option 2 and 3 in more detail then.
>
> Option 2
>
> In Option 2, there will be a separate repo. I believe the master branch
> will soon point to Spark 3.2 (the most recent supported version). The main
> development will happen there and the artifact version will be 0.1.0. I
> also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where
> we will cherry-pick applicable changes. Once we are ready to release 0.1.0
> Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark
> 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master
> to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for
> cherry-picks.
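>
> Roughly, that would give us a branch layout like this (versions are just
> illustrative):
>
>   master           -> Spark 3.2, version 0.1.0 (later 0.2.0)
>   0.1.x-spark-2    -> cherry-picks for Spark 2.4
>   0.1.x-spark-3.1  -> cherry-picks for Spark 3.1
>   0.1.x-spark-3.2  -> created when we release 0.1.0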
>
> I guess we will continue to shade everything in the new repo and will have
> to release every time the core is released. We will do a maintenance
> release for each supported Spark version whenever we cut a new maintenance
> Iceberg release or need to fix any bugs in the Spark integration.
> Under this model, we will probably need nightly snapshots (or on each
> commit) for the core format and the Spark integration will depend on
> snapshots until we are ready to release.
>
> Overall, I think this option gives us very simple builds and provides the
> best separation. It will keep the main repo clean. The main downside is that
> we will have to split a Spark feature into two PRs: one against the core and
> one against the Spark integration. Certain changes in core can also break
> the Spark integration and will require adaptations.
>
> Ryan, I am not sure I fully understood the testing part. How will we be
> able to test the Spark integration in the main repo if certain changes in
> core may break the Spark integration and require changes there? Will we try
> to prohibit such changes?
>
> Option 3 (modified)
>
> If I understand correctly, the modified Option 3 sounds very close to
> the approach initially suggested by Imran, but with code duplication instead
> of extra refactoring and introducing new common modules.
>
> Jack, are you suggesting we test only a single Spark version at a time? Or
> do we expect to test all versions? Will there be any difference compared to
> just having a module per version? I did not fully understand.
>
> My worry with this approach is that our build will be very complicated and
> we will still have a lot of Spark-related modules in the main repo. Once
> people start using Flink and Hive more, will we have to do the same?
>
> - Anton
>
>
>
> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote:
>
> I'd support the option that Jack suggests if we can set a few expectations
> for keeping it clean.
>
> First, I'd like to avoid refactoring code to share it across Spark
> versions -- that introduces risk because we're relying on compiling against
> one version and running in another and both Spark and Scala change rapidly.
> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
> version. I think we should duplicate code rather than spend time
> refactoring to rely on binary compatibility. I propose we start each new
> Spark version by copying the last one and updating it. And we should build
> just the latest supported version by default.
>
> The drawback to having everything in a single repo is that we wouldn't be
> able to cherry-pick changes across Spark versions/branches, but I think
> Jack is right that having a single build is better.
>
> Second, we should make CI faster by running the Spark builds in parallel.
> It sounds like this is what would happen anyway, with a property that
> selects the Spark version that you want to build against.
>
> Overall, this new suggestion sounds like a promising way forward.
>
> Ryan
>
> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> I think in Ryan's proposal we will create a ton of modules anyway; as
>> Wing listed, we are just using the git branch as an additional dimension. My
>> understanding is that you will still have 1 core, 1 extension, and 1 runtime
>> artifact published for each Spark version in either approach.
>>
>> In that case (and this is just brainstorming), I wonder if we can explore a
>> modified option 3 that flattens all the versions in each Spark branch of
>> option 2 into master. The repository structure would look something like:
>>
>> iceberg/api/...
>>        /bundled-guava/...
>>        /core/...
>>        ...
>>        /spark/2.4/core/...
>>                  /extension/...
>>                  /runtime/...
>>              /3.1/core/...
>>                  /extension/...
>>                  /runtime/...
>>
>> The gradle build script in the root is configured to build against the
>> latest version of Spark by default, unless otherwise specified by the user.
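>>
>> For illustration, a rough settings.gradle.kts sketch of what that selection
>> could look like (the property name and module names below are made up):
>>
>>     // Pick the Spark version via -PsparkVersion=..., defaulting to the
>>     // latest supported one (3.1 in this sketch).
>>     val sparkVersion = providers.gradleProperty("sparkVersion").getOrElse("3.1")
>>
>>     include(":iceberg-api", ":iceberg-core")
>>
>>     // Only wire in the modules for the selected Spark version.
>>     listOf("core", "extension", "runtime").forEach { sub ->
>>         val name = ":iceberg-spark-$sparkVersion-$sub"
>>         include(name)
>>         project(name).projectDir = File(rootDir, "spark/$sparkVersion/$sub")
>>     }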
>>
>> IntelliJ can also be configured to only index files of specific versions,
>> based on the same config used in the build.
>>
>> In this way, I imagine the CI setup would make it much easier to do things
>> like testing version compatibility for a feature, or running only a
>> specific subset of Spark version builds based on the Spark version
>> directories touched.
>>
>> And the biggest benefit is that we don't have the same difficulty as
>> option 2 when developing a feature that touches both core and Spark.
>>
>> We can then develop a mechanism to vote to stop support of certain
>> versions, and archive the corresponding directory to avoid accumulating too
>> many versions in the long term.
>>
>> -Jack Ye
>>
>>
>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>> Iceberg Spark; I just didn't mention it, and I see how that's a big thing to
>>> leave out!
>>>
>>> I would definitely want to test the projects together. One thing we
>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>> if we could have some tighter integration where the Iceberg Spark build can
>>> be included in the Iceberg Java build using properties. Maybe the GitHub
>>> action could check out Iceberg, then check out the Spark integration's
>>> latest branch, and then run the Gradle build with a property that makes
>>> Spark a subproject in the build. That way we can continue to have Spark CI
>>> run regularly.
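>>>
>>> As a rough sketch of that idea (the property name and path below are
>>> hypothetical; this uses a composite build, though wiring the checkout in as
>>> subprojects would work similarly), the core build's settings.gradle.kts
>>> could do something like:
>>>
>>>     // If CI checked out the Spark integration next to this repo, pull it
>>>     // into this build so its tests run against the unreleased core.
>>>     val sparkDir = providers.gradleProperty("sparkIntegrationDir")
>>>     if (sparkDir.isPresent) {
>>>         includeBuild(sparkDir.get())
>>>     }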
>>>
>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I agree that Option 2 is considerably more difficult for development
>>>> when core API changes need to be picked up by the external Spark module. I
>>>> also think a monthly release would probably still be prohibitive to
>>>> actually implementing new features that appear in the API. I would hope we
>>>> have a much faster process, or maybe just have snapshot artifacts published
>>>> nightly?
>>>>
>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID>
>>>> wrote:
>>>>
>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate
>>>> repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
>>>> 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
>>>> versions or all Spark 3 versions, we would need to commit the changes
>>>> to all applicable branches. Basically we are trading more work to commit to
>>>> multiple branches for simplified build and CI time per branch, which might
>>>> be an acceptable trade-off. However, the biggest downside is that changes
>>>> may need to be made in core Iceberg as well as in the engine (in this case
>>>> Spark) support, and we need to wait for a release of core Iceberg to
>>>> consume the changes in the subproject. In this case, maybe we should have a
>>>> monthly release of core Iceberg (no matter how many changes go in, as long
>>>> as it is non-zero) so that the subproject can consume changes fairly
>>>> quickly?
>>>>
>>>>
>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>>>> potential solutions well defined.
>>>>>
>>>>> Looks like the next step is to decide whether we want to require
>>>>> people to update Spark versions to pick up newer versions of Iceberg.
>>>>> If we choose to make people upgrade, then option 1 is clearly the best
>>>>> choice.
>>>>>
>>>>> I don’t think that we should make updating Spark a requirement. Many
>>>>> of the things that we’re working on are orthogonal to Spark versions, like
>>>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>> delete files, new storage implementations, etc. Upgrading Spark is
>>>>> time-consuming and not well trusted in my experience, so I think we would
>>>>> be setting up an unnecessary trade-off between spending lots of time to
>>>>> upgrade Spark and picking up new Iceberg features.
>>>>>
>>>>> Another way of thinking about this is that if we went with option 1,
>>>>> then we could port bug fixes into 0.12.x. But there are many things that
>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>> some people in the community would have to maintain branches of newer
>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>
>>>>> If the community is going to do the work anyway — and I think some of
>>>>> us would — we should make it possible to share that work. That’s why I
>>>>> don’t think that we should go with option 1.
>>>>>
>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>> not something we want to continue.
>>>>>
>>>>> Using multiple modules (option 3) is concerning to me because of the
>>>>> changes in Spark. We currently structure the library to share as much code
>>>>> as possible. But that means compiling against different Spark versions and
>>>>> relying on binary compatibility and reflection in some cases. To me, this
>>>>> seems unmaintainable in the long run because it requires refactoring
>>>>> common classes and spending a lot of time deduplicating code. It also
>>>>> creates a ton of modules, at least one common module, then a module per
>>>>> version, then an extensions module per version, and finally a runtime
>>>>> module per version. That’s 3 modules per Spark version, plus any new
>>>>> common modules. And each module needs to be tested, which is making our CI
>>>>> take a really long time. We also don’t support multiple Scala versions,
>>>>> which is another gap that will require even more modules and tests.
>>>>>
>>>>> I like option 2 because it would allow us to compile against a single
>>>>> version of Spark (which will be much more reliable). It would give us an
>>>>> opportunity to support different Scala versions. It avoids the need to
>>>>> refactor to share code and allows people to focus on a single version of
>>>>> Spark, while also creating a way for people to maintain and update the
>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>> slow down development. I think it would actually speed it up because we’d
>>>>> be spending less time trying to make multiple versions work in the same
>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>> don’t have to care about branches for older Spark versions.
>>>>>
>>>>> Jack makes a good point about wanting to keep code in a single
>>>>> repository, but I think that the need to manage more version combinations
>>>>> overrides this concern. It’s easier to make this decision in python 
>>>>> because
>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>> We’re just trying to build a library.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for bringing this up,  Anton.
>>>>>>
>>>>>> Everyone has great pros/cons to support their preferences. Before
>>>>>> giving my preference, let me raise one question: what is the top priority
>>>>>> for the Apache Iceberg project at this point in time? This question
>>>>>> will help us answer the following: should we support more engine versions
>>>>>> more robustly, or be a bit more aggressive and concentrate on getting the
>>>>>> new features that users need most in order to keep the project more
>>>>>> competitive?
>>>>>>
>>>>>> If people watch the Apache Iceberg project and check the issues &
>>>>>> PRs frequently, I guess more than 90% of people will answer the priority
>>>>>> question the same way: there is no doubt it is making the whole v2 story
>>>>>> production-ready. The current roadmap discussion also proves this
>>>>>> :
>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>> .
>>>>>>
>>>>>> To focus on the highest priority at this point in time, I prefer
>>>>>> option 1 to reduce the cost of engine maintenance, so as to free up
>>>>>> resources to make v2 production-ready.
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> From a dev's point of view, it is less of a burden to always support the
>>>>>>> latest version of Spark (for example). But from a user's point of view,
>>>>>>> especially for those of us who maintain Spark internally, it is not easy to
>>>>>>> upgrade the Spark version for the first time (since we have many
>>>>>>> customizations internally), and we're still working on upgrading to 3.1.2.
>>>>>>> If the community drops support for older versions of Spark 3, users will
>>>>>>> unavoidably have to maintain it themselves.
>>>>>>>
>>>>>>> So I'm inclined to keep this support in the community rather than leave
>>>>>>> it to users themselves. As for Option 2 or 3, I'm fine with either. And to
>>>>>>> relieve the burden, we could support a limited number of Spark versions
>>>>>>> (for example, 2 versions).
>>>>>>>
>>>>>>> Just my two cents.
>>>>>>>
>>>>>>> -Saisai
>>>>>>>
>>>>>>>
>>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道:
>>>>>>>
>>>>>>>> Hi Wing Yew,
>>>>>>>>
>>>>>>>> I think 2.4 is a different story; we will continue to support Spark
>>>>>>>> 2.4, but as you can see it will continue to have very limited
>>>>>>>> functionality compared to Spark 3. I believe we discussed option 3 when we
>>>>>>>> were doing the Spark 3.0 to 3.1 upgrade. Recently we have been seeing the
>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>> consistent strategy around this, so let's take this chance to make a good
>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>> Flink and Hive, which are in the same repository.
>>>>>>>>
>>>>>>>> I can totally understand your point of view, Wing. In fact, speaking
>>>>>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>>>>>> software because there are people who are still using Spark 1.4, believe it
>>>>>>>> or not. After all, continuing to backport changes becomes a liability not
>>>>>>>> only on the user side, but also on the service provider side, so I believe
>>>>>>>> it's not a bad practice to push users to upgrade, as it will make the lives
>>>>>>>> of both parties easier in the end. New features are definitely one of the
>>>>>>>> best incentives to promote an upgrade on the user side.
>>>>>>>>
>>>>>>>> I think the biggest issue with option 3 is its scalability,
>>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>>> the future, and we probably cannot drop support for a package once it is
>>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>>> versions for old Iceberg releases, and committers can control the number of
>>>>>>>> patch versions to keep people from abusing the power of patching. I see
>>>>>>>> this as a consistent strategy for Flink and Hive as well. With this
>>>>>>>> strategy, we can truly have a compatibility matrix of engine versions
>>>>>>>> against Iceberg versions.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>> wyp...@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for
>>>>>>>>> developers, but I don't think it considers the interests of users. I do
>>>>>>>>> not think that most users will upgrade to Spark 3.2 as soon as it is
>>>>>>>>> released. It is a "minor version" upgrade in name from 3.1 (or from 3.0),
>>>>>>>>> but I think we all know that it is not a minor upgrade. There are a lot
>>>>>>>>> of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a
>>>>>>>>> lot of users running Spark 2.4 and not even on Spark 3 yet. Do we also
>>>>>>>>> plan to stop supporting Spark 2.4?
>>>>>>>>>
>>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken
>>>>>>>>> out in favor of Option 1 all work for the same organization, don't they?
>>>>>>>>> And they don't have a problem with making their users, all internal,
>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>
>>>>>>>>> I work for an organization with customers running different
>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg
>>>>>>>>> work for some organization or other that either uses Iceberg in-house or
>>>>>>>>> provides software (possibly in the form of a service) to customers, and
>>>>>>>>> either way, the organizations have the ability to backport features and
>>>>>>>>> fixes to internal versions. Are there any users out there who simply use
>>>>>>>>> Apache Iceberg and depend on the community version?
>>>>>>>>>
>>>>>>>>> There may be features that are broadly useful that do not depend
>>>>>>>>> on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>>
>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>
>>>>>>>>> - Wing Yew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>
>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>>>>> limited resources in the open source community, the upsides of options 2
>>>>>>>>>> and 3 are probably not worth it.
>>>>>>>>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to
>>>>>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>>>>>> get a new feature by backporting it to an older version in case upgrading
>>>>>>>>>> to a newer version isn't an option.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Yufei
>>>>>>>>>>
>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>>
>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to
>>>>>>>>>>> the most recent minor Spark version to consume any new Iceberg
>>>>>>>>>>> features.
>>>>>>>>>>>
>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>
>>>>>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>>>>>> still separate as we can use separate branches.
>>>>>>>>>>> Impossible to consume any unreleased changes in core, may slow
>>>>>>>>>>> down the development.
>>>>>>>>>>>
>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>
>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>> Can consume unreleased changes, but it will require at least 5
>>>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>> complicated.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>>>>> (e.g. 3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>> We follow Option 1 internally at the moment but I would like to
>>>>>>>>>>> hear what other people think/need.
>>>>>>>>>>>
>>>>>>>>>>> - Anton
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>>>>>> having runtime errors for unsupported things based on versions, and I
>>>>>>>>>>> don't think minor version upgrades are a large issue for users. I'm
>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist
>>>>>>>>>>> in Spark 3.2 in a multiple-Spark-version-support future.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>> separating the python module outside of the project a few weeks 
>>>>>>>>>>> ago, and
>>>>>>>>>>> decided to not do that because it's beneficial for code cross 
>>>>>>>>>>> reference and
>>>>>>>>>>> more intuitive for new developers to see everything in the same 
>>>>>>>>>>> repository.
>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>>>
>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions 
>>>>>>>>>>> in a
>>>>>>>>>>> major version.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is when it gets a bit complicated. If we want to support
>>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2, which is being
>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>> that, we have our extensions, which are extremely low-level and may break
>>>>>>>>>>> not only between minor versions but also between patch releases.
>>>>>>>>>>>
>>>>>>>>>>> If there are some features requiring a newer version, it makes
>>>>>>>>>>> sense to move to that newer version in master.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I
>>>>>>>>>>> don’t think it is too bad to require users to upgrade if they want new
>>>>>>>>>>> features. At the same time, there are valid concerns with this approach
>>>>>>>>>>> too that we mentioned during the sync. For example, certain new features
>>>>>>>>>>> would also work fine with older Spark versions. I generally agree with
>>>>>>>>>>> that and that not supporting recent versions is not ideal. However, I
>>>>>>>>>>> want to find a balance between the complexity on our side and ease of use
>>>>>>>>>>> for the users. Ideally, supporting a few recent versions would be
>>>>>>>>>>> sufficient but our Spark integration is too low-level to do that with a
>>>>>>>>>>> single module.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>> separating the python module outside of the project a few weeks 
>>>>>>>>>>> ago, and
>>>>>>>>>>> decided to not do that because it's beneficial for code cross 
>>>>>>>>>>> reference and
>>>>>>>>>>> more intuitive for new developers to see everything in the same 
>>>>>>>>>>> repository.
>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>
>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>> minor versions, but instead support maybe just the 2-3 latest versions
>>>>>>>>>>> in a major version. This avoids the problem that some users are
>>>>>>>>>>> unwilling to move to a newer version and keep patching old Spark version
>>>>>>>>>>> branches. If there are some features requiring a newer version, it makes
>>>>>>>>>>> sense to move to that newer version in master.
>>>>>>>>>>>
>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>> feature-complete reference implementation compared to all other 
>>>>>>>>>>> engines, I
>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>> development speed.
>>>>>>>>>>>
>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>
>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great
>>>>>>>>>>>> to support older versions but because we compile against 3.0, we 
>>>>>>>>>>>> cannot use
>>>>>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>> important features such as dynamic filtering for v2 tables, required
>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too
>>>>>>>>>>>> important to ignore.
>>>>>>>>>>>>
>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 
>>>>>>>>>>>> features.
>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and 
>>>>>>>>>>>> would
>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>
>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>
>>>>>>>>>>>> Option 1
>>>>>>>>>>>>
>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>>>
>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra
>>>>>>>>>>>> work on our side as just a single Spark version is actively 
>>>>>>>>>>>> maintained.
>>>>>>>>>>>> Cons: some new features that we will be adding to master could
>>>>>>>>>>>> also work with older Spark versions but all 0.12 releases will 
>>>>>>>>>>>> only contain
>>>>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 
>>>>>>>>>>>> to
>>>>>>>>>>>> consume any new Spark or format features.
>>>>>>>>>>>>
>>>>>>>>>>>> Option 2
>>>>>>>>>>>>
>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>
>>>>>>>>>>>> Pros: decouples the format version from Spark, we can support
>>>>>>>>>>>> as many Spark versions as needed.
>>>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>>>> release, will need a new release of the core format to consume any 
>>>>>>>>>>>> changes
>>>>>>>>>>>> in the Spark integration.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>>>>>> worry is that we will have to release the format more frequently 
>>>>>>>>>>>> (which is
>>>>>>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>>>>>>> development may be slower.
>>>>>>>>>>>>
>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anton
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>
>
