Hi everyone,

I tried to prototype option 3, here is the PR:
https://github.com/apache/iceberg/pull/3237

Sorry, I did not see that Anton was planning to do it, but it's just a
draft anyway, so feel free to use it as a reference.

Best,
Jack Ye

On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue <b...@tabular.io> wrote:

> Thanks for the context on the Flink side! I think it sounds reasonable to
> keep up to date with the latest supported Flink version. If we want, we
> could later go with something similar to what we do for Spark but we’ll see
> how it goes and what the Flink community needs. We should probably add a
> section to our Flink docs that explains and links to Flink’s support policy
> and has a table of Iceberg versions that work with Flink versions. (We
> should probably have the same table for Spark, too!)
>
> For Spark, I’m also leaning toward the modified option 3 where we keep all
> of the code in the main repository but only build with one module at a time
> by default. It makes sense to switch based on modules — rather than
> selecting src paths within a module — so that it is easy to run a build
> with all modules if you choose to — for example, when building release
> binaries.
>
> The reason I think we should go with option 3 is for testing. If we have a
> single repo with api, core, etc. and spark then changes to the common
> modules can be tested by CI actions. Updates to individual Spark modules
> would be completely independent. There is a slight inconvenience that when
> an API used by Spark changes, the author would still need to fix multiple
> Spark versions. But the trade-off is that with a separate repository like
> option 2, changes that break Spark versions are not caught and then the
> Spark repository’s CI ends up failing on completely unrelated changes. That
> would be a major pain, felt by everyone contributing to the Spark
> integration, so I think option 3 is the best path forward.
>
> It sounds like we probably have some agreement now, but please speak up if
> you think another option would be better.
>
> The next step is to prototype the build changes to test out option 3. Or
> if you prefer option 2, then prototype those changes as well. I think that
> Anton is planning to do this, but if you have time and the desire to do it
> please reach out and coordinate with us!
>
> Ryan
>
> On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Wing, sorry, my earlier message probably misled you. I was speaking my
>> personal opinion on Flink version support.
>>
>> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi OpenInx,
>>> I'm sorry I misunderstood the thinking of the Flink community. Thanks
>>> for the clarification.
>>> - Wing Yew
>>>
>>>
>>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote:
>>>
>>>> Hi Wing
>>>>
>>>> As we discussed above, the community prefers option 2 or option 3. So when
>>>> we planned to upgrade the Flink version from 1.12 to 1.13, we did our best
>>>> to guarantee that the master Iceberg repo works fine with both Flink 1.12
>>>> and Flink 1.13. For more context, please see [1], [2], [3]
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/3116
>>>> [2] https://github.com/apache/iceberg/issues/3183
>>>> [3]
>>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>>>
>>>>
>>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon
>>>> <wyp...@cloudera.com.invalid> wrote:
>>>>
>>>>> In the last community sync, we spent a little time on this topic. For
>>>>> Spark support, there are currently two options under consideration:
>>>>>
>>>>> Option 2: Separate repo for the Spark support. Use branches for
>>>>> supporting different Spark versions. Main branch for the latest Spark
>>>>> version (3.2 to begin with).
>>>>> Tooling needs to be built for producing regular snapshots of core
>>>>> Iceberg in a consumable way for this repo. Unclear if commits to core
>>>>> Iceberg will be tested pre-commit against Spark support; my impression is
>>>>> that they will not be, and the Spark support build can be broken by 
>>>>> changes
>>>>> to core.
>>>>>
>>>>> A variant of option 3 (which we will simply call Option 3 going
>>>>> forward): Single repo, separate module (subdirectory) for each Spark
>>>>> version to be supported. Code duplication in each Spark module (no attempt
>>>>> to refactor out common code). Each module built against the specific
>>>>> version of Spark to be supported, producing a runtime jar built against
>>>>> that version. CI will test all modules. Support can be provided for only
>>>>> building the modules a developer cares about.
>>>>>
>>>>> More input was sought and people are encouraged to voice their
>>>>> preference.
>>>>> I lean towards Option 3.
>>>>>
>>>>> - Wing Yew
>>>>>
>>>>> ps. In the sync, as Steven Wu wrote, the question was raised whether the
>>>>> same multi-version support strategy can be adopted across engines. Based on
>>>>> what Steven wrote, the Flink developer community's bandwidth currently makes
>>>>> supporting only a single Flink version (and focusing resources on developing
>>>>> new features for that version) the preferred choice. If so, then no
>>>>> multi-version support strategy for Flink is needed at this time.
>>>>>
>>>>>
>>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> During the sync meeting, people talked about if and how we can have
>>>>>> the same version support model across engines like Flink and Spark. I can
>>>>>> provide some input from the Flink side.
>>>>>>
>>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>>>>> the latest released version, which means only Flink 1.12 and 1.13 are
>>>>>> supported. Feature changes and bug fixes are only backported to 1.12 and
>>>>>> 1.13; older versions only receive fixes for serious bugs (like security
>>>>>> issues). With that context, I personally like option 1 (with one actively
>>>>>> supported Flink version in the master branch) for the iceberg-flink module.
>>>>>>
>>>>>> We discussed the idea of supporting multiple Flink versions via a shim
>>>>>> layer and multiple modules. While it may be a little better to support
>>>>>> multiple Flink versions, I don't know if there is enough support and
>>>>>> resources from the community to pull it off. There is also the ongoing
>>>>>> maintenance burden for each minor version release from Flink, which
>>>>>> happens roughly every 4 months.
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary
>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Since you mentioned Hive, I chime in with what we do there. You
>>>>>>> might find it useful:
>>>>>>> - metastore module - only small differences - DynConstructors solves them
>>>>>>> for us
>>>>>>> - mr module - some bigger differences, but still manageable for Hive 2-3:
>>>>>>> we need some new classes, but most of the code is reused, plus an extra
>>>>>>> module for Hive 3. For Hive 4 we use a different repo, as we moved to the
>>>>>>> Hive codebase.
>>>>>>>
>>>>>>> My thoughts based on the above experience:
>>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>>>>> have problems with backporting changes between repos and we are lagging
>>>>>>> behind, which hurts both projects.
>>>>>>> - The Hive 2-3 model works better by forcing us to keep things in sync,
>>>>>>> but with the serious differences in the Hive project it still doesn't
>>>>>>> seem like a viable option.
>>>>>>>
>>>>>>> So I think the question is: how stable is the Spark code we are
>>>>>>> integrating with? If it is fairly stable then we are better off with a
>>>>>>> "one repo, multiple modules" approach and should consider the multi-repo
>>>>>>> only if the differences become prohibitive.
>>>>>>>
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Okay, looks like there is consensus around supporting multiple
>>>>>>>> Spark versions at the same time. There are folks who mentioned this on 
>>>>>>>> this
>>>>>>>> thread and there were folks who brought this up during the sync.
>>>>>>>>
>>>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>>>
>>>>>>>> Option 2
>>>>>>>>
>>>>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>>>>> branch will soon point to Spark 3.2 (the most recent supported 
>>>>>>>> version).
>>>>>>>> The main development will happen there and the artifact version will be
>>>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>>>>> branches where we will cherry-pick applicable changes. Once we are 
>>>>>>>> ready to
>>>>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and 
>>>>>>>> cut 3
>>>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 
>>>>>>>> 0.2.x-spark-3.1
>>>>>>>> branches for cherry-picks.
>>>>>>>>
>>>>>>>> I guess we will continue to shade everything in the new repo and
>>>>>>>> will have to release every time the core is released. We will do a
>>>>>>>> maintenance release for each supported Spark version whenever we cut a 
>>>>>>>> new
>>>>>>>> maintenance Iceberg release or need to fix any bugs in the Spark
>>>>>>>> integration.
>>>>>>>> Under this model, we will probably need nightly snapshots (or on
>>>>>>>> each commit) for the core format and the Spark integration will depend 
>>>>>>>> on
>>>>>>>> snapshots until we are ready to release.
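>>>>>>>>
>>>>>>>> As a rough sketch of what consuming those snapshots could look like (the
>>>>>>>> version and repository below are only placeholders, nothing here is
>>>>>>>> decided), the Spark repo's build.gradle.kts might have:
>>>>>>>>
>>>>>>>>   // build.gradle.kts in the separate Spark integration repo (sketch only)
>>>>>>>>   plugins {
>>>>>>>>       `java-library`
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   repositories {
>>>>>>>>       mavenCentral()
>>>>>>>>       // nightly / per-commit snapshots of the core format
>>>>>>>>       maven { url = uri("https://repository.apache.org/content/repositories/snapshots/") }
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   dependencies {
>>>>>>>>       // illustrative snapshot version, not a published artifact
>>>>>>>>       implementation("org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT")
>>>>>>>>   }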
>>>>>>>>
>>>>>>>> Overall, I think this option gives us very simple builds and provides the
>>>>>>>> best separation. It will keep the main repo clean. The main downside is
>>>>>>>> that we will have to split a Spark feature into two PRs: one against the
>>>>>>>> core and one against the Spark integration. Certain changes in core can
>>>>>>>> also break the Spark integration and will require adaptations.
>>>>>>>>
>>>>>>>> Ryan, I am not sure I fully understood the testing part. How will
>>>>>>>> we be able to test the Spark integration in the main repo if certain
>>>>>>>> changes in core may break the Spark integration and require changes 
>>>>>>>> there?
>>>>>>>> Will we try to prohibit such changes?
>>>>>>>>
>>>>>>>> Option 3 (modified)
>>>>>>>>
>>>>>>>> If I understand correctly, the modified Option 3 sounds very close to
>>>>>>>> the approach initially suggested by Imran, but with code duplication
>>>>>>>> instead of extra refactoring and introducing new common modules.
>>>>>>>>
>>>>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>>>>> time? Or do we expect to test all versions? Will there be any 
>>>>>>>> difference
>>>>>>>> compared to just having a module per version? I did not fully
>>>>>>>> understand.
>>>>>>>>
>>>>>>>> My worry with this approach is that our build will be very
>>>>>>>> complicated and we will still have a lot of Spark-related modules in 
>>>>>>>> the
>>>>>>>> main repo. Once people start using Flink and Hive more, will we have 
>>>>>>>> to do
>>>>>>>> the same?
>>>>>>>>
>>>>>>>> - Anton
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>> I'd support the option that Jack suggests if we can set a few
>>>>>>>> expectations for keeping it clean.
>>>>>>>>
>>>>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>>>>> versions -- that introduces risk because we're relying on compiling 
>>>>>>>> against
>>>>>>>> one version and running in another and both Spark and Scala change 
>>>>>>>> rapidly.
>>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one 
>>>>>>>> Spark
>>>>>>>> version. I think we should duplicate code rather than spend time
>>>>>>>> refactoring to rely on binary compatibility. I propose we start each 
>>>>>>>> new
>>>>>>>> Spark version by copying the last one and updating it. And we should 
>>>>>>>> build
>>>>>>>> just the latest supported version by default.
>>>>>>>>
>>>>>>>> The drawback to having everything in a single repo is that we
>>>>>>>> wouldn't be able to cherry-pick changes across Spark 
>>>>>>>> versions/branches, but
>>>>>>>> I think Jack is right that having a single build is better.
>>>>>>>>
>>>>>>>> Second, we should make CI faster by running the Spark builds in
>>>>>>>> parallel. It sounds like this is what would happen anyway, with a 
>>>>>>>> property
>>>>>>>> that selects the Spark version that you want to build against.
>>>>>>>>
>>>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway,
>>>>>>>>> as Wing listed we are just using git branch as an additional 
>>>>>>>>> dimension, but
>>>>>>>>> my understanding is that you will still have 1 core, 1 extension, 1 
>>>>>>>>> runtime
>>>>>>>>> artifact published for each Spark version in either approach.
>>>>>>>>>
>>>>>>>>> In that case, this is just brainstorming, I wonder if we can
>>>>>>>>> explore a modified option 3 that flattens all the versions in each 
>>>>>>>>> Spark
>>>>>>>>> branch in option 2 into master. The repository structure would look
>>>>>>>>> something like:
>>>>>>>>>
>>>>>>>>> iceberg/api/...
>>>>>>>>>             /bundled-guava/...
>>>>>>>>>             /core/...
>>>>>>>>>             ...
>>>>>>>>>             /spark/2.4/core/...
>>>>>>>>>                             /extension/...
>>>>>>>>>                             /runtime/...
>>>>>>>>>                       /3.1/core/...
>>>>>>>>>                             /extension/...
>>>>>>>>>                             /runtime/...
>>>>>>>>>
>>>>>>>>> The gradle build script in the root is configured to build against
>>>>>>>>> the latest version of Spark by default, unless otherwise specified by 
>>>>>>>>> the
>>>>>>>>> user.
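>>>>>>>>>
>>>>>>>>> For illustration, a minimal settings.gradle.kts sketch of that switch
>>>>>>>>> could look something like this (the property name, default and module
>>>>>>>>> paths are only placeholders, not a final layout):
>>>>>>>>>
>>>>>>>>>   // settings.gradle.kts (sketch) - pick Spark modules by property
>>>>>>>>>   rootProject.name = "iceberg"
>>>>>>>>>   include("api", "bundled-guava", "core")  // ... other common modules
>>>>>>>>>
>>>>>>>>>   // e.g. ./gradlew build -PsparkVersions=2.4,3.1,3.2 builds every Spark version
>>>>>>>>>   val sparkVersions =
>>>>>>>>>       (startParameter.projectProperties["sparkVersions"] ?: "3.2").split(",")
>>>>>>>>>   for (v in sparkVersions) {
>>>>>>>>>       include("spark:$v:core", "spark:$v:extension", "spark:$v:runtime")
>>>>>>>>>   }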
>>>>>>>>>
>>>>>>>>> Intellij can also be configured to only index files of specific
>>>>>>>>> versions based on the same config used in build.
>>>>>>>>>
>>>>>>>>> In this way, I imagine the CI setup to be much easier to do things
>>>>>>>>> like testing version compatibility for a feature or running only a
>>>>>>>>> specific subset of Spark version builds based on the Spark version
>>>>>>>>> directories touched.
>>>>>>>>>
>>>>>>>>> And the biggest benefit is that we don't have the same difficulty
>>>>>>>>> as option 2 of developing a feature when it's both in core and Spark.
>>>>>>>>>
>>>>>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>>>>>> versions, and archive the corresponding directory to avoid 
>>>>>>>>> accumulating too
>>>>>>>>> many versions in the long term.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java
>>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a 
>>>>>>>>>> big
>>>>>>>>>> thing to leave out!
>>>>>>>>>>
>>>>>>>>>> I would definitely want to test the projects together. One thing
>>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also
>>>>>>>>>> wondering if we could have some tighter integration where the 
>>>>>>>>>> Iceberg Spark
>>>>>>>>>> build can be included in the Iceberg Java build using properties. 
>>>>>>>>>> Maybe the
>>>>>>>>>> github action could checkout Iceberg, then checkout the Spark
>>>>>>>>>> integration's latest branch, and then run the gradle build with a 
>>>>>>>>>> property
>>>>>>>>>> that makes Spark a subproject in the build. That way we can continue 
>>>>>>>>>> to
>>>>>>>>>> have Spark CI run regularly.
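>>>>>>>>>>
>>>>>>>>>> To sketch the idea (purely illustrative; the property name and checkout
>>>>>>>>>> location are just placeholders), the main repo's settings.gradle.kts
>>>>>>>>>> could pull the Spark build in via Gradle's composite builds:
>>>>>>>>>>
>>>>>>>>>>   // settings.gradle.kts in the main Iceberg repo (sketch only)
>>>>>>>>>>   // CI would check out the Spark integration repo next to this one and
>>>>>>>>>>   // pass -PincludeSparkBuild so its projects join this build for testing.
>>>>>>>>>>   if (startParameter.projectProperties.containsKey("includeSparkBuild")) {
>>>>>>>>>>       includeBuild("../iceberg-spark")  // an included build, approximating a subproject
>>>>>>>>>>   }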
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree that Option 2 is considerably more difficult for development
>>>>>>>>>>> when core API changes need to be picked up by the external Spark module.
>>>>>>>>>>> I also think a monthly release would probably still be prohibitive to
>>>>>>>>>>> actually implementing new features that appear in the API; I would hope
>>>>>>>>>>> we could have a much faster process, or maybe just have snapshot
>>>>>>>>>>> artifacts published nightly?
>>>>>>>>>>>
>>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>>>>> wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such 
>>>>>>>>>>> as
>>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would 
>>>>>>>>>>> need to
>>>>>>>>>>> commit the changes to all applicable branches. Basically we are 
>>>>>>>>>>> trading
>>>>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, 
>>>>>>>>>>> the
>>>>>>>>>>> biggest downside is that changes may need to be made in core 
>>>>>>>>>>> Iceberg as
>>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to 
>>>>>>>>>>> wait for
>>>>>>>>>>> a release of core Iceberg to consume the changes in the subproject. 
>>>>>>>>>>> In this
>>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no 
>>>>>>>>>>> matter how
>>>>>>>>>>> many changes go in, as long as it is non-zero) so that the 
>>>>>>>>>>> subproject can
>>>>>>>>>>> consume changes fairly quickly?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the
>>>>>>>>>>>> set of potential solutions well defined.
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like the next step is to decide whether we want to
>>>>>>>>>>>> require people to update Spark versions to pick up newer versions 
>>>>>>>>>>>> of
>>>>>>>>>>>> Iceberg. If we choose to make people upgrade, then option 1 is 
>>>>>>>>>>>> clearly the
>>>>>>>>>>>> best choice.
>>>>>>>>>>>>
>>>>>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark 
>>>>>>>>>>>> versions,
>>>>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, 
>>>>>>>>>>>> views, ORC
>>>>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is 
>>>>>>>>>>>> time
>>>>>>>>>>>> consuming and untrusted in my experience, so I think we would be 
>>>>>>>>>>>> setting up
>>>>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade 
>>>>>>>>>>>> Spark and
>>>>>>>>>>>> picking up new Iceberg features.
>>>>>>>>>>>>
>>>>>>>>>>>> Another way of thinking about this is that if we went with
>>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are 
>>>>>>>>>>>> many
>>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO 
>>>>>>>>>>>> implementation
>>>>>>>>>>>> for ADLS. So some people in the community would have to maintain 
>>>>>>>>>>>> branches
>>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of 
>>>>>>>>>>>> the main
>>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things 
>>>>>>>>>>>> with
>>>>>>>>>>>> option 1 because we would then have more people maintaining the 
>>>>>>>>>>>> same 0.13.x
>>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, 
>>>>>>>>>>>> where we
>>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the 
>>>>>>>>>>>> community
>>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, 
>>>>>>>>>>>> Tencent,
>>>>>>>>>>>> Apple, etc.)
>>>>>>>>>>>>
>>>>>>>>>>>> If the community is going to do the work anyway — and I think
>>>>>>>>>>>> some of us would — we should make it possible to share that work. 
>>>>>>>>>>>> That’s
>>>>>>>>>>>> why I don’t think that we should go with option 1.
>>>>>>>>>>>>
>>>>>>>>>>>> If we don’t go with option 1, then the choice is how to
>>>>>>>>>>>> maintain multiple Spark versions. I think that the way we’re doing 
>>>>>>>>>>>> it right
>>>>>>>>>>>> now is not something we want to continue.
>>>>>>>>>>>>
>>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because
>>>>>>>>>>>> of the changes in Spark. We currently structure the library to 
>>>>>>>>>>>> share as
>>>>>>>>>>>> much code as possible. But that means compiling against different 
>>>>>>>>>>>> Spark
>>>>>>>>>>>> versions and relying on binary compatibility and reflection in 
>>>>>>>>>>>> some cases.
>>>>>>>>>>>> To me, this seems unmaintainable in the long run because it 
>>>>>>>>>>>> requires
>>>>>>>>>>>> refactoring common classes and spending a lot of time 
>>>>>>>>>>>> deduplicating code.
>>>>>>>>>>>> It also creates a ton of modules, at least one common module, then 
>>>>>>>>>>>> a module
>>>>>>>>>>>> per version, then an extensions module per version, and finally a 
>>>>>>>>>>>> runtime
>>>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any 
>>>>>>>>>>>> new common
>>>>>>>>>>>> modules. And each module needs to be tested, which is making our 
>>>>>>>>>>>> CI take a
>>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, 
>>>>>>>>>>>> which is
>>>>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>>>>
>>>>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>>>>> single version of Spark (which will be much more reliable). It 
>>>>>>>>>>>> would give
>>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids 
>>>>>>>>>>>> the need
>>>>>>>>>>>> to refactor to share code and allows people to focus on a single 
>>>>>>>>>>>> version of
>>>>>>>>>>>> Spark, while also creating a way for people to maintain and update 
>>>>>>>>>>>> the
>>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that 
>>>>>>>>>>>> this would
>>>>>>>>>>>> slow down development. I think it would actually speed it up 
>>>>>>>>>>>> because we’d
>>>>>>>>>>>> be spending less time trying to make multiple versions work in the 
>>>>>>>>>>>> same
>>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option 
>>>>>>>>>>>> 1: you
>>>>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>>>>
>>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>>>>> repository, but I think that the need to manage more version 
>>>>>>>>>>>> combinations
>>>>>>>>>>>> overrides this concern. It’s easier to make this decision in 
>>>>>>>>>>>> python because
>>>>>>>>>>>> we’re not trying to depend on two projects that change relatively 
>>>>>>>>>>>> quickly.
>>>>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Everyone has laid out good pros and cons to support their preferences.
>>>>>>>>>>>>> Before giving mine, let me raise one question: what is the top-priority
>>>>>>>>>>>>> item for the Apache Iceberg project at this point in time? Answering
>>>>>>>>>>>>> that will help us answer the follow-up question: should we support more
>>>>>>>>>>>>> engine versions more robustly, or be a bit more aggressive and
>>>>>>>>>>>>> concentrate on getting out the new features that users need most in
>>>>>>>>>>>>> order to keep the project competitive?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If people watch the Apache Iceberg project and check the issues & PRs
>>>>>>>>>>>>> frequently, I guess more than 90% of them will answer the priority
>>>>>>>>>>>>> question the same way: there is no doubt it is making the whole v2
>>>>>>>>>>>>> story production-ready. The current roadmap discussion also proves the
>>>>>>>>>>>>> point:
>>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> To keep the focus on that top priority, I prefer option 1: it reduces
>>>>>>>>>>>>> the cost of engine maintenance and frees up resources to make v2
>>>>>>>>>>>>> production-ready.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>>>>> sai.sai.s...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From the dev's point of view, it is less of a burden to always support
>>>>>>>>>>>>>> only the latest version of Spark (for example). But from the user's
>>>>>>>>>>>>>> point of view, especially for those of us who maintain Spark internally,
>>>>>>>>>>>>>> it is not easy to upgrade the Spark version in the first place (since we
>>>>>>>>>>>>>> have many internal customizations), and we're still working on upgrading
>>>>>>>>>>>>>> to 3.1.2. If the community drops support for old versions of Spark 3,
>>>>>>>>>>>>>> users will unavoidably have to maintain it themselves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I'm inclined to keep this support in the community rather than
>>>>>>>>>>>>>> leaving it to users themselves; as for Option 2 or 3, I'm fine with
>>>>>>>>>>>>>> either. And to relieve the burden, we could support a limited number of
>>>>>>>>>>>>>> Spark versions (for example, 2 versions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think 2.4 is a different story: we will continue to support Spark
>>>>>>>>>>>>>>> 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>>>>> functionality compared to Spark 3. I believe we discussed option 3 when
>>>>>>>>>>>>>>> we were doing the Spark 3.0 to 3.1 upgrade. Recently we have been seeing
>>>>>>>>>>>>>>> the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>>>>>> consistent strategy around this, so let's take this chance to set a good
>>>>>>>>>>>>>>> community guideline for all future engine versions, especially for
>>>>>>>>>>>>>>> Spark, Flink and Hive, which live in the same repository.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can totally understand your point of view, Wing. In fact, speaking
>>>>>>>>>>>>>>> from the perspective of AWS EMR, we have to support over 40 versions of
>>>>>>>>>>>>>>> the software because there are people who are still using Spark 1.4,
>>>>>>>>>>>>>>> believe it or not. In the end, continually backporting changes becomes a
>>>>>>>>>>>>>>> liability not only on the user side but also on the service provider
>>>>>>>>>>>>>>> side, so I believe it's not a bad practice to push for user upgrades, as
>>>>>>>>>>>>>>> it makes life easier for both parties. New features are definitely one
>>>>>>>>>>>>>>> of the best incentives to get users to upgrade.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the biggest issue with option 3 is its scalability, because we
>>>>>>>>>>>>>>> will have an unbounded list of packages to add and compile in the
>>>>>>>>>>>>>>> future, and we probably cannot drop support for a package once it is
>>>>>>>>>>>>>>> created. If we go with option 1, I think we can still publish a few
>>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can control the
>>>>>>>>>>>>>>> number of patch versions to keep people from abusing the power of
>>>>>>>>>>>>>>> patching. I see this as a consistent strategy for Flink and Hive as
>>>>>>>>>>>>>>> well. With this strategy, we can truly have a compatibility matrix of
>>>>>>>>>>>>>>> engine versions against Iceberg versions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>>>>> wyp...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest 
>>>>>>>>>>>>>>>> for developers,
>>>>>>>>>>>>>>>> but I don't think it considers the interests of users. I do 
>>>>>>>>>>>>>>>> not think that
>>>>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is 
>>>>>>>>>>>>>>>> released. It is a
>>>>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I 
>>>>>>>>>>>>>>>> think we all
>>>>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of 
>>>>>>>>>>>>>>>> changes from 3.0 to
>>>>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users 
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop 
>>>>>>>>>>>>>>>> supporting
>>>>>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same 
>>>>>>>>>>>>>>>> organization, don't
>>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, 
>>>>>>>>>>>>>>>> all internal,
>>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already 
>>>>>>>>>>>>>>>> running an
>>>>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>>>>>> versions of Spark. It is true that we can backport new 
>>>>>>>>>>>>>>>> features to older
>>>>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to 
>>>>>>>>>>>>>>>> Iceberg work
>>>>>>>>>>>>>>>> for some organization or other that either use Iceberg 
>>>>>>>>>>>>>>>> in-house, or provide
>>>>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and 
>>>>>>>>>>>>>>>> either way,
>>>>>>>>>>>>>>>> the organizations have the ability to backport features and 
>>>>>>>>>>>>>>>> fixes to
>>>>>>>>>>>>>>>> internal versions. Are there any users out there who simply 
>>>>>>>>>>>>>>>> use Apache
>>>>>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 
>>>>>>>>>>>>>>>> 3.0/3.1 (and even
>>>>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1,
>>>>>>>>>>>>>>>> but I would consider Option 3 too. Anton, you said 5 modules 
>>>>>>>>>>>>>>>> are required;
>>>>>>>>>>>>>>>> what are the modules you're thinking of?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <
>>>>>>>>>>>>>>>> flyrain...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down development. Considering the limited
>>>>>>>>>>>>>>>>> resources in the open source community, the upsides of options 2 and 3
>>>>>>>>>>>>>>>>> are probably not worth it.
>>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to
>>>>>>>>>>>>>>>>> predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>>>>> still get a new feature by backporting it to an older version in case
>>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to
>>>>>>>>>>>>>>>>>> upgrade to the most recent minor Spark version to consume 
>>>>>>>>>>>>>>>>>> any new
>>>>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may
>>>>>>>>>>>>>>>>>> slow down the development.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>>>>> Can consume unreleased changes, but it will require at
>>>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the build and
>>>>>>>>>>>>>>>>>> testing complicated.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>>>>> version (e.g. 3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would
>>>>>>>>>>>>>>>>>> like to hear what other people think/need.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big
>>>>>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on 
>>>>>>>>>>>>>>>>>> versions and I
>>>>>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for 
>>>>>>>>>>>>>>>>>> users.  I'm
>>>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that 
>>>>>>>>>>>>>>>>>> only exist in
>>>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few 
>>>>>>>>>>>>>>>>>> weeks ago, and
>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code 
>>>>>>>>>>>>>>>>>> cross reference and
>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the 
>>>>>>>>>>>>>>>>>> same repository.
>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 
>>>>>>>>>>>>>>>>>> latest versions in a
>>>>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, 
>>>>>>>>>>>>>>>>>> it means we have
>>>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 
>>>>>>>>>>>>>>>>>> that is being
>>>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial 
>>>>>>>>>>>>>>>>>> differences. On top of
>>>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level 
>>>>>>>>>>>>>>>>>> and may break not
>>>>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If there are some features requiring a newer version, it
>>>>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. 
>>>>>>>>>>>>>>>>>> Personally, I don’t
>>>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want 
>>>>>>>>>>>>>>>>>> new features.
>>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this 
>>>>>>>>>>>>>>>>>> approach too that we
>>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features 
>>>>>>>>>>>>>>>>>> would also
>>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with 
>>>>>>>>>>>>>>>>>> that and that
>>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want 
>>>>>>>>>>>>>>>>>> to find a
>>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use 
>>>>>>>>>>>>>>>>>> for the users.
>>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be 
>>>>>>>>>>>>>>>>>> sufficient but our Spark
>>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few 
>>>>>>>>>>>>>>>>>> weeks ago, and
>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code 
>>>>>>>>>>>>>>>>>> cross reference and
>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the 
>>>>>>>>>>>>>>>>>> same repository.
>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 
>>>>>>>>>>>>>>>>>> latest versions in a
>>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are 
>>>>>>>>>>>>>>>>>> unwilling to
>>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version 
>>>>>>>>>>>>>>>>>> branches. If
>>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes 
>>>>>>>>>>>>>>>>>> sense to move
>>>>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the
>>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to 
>>>>>>>>>>>>>>>>>> all other
>>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that 
>>>>>>>>>>>>>>>>>> would slow down
>>>>>>>>>>>>>>>>>> its development speed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>>>>>> great to support older versions, but because we compile
>>>>>>>>>>>>>>>>>>> against 3.0, we cannot use any Spark features that are
>>>>>>>>>>>>>>>>>>> offered in newer versions.
>>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>>>>>>>>> important features, such as dynamic filtering for v2 tables,
>>>>>>>>>>>>>>>>>>> required distribution and ordering for writes, etc. These
>>>>>>>>>>>>>>>>>>> features are too important to ignore.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of 
>>>>>>>>>>>>>>>>>>> the 3.2 features.
>>>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us 
>>>>>>>>>>>>>>>>>>> internally and would
>>>>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a
>>>>>>>>>>>>>>>>>>> while by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is 
>>>>>>>>>>>>>>>>>>> actively
>>>>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 
>>>>>>>>>>>>>>>>>>> releases will only
>>>>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to 
>>>>>>>>>>>>>>>>>>> migrate to Spark 3.2
>>>>>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more
>>>>>>>>>>>>>>>>>>> work to release, will need a new release of the core format 
>>>>>>>>>>>>>>>>>>> to consume any
>>>>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but
>>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format 
>>>>>>>>>>>>>>>>>>> more frequently
>>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and 
>>>>>>>>>>>>>>>>>>> the overall
>>>>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>>
>>>>>>>>
>
> --
> Ryan Blue
> Tabular
>
