For example, JDK 11 requires dependency changes that cannot go into 2.4.7. 
Recent Kubernetes work, such as dynamic allocation in Spark 3.0 on Kubernetes 
(without an external shuffle service), would also be hard to land in 2.4.7.
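For context, a sketch of the kind of configuration involved: in Spark 3.0, dynamic allocation on Kubernetes can be enabled without an external shuffle service via shuffle tracking. Option names are from the Spark 3.0 configuration docs; the API server address and container image are placeholders.

```shell
# Spark 3.0 on Kubernetes: dynamic allocation without an external
# shuffle service, using shuffle tracking (new in 3.0).
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar
```

None of this exists in the 2.4 line, and backporting it would pull in the surrounding scheduler and Kubernetes resource-manager changes.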


> On Jun 12, 2020, at 11:50 PM, Reynold Xin <r...@databricks.com> wrote:
> 
> 
> Echoing Sean's earlier comment … What is the functionality that would go into 
> a 2.5.0 release, that can't be in a 2.4.7 release? 
> 
> 
>> On Fri, Jun 12, 2020 at 11:14 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>> Can I suggest we decouple this conversation a bit? First, we can see whether 
>> there is agreement in principle on making a transitional release, and then 
>> folks who feel strongly about specific backports can have their respective 
>> discussions. It's not like we normally know or agree on everything going into 
>> a release at the time we cut the branch.
>> 
>> On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin <r...@databricks.com> wrote:
>> I understand the argument to add JDK 11 support just to extend the EOL, but 
>> the other things seem kind of arbitrary and are not supported by your 
>> arguments, especially DSv2, which is a massive change. DSv2, IIUC, is not yet 
>> API-stable and will continue to evolve in the 3.x line. 
>> 
>> Spark is designed in a way that’s decoupled from storage, and as a result 
>> one can run multiple versions of Spark in parallel during migration. 
>> At the job level sure, but upgrading large jobs, possibly written in Scala 
>> 2.11, whole-hog as it currently stands is not a small matter. 
>> 
>> On Fri, Jun 12, 2020 at 9:40 PM DB Tsai <dbt...@dbtsai.com> wrote:
>> +1 for a 2.x release with DSv2, JDK 11, and Scala 2.11 support
>> 
>> We had an internal preview version of Spark 3.0 for our customers to try out 
>> for a while, and then we realized that it's very challenging for enterprise 
>> applications in production to move to Spark 3.0. For example, many of our 
>> customers' Spark applications depend on internal projects that may not be 
>> owned by the ETL teams; it takes much coordination with other teams to 
>> cross-build those dependencies with Scala 2.12 in order to use Spark 3.0. Now 
>> that we have removed Scala 2.11 support in Spark 3.0, there is a really big 
>> gap in migrating from the 2.x line to 3.0, based on my observation working 
>> with our customers.
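>> To illustrate the coordination cost, here is a minimal sketch (with a 
>> hypothetical project name and versions) of what each upstream team would 
>> need in its sbt build to publish artifacts for both Scala versions:
>> 
>> ```scala
>> // build.sbt -- cross-build a shared internal library for both the
>> // Scala 2.11 (Spark 2.4) and Scala 2.12 (Spark 3.0) worlds.
>> name := "internal-etl-utils"   // hypothetical project name
>> organization := "com.example"
>> version := "1.0.0"
>> 
>> scalaVersion := "2.12.10"
>> crossScalaVersions := Seq("2.11.12", "2.12.10")
>> 
>> // Spark itself is provided by the cluster at runtime.
>> libraryDependencies +=
>>   "org.apache.spark" %% "spark-sql" % "2.4.5" % Provided
>> ```
>> 
>> Publishing is then `sbt +publish`, which builds and publishes once per entry 
>> in `crossScalaVersions` - easy for one project, but it has to happen in 
>> dependency order across every team that owns a library in the chain.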
>> 
>> Also, JDK 8 is already EOL; in some companies, using JDK 8 is not supported 
>> by the infra team and requires an exception for running an unsupported JDK. 
>> Of course, those companies can use a vendor's Spark distribution such as CDH 
>> Spark 2.4, which supports JDK 11, or they can maintain their own Spark 
>> release, which is possible but not trivial.
>> 
>> As a result, having a 2.5 release with DSv2, JDK 11, and Scala 2.11 support 
>> could definitely narrow the gap, and users could still move forward using new 
>> features. After all, the reason we work on OSS is that we want people to use 
>> our code, isn't it?
>> 
>> Sincerely,
>> 
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>> 
>> 
>> On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> 
>> wrote:
>> I guess we already went through the same discussion, right? If anyone missed 
>> it, please go through that discussion thread. [1] The consensus looked 
>> negative on migrating the new DSv2 into the Spark 2.x line, because the 
>> change is pretty much huge and also backward incompatible.
>> 
>> The benefit I can think of in having Spark 2.5 is avoiding a forced upgrade 
>> to the major release just to get fixes for critical bugs. Not all critical 
>> fixes landed in 2.x either, because some of them bring backward 
>> incompatibility. We didn't land those fixes in the 2.x line because we 
>> weren't considering a Spark 2.5 at the time - we don't want end users to have 
>> to tolerate such inconvenience when upgrading a bugfix version. End users may 
>> be OK tolerating it when upgrading a minor version, since they can still stay 
>> on 2.4.x to decline these fixes.
>> 
>> In addition, given the huge time gap between Spark 2.4 and 3.0, we might want 
>> to consider porting some features that don't bring backward incompatibility. 
>> New major features would probably be better introduced in Spark 3.0 itself, 
>> but some features could be ported, especially if a feature resolves a 
>> long-standing issue or has long been available in competing products.
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> 1. 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979
>> 
>> On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>> +1 for a 2.x release with a DSv2 API that matches 3.0.
>> 
>> There are a lot of big differences between the API in 2.4 and 3.0, and I 
>> think a release to help migrate would be beneficial to organizations like 
>> ours that will be supporting 2.x and 3.0 in parallel for quite a while. 
>> Migration to Spark 3 is going to take time as people build confidence in it. 
>> I don't think that can be avoided by leaving a larger feature gap between 
>> 2.x and 3.0.
>> 
>> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li <lix...@databricks.com> wrote:
>> Based on my understanding, DSv2 is not stable yet. It is still missing 
>> various features; even our built-in file sources are still unable to fully 
>> migrate to DSv2. We plan to enhance it in the next few releases to close the 
>> gap. 
>> 
>> Also, the DSv2 changes in Spark 3.0 did not break any existing applications. 
>> We should encourage more users to try Spark 3 and increase the adoption of 
>> Spark 3.x. 
>> 
>> Xiao 
>> 
>> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>> So, one of the things we’re planning on backporting internally is DSv2, which 
>> I think would be more broadly useful if it were available in a community 
>> release on a 2.x branch. Anything else on top of that would be considered on 
>> a case-by-case basis, based on whether it makes for an easier upgrade path to 
>> 3.
>> 
>> If we’re worried about people using 2.5 as a long-term home, we could always 
>> mark it with “-transitional” or something similar?
>> 
>> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen <sro...@gmail.com> wrote:
>> What is the functionality that would go into a 2.5.0 release, that can't be 
>> in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x 
>> maintenance branch, and I personally could imagine being open to more freely 
>> backporting a few new features for 2.x users, whereas usually it's only bug 
>> fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance branch but 
>> there's something too big for a 'normal' maintenance release, and I think 
>> the whole question turns on what that is.
>> 
>> If it's things like JDK 11 support, I think that is unfortunately fairly 
>> 'breaking' because of dependency updates. But maybe that's not it.
>> 
>> 
>> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>> Hi Folks,
>> 
>> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5 release. 
>> Spark 3 brings a number of important changes, and by its nature is not 
>> backward compatible. I think we'd all like to have as smooth an upgrade 
>> experience to Spark 3 as possible, and I believe that having a Spark 2 
>> release with some of the new functionality, while continuing to support the 
>> older APIs and the current Scala version, would make the upgrade path 
>> smoother.
>> 
>> This pattern is not uncommon in other Hadoop ecosystem projects, like Hadoop 
>> itself and HBase.
>> 
>> I know that Ryan Blue has indicated he is already going to be maintaining 
>> something like that internally at Netflix, and we'll be doing the same thing 
>> at Apple. It seems like having a transitional release could benefit the 
>> community by easing migrations and helping avoid duplicated work.
>> 
>> I want to be clear that I'm volunteering to do the work of managing a 2.5 
>> release, so hopefully this wouldn't create any substantial burden on the 
>> community.
>> 
>> Cheers,
>> 
>> Holden
>> -- 
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): 
>> https://amzn.to/2MaRAG9 
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
>> 
>> 
> 
