Can I suggest we maybe decouple this conversation a bit? First, see if there is agreement in principle on making a transitional release; then folks who feel strongly about specific backports can have their respective discussions. It's not like we normally know or have agreement on everything going into a release at the time we cut the branch.
On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin <r...@databricks.com> wrote:

> I understand the argument to add JDK 11 support just to extend the EOL, but the other things seem kind of arbitrary and are not supported by your arguments, especially DSv2, which is a massive change. DSv2 IIUC is not API-stable yet and will continue to evolve in the 3.x line.
>
> Spark is designed in a way that's decoupled from storage, and as a result one can run multiple versions of Spark in parallel during migration.

At the job level, sure, but upgrading large jobs, possibly written in Scala 2.11, whole-hog as it currently stands is not a small matter.

> On Fri, Jun 12, 2020 at 9:40 PM DB Tsai <dbt...@dbtsai.com> wrote:
>
>> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support.
>>
>> We had an internal preview version of Spark 3.0 for our customers to try out for a while, and we realized that it's very challenging for enterprise applications in production to move to Spark 3.0. For example, many of our customers' Spark applications depend on internal projects that may not be owned by the ETL teams; it requires much coordination with other teams to cross-build the dependencies that Spark applications depend on with Scala 2.12 in order to use Spark 3.0. Now that we have removed Scala 2.11 support in Spark 3.0, this creates a really big gap when migrating from the 2.x line to 3.0, based on my observation working with our customers.
>>
>> Also, JDK8 is already EOL; in some companies, using JDK8 is not supported by the infra team and requires an exception to use an unsupported JDK. Of course, those companies can use a vendor's Spark distribution such as CDH Spark 2.4, which supports JDK11, or they can maintain their own Spark release, which is possible but not trivial.
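The cross-building DB Tsai describes is typically handled with sbt's `crossScalaVersions`. A minimal sketch of what an internal library's build might look like (the project name and version numbers here are illustrative, not taken from the thread):

```scala
// build.sbt -- illustrative sketch of cross-building an internal library
// for both Scala 2.11 (the Spark 2.x default) and Scala 2.12 (required by
// Spark 3.0). Name and version numbers are hypothetical.
name := "internal-etl-lib"

scalaVersion := "2.12.10"
crossScalaVersions := Seq("2.11.12", "2.12.10")

// %% appends the Scala binary suffix (_2.11 or _2.12) to the artifact name,
// so each cross-build resolves a Spark artifact built for that Scala
// version; Spark 2.4.x publishes artifacts for both.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6" % Provided
```

With something like this in place, `sbt +publishLocal` publishes one artifact per entry in `crossScalaVersions`, and downstream Spark applications pick up the one matching their Scala binary version. The coordination cost DB Tsai mentions comes from every such internal dependency needing this treatment before an application can move to 3.0.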
>> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support can definitely lower the gap, and users can still move forward using new features. After all, the reason we are working on OSS is that we want people to use our code, isn't it?
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> I guess we already went through the same discussion, right? If anyone missed it, please go through the discussion thread. [1] The consensus did not look positive on migrating the new DSv2 into the Spark 2.x line, because the change is pretty huge, and also backward incompatible.
>>>
>>> What I can think of as a benefit of having Spark 2.5 is avoiding a forced upgrade to the major release to get fixes for critical bugs. Not all critical fixes landed in 2.x, because some fixes bring backward incompatibility. We didn't land these fixes in the 2.x line because we didn't consider having a Spark 2.5 before - we don't want to make end users tolerate the incompatibility when upgrading a bugfix version. End users may be OK with tolerating it when upgrading a minor version, since they can still stay on 2.4.x and decline these fixes.
>>>
>>> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we might want to consider porting some features which don't bring backward incompatibility. New major features of Spark 3.0 would probably be better introduced in Spark 3.0 itself, but some features could be ported, especially if a feature resolves a long-standing issue or has been available for a long time in competing products.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1.
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979
>>>
>>> On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> +1 for a 2.x release with a DSv2 API that matches 3.0.
>>>>
>>>> There are a lot of big differences between the API in 2.4 and 3.0, and I think a release to help migrate would be beneficial to organizations like ours that will be supporting 2.x and 3.0 in parallel for quite a while. Migration to Spark 3 is going to take time as people build confidence in it. I don't think that can be avoided by leaving a larger feature gap between 2.x and 3.0.
>>>>
>>>> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li <lix...@databricks.com> wrote:
>>>>
>>>>> Based on my understanding, DSV2 is not stable yet. It is still missing various features. Even our built-in file sources are still unable to fully migrate to DSV2. We plan to enhance it in the next few releases to close the gap.
>>>>>
>>>>> Also, the changes to DSV2 in Spark 3.0 did not break any existing application. We should encourage more users to try Spark 3 and increase the adoption of Spark 3.x.
>>>>>
>>>>> Xiao
>>>>>
>>>>> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>
>>>>>> So one of the things we're planning on backporting internally is DSv2, and I think having it available in a community release on a 2.x branch would be more broadly useful. Anything else on top of that would be considered on a case-by-case basis, depending on whether it makes for an easier upgrade path to 3.
>>>>>>
>>>>>> If we're worried about people using 2.5 as a long-term home, we could always mark it with "-transitional" or something similar?
>>>>>>
>>>>>> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> What is the functionality that would go into a 2.5.0 release that can't be in a 2.4.7 release?
>>>>>>> I think that's the key question. 2.4.x is the 2.x maintenance branch, and I personally could imagine being open to more freely backporting a few new features for 2.x users, whereas usually it's only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance branch but that there's something too big for a 'normal' maintenance release, and I think the whole question turns on what that is.
>>>>>>>
>>>>>>> If it's things like JDK 11 support, I think that is unfortunately fairly 'breaking' because of dependency updates. But maybe that's not it.
>>>>>>>
>>>>>>> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>>
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5 release. Spark 3 brings a number of important changes, and by its nature is not backward compatible. I think we'd all like to have as smooth an upgrade experience to Spark 3 as possible, and I believe that having a Spark 2 release with some of the new functionality, while continuing to support the older APIs and current Scala version, would make the upgrade path smoother.
>>>>>>>>
>>>>>>>> This pattern is not uncommon in other Hadoop ecosystem projects, like Hadoop itself and HBase.
>>>>>>>>
>>>>>>>> I know that Ryan Blue has indicated he is already going to be maintaining something like that internally at Netflix, and we'll be doing the same thing at Apple. It seems like having a transitional release could benefit the community with easy migrations and help avoid duplicated work.
>>>>>>>>
>>>>>>>> I want to be clear I'm volunteering to do the work of managing a 2.5 release, so hopefully this wouldn't create any substantial burdens on the community.
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix