I agree, I don't see a pressing need for a major version bump either.

Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <m...@clearstorydata.com> wrote:
>
> Changing major version numbers is not about new features or a vague notion 
> that it is time to do something that will be seen to be a significant 
> release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy <andyye...@gmail.com> wrote:
>>
>> Dear all:
>>
>> It has been 2 months since this topic was proposed. Any progress now? 2018 
>> is already about half over.
>>
>> I agree that the new version should bring some exciting new features. How 
>> about this one:
>>
>> 6. Integrate an ML/DL framework as a core component and feature (such as 
>> Angel / BigDL / ...).
>>
>> 3.0 is a very important version for a good open source project. It would be 
>> better to shed the historical burden and focus on new areas. Spark has been 
>> widely used all over the world as a successful big data framework, and it 
>> can be even better than that.
>>
>> Andy
>>
>>
>> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>> There was a discussion thread on scala-contributors about Apache Spark not 
>>> yet supporting Scala 2.12, and that got me to think perhaps it is about 
>>> time for Spark to work towards the 3.0 release. By the time it comes out, 
>>> it will be more than 2 years since Spark 2.0.
>>>
>>> For contributors less familiar with Spark’s history, I want to give more 
>>> context on Spark releases:
>>>
>>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If 
>>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 
>>> in 2018.
>>>
>>> 2. Spark’s versioning policy promises that Spark does not break stable APIs 
>>> in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
>>> necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 
>>> 3.0).
>>>
>>> 3. That said, a major version isn’t necessarily the playground for 
>>> disruptive API changes to make it painful for users to update. The main 
>>> purpose of a major release is an opportunity to fix things that are broken 
>>> in the current API and remove certain deprecated APIs.
>>>
>>> 4. Spark as a project has a culture of evolving architecture and developing 
>>> major new features incrementally, so major releases are not the only time 
>>> for exciting new features. For example, the bulk of the work in the move 
>>> towards the DataFrame API was done in Spark 1.3, and Continuous Processing 
>>> was introduced in Spark 2.3. Both were feature releases rather than major 
>>> releases.
>>>
>>>
>>> You can find more background in the thread discussing Spark 2.0: 
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>
>>>
>>> The primary motivating factor IMO for a major version bump is to support 
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
>>> Similar to Spark 2.0, I think there are also opportunities for other 
>>> changes that we know have been biting us for a long time but can’t be 
>>> changed in feature releases (to be clear, I’m actually not sure they are 
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>> 1. Support Scala 2.12.
>>>
>>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 
>>> 2.x.
>>>
>>> 3. Shade all dependencies.
>>>
>>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, 
>>> to prevent users from shooting themselves in the foot, e.g. “SELECT 2 
>>> SECOND” -- is “SECOND” an interval unit or an alias? To make it less 
>>> painful for users to upgrade here, I’d suggest creating a flag for backward 
>>> compatibility mode (see the short sketch after this list).
>>>
>>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard 
>>> compliant, and have a flag for backward compatibility (also sketched below).
>>>
>>> 6. Miscellaneous other small changes documented in JIRA already (e.g. 
>>> “JavaPairRDD flatMapValues requires function returning Iterable, not 
>>> Iterator”, “Prevent column name duplication in temporary view”).
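>>>
>>> To make 4 concrete, here is a minimal, untested sketch in Scala (the config 
>>> name in the last line is hypothetical and only illustrates the idea of a 
>>> backward compatibility flag):
>>>
>>>   import org.apache.spark.sql.SparkSession
>>>
>>>   val spark = SparkSession.builder().master("local[*]").getOrCreate()
>>>
>>>   // With SECOND treated as a non-reserved word (today's behavior), the
>>>   // parser can read this as the literal 2 aliased to a column named
>>>   // SECOND; under ANSI-style reserved keywords it would instead be
>>>   // rejected or read as an interval unit.
>>>   spark.sql("SELECT 2 SECOND").show()
>>>
>>>   // Hypothetical flag name, purely to show the compatibility-mode idea:
>>>   // spark.conf.set("spark.sql.parser.ansiReservedKeywords", "false")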
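>>>
>>> And for 5, a similarly hedged illustration of the kind of implicit coercion 
>>> that a more standard-compliant rule (plus a compatibility flag) would 
>>> govern; exact behavior differs across versions, so treat this only as an 
>>> example of where such a rule applies:
>>>
>>>   // Comparing a string with an integer is implicitly coerced today;
>>>   // a stricter, standard-leaning rule might require the explicit CAST.
>>>   spark.sql("SELECT '2' > 1").show()
>>>   spark.sql("SELECT CAST('2' AS INT) > 1").show()  // explicit form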
>>>
>>>
>>> Now the reality of a major version bump is that the world often thinks in 
>>> terms of what exciting features are coming. I do think there are a number 
>>> of major changes happening already that can be part of the 3.0 release, if 
>>> they make it in:
>>>
>>> 1. Scala 2.12 support (listing it twice)
>>> 2. Continuous Processing non-experimental
>>> 3. Kubernetes support non-experimental
>>> 4. A more fleshed out version of data source API v2 (I don’t think it is 
>>> realistic to stabilize that in one release)
>>> 5. Hadoop 3.0 support
>>> 6. ...
>>>
>>>
>>>
>>> Similar to the 2.0 discussion, this thread should focus on the framework 
>>> and whether it’d make sense to create Spark 3.0 as the next release, rather 
>>> than the individual feature requests. Those are important but are best done 
>>> in their own separate threads.
>>>
>>>
>>>
>>>

