*Dear all:* It has been two months since this topic was proposed. Is there any progress? 2018 is already about half over.
I agree that the new version should bring some exciting new features. How about this one: *6. Integrate an ML/DL framework as a core component and feature (such as Angel / BigDL / ...).*

3.0 is a very important version for a good open source project. It would be better to shed the historical burden and *focus on new areas*. Spark is already widely used all over the world as a successful big data framework, and it can be better still.

*Andy*

On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <r...@databricks.com> wrote:

> There was a discussion thread on scala-contributors
> <https://contributors.scala-lang.org/t/spark-as-a-scala-gateway-drug-and-the-2-12-failure/1747>
> about Apache Spark not yet supporting Scala 2.12, and that got me to think perhaps it is about time for Spark to work towards the 3.0 release. By the time it comes out, it will be more than 2 years since Spark 2.0.
>
> For contributors less familiar with Spark’s history, I want to give more context on Spark releases:
>
> 1. Timeline: Spark 1.0 was released in May 2014. Spark 2.0 was July 2016. If we were to maintain the ~2-year cadence, it is time to work on Spark 3.0 in 2018.
>
> 2. Spark’s versioning policy promises that Spark does not break stable APIs in feature releases (e.g. 2.1, 2.2). API-breaking changes are sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).
>
> 3. That said, a major version isn’t necessarily a playground for disruptive API changes that make it painful for users to upgrade. The main purpose of a major release is an opportunity to fix things that are broken in the current API and to remove certain deprecated APIs.
>
> 4. Spark as a project has a culture of evolving its architecture and developing major new features incrementally, so major releases are not the only time for exciting new features. For example, the bulk of the work in the move towards the DataFrame API was done in Spark 1.3, and Continuous Processing was introduced in Spark 2.3. Both were feature releases rather than major releases.
>
> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>
> The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration):
>
> 1. Support Scala 2.12.
>
> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 2.x.
>
> 3. Shade all dependencies (a shading sketch follows the quoted message below).
>
> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, to prevent users from shooting themselves in the foot, e.g. “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? (A sketch of this ambiguity also follows below.) To make it less painful for users to upgrade here, I’d suggest creating a flag for a backward compatibility mode.
>
> 5. Similar to 4, make our type coercion rules in DataFrame/SQL more standard compliant, and have a flag for backward compatibility (see the coercion sketch below).
>
> 6. Miscellaneous other small changes documented in JIRA already (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not Iterator”, “Prevent column name duplication in temporary view”).
>
> Now the reality of a major version bump is that the world often thinks in terms of what exciting features are coming. I do think there are a number of major changes happening already that can be part of the 3.0 release, if they make it in:
>
> 1. Scala 2.12 support (listing it twice)
> 2. Continuous Processing non-experimental
> 3. Kubernetes support non-experimental
> 4. A more fleshed out version of the data source API v2 (I don’t think it is realistic to stabilize that in one release)
> 5. Hadoop 3.0 support
> 6. ...
>
> Similar to the 2.0 discussion, this thread should focus on the framework and whether it’d make sense to create Spark 3.0 as the next release, rather than the individual feature requests. Those are important but are best done in their own separate threads.
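To make point 3 of the first list concrete, here is a minimal sketch of what shading a single dependency looks like. It uses sbt-assembly's ShadeRule API purely for illustration (Spark's real build shades through Maven), and the org.sparkproject.guava target package is an assumed name, not anything agreed upon:

```scala
// build.sbt -- illustrative sketch; assumes the sbt-assembly plugin is enabled.
// Relocate the bundled copy of Guava under a private package so it cannot
// clash with whatever Guava version a user application brings along.
assemblyShadeRules in assembly := Seq(
  // Rewrite every bytecode reference to com.google.common.* classes,
  // in our own classes and in all bundled jars, to the shaded name.
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.guava.@1").inAll
)
```

Shading *all* dependencies would extend this same relocation idea to every third-party package Spark bundles, so none of them leak onto the user classpath.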
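And here is the sketch of the “SELECT 2 SECOND” ambiguity from point 4, as I understand current Spark 2.x behavior (assumes an existing SparkSession named `spark`; the commented expectations are mine, not verified output from any particular release):

```scala
// Because SECOND is not a reserved keyword today, Spark parses it as a
// column alias: the query returns a single integer column named "second"
// whose value is 2 -- not a 2-second interval.
spark.sql("SELECT 2 SECOND").printSchema()

// Getting an interval requires spelling it out; here the result column is
// a CalendarInterval value rather than an aliased integer literal.
spark.sql("SELECT INTERVAL 2 SECOND").printSchema()
```

Reserving SECOND (and its siblings) would make the first query a parse error unless the proposed backward compatibility flag is on; any concrete flag name here would be invented, so I leave it out.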
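Likewise for point 5, a sketch of the kind of Hive-style implicit coercions at issue, again on an assumed Spark 2.x session named `spark` (exact results may vary by version):

```scala
// Spark silently casts the string side to a numeric type, so both queries
// succeed where a stricter, standard-compliant analyzer might reject them.
spark.sql("SELECT '2' + 1").show()   // string + int coerces to DOUBLE: 3.0
spark.sql("SELECT '1' = 1").show()   // the comparison coerces too: true
```

A standard-compliant rule set would at minimum narrow these implicit casts, with the backward compatibility flag restoring today's behavior.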