There was a discussion thread on scala-contributors <https://contributors.scala-lang.org/t/spark-as-a-scala-gateway-drug-and-the-2-12-failure/1747> about Apache Spark not yet supporting Scala 2.12, and that got me to think perhaps it is about time for Spark to work towards the 3.0 release. By the time it comes out, it will be more than 2 years since Spark 2.0.
For contributors less familiar with Spark’s history, I want to give more context on Spark releases: 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018. 2. Spark’s versioning policy promises that Spark does not break stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0). 3. That said, a major version isn’t necessarily the playground for disruptive API changes to make it painful for users to update. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs. 4. Spark as a project has a culture of evolving architecture and developing major new features incrementally, so major releases are not the only time for exciting new features. For example, the bulk of the work in the move towards the DataFrame API was done in Spark 1.3, and Continuous Processing was introduced in Spark 2.3. Both were feature releases rather than major releases. You can find more background in the thread discussing Spark 2.0: http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration): 1. Support Scala 2.12. 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 2.x. 3. Shade all dependencies. 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, to prevent users from shooting themselves in the foot, e.g. “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it less painful for users to upgrade here, I’d suggest creating a flag for backward compatibility mode. 5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard compliant, and have a flag for backward compatibility. 6. Miscellaneous other small changes documented in JIRA already (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not Iterator”, “Prevent column name duplication in temporary view”). Now the reality of a major version bump is that the world often thinks in terms of what exciting features are coming. I do think there are a number of major changes happening already that can be part of the 3.0 release, if they make it in: 1. Scala 2.12 support (listing it twice) 2. Continuous Processing non-experimental 3. Kubernetes support non-experimental 4. A more flushed out version of data source API v2 (I don’t think it is realistic to stabilize that in one release) 5. Hadoop 3.0 support 6. ... Similar to the 2.0 discussion, this thread should focus on the framework and whether it’d make sense to create Spark 3.0 as the next release, rather than the individual feature requests. Those are important but are best done in their own separate threads.