Do we have a summary of all the discussions and what is planned for 2.0 then?  
Perhaps we should put it on the wiki for reference.
Tom 

    On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <r...@databricks.com> 
wrote:
 

 FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 
On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:

I’m starting a new thread since the other one got intermixed with feature 
requests. Please refrain from making feature requests in this thread. Not that 
we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 
2.2, ...
First - I want to propose a premise for how to think about Spark 2.0 and major 
releases in Spark, based on discussion with several members of the community: a 
major release should be low overhead and minimally disruptive to the Spark 
community. A major release should not be very different from a minor release 
and should not be gated based on new features. The main purpose of a major 
release is an opportunity to fix things that are broken in the current API and 
remove certain deprecated APIs (examples follow).
For this reason, I would *not* propose doing major releases to break 
substantial APIs or perform large re-architecting that prevents users from 
upgrading. Spark has always had a culture of evolving architecture 
incrementally and making changes - and I don't think we want to change this 
model. In fact, we’ve released many architectural changes on the 1.X line.
If the community likes the above model, then to me it seems reasonable to do 
Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after 
Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major 
releases every 2 years seems doable within the above model.
Under this model, here is a list of example things I would propose doing in 
Spark 2.0, separated into APIs and Operation/Deployment:

APIs
1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
2. Remove Akka from Spark's API dependency (in streaming), so user applications 
can use their own version of Akka (SPARK-5293). We have gotten a lot of 
complaints about user applications being unable to use Akka because of Spark's 
dependency on it.
3. Remove Guava from Spark's public API (the Guava Optional exposed by the 
Java RDD API); see the Java sketch after this list.
4. Better class/package structure for low-level developer APIs. In particular, 
a number of DeveloperApi classes (mostly listener-related) have been added over 
the years, and some packages now contain only one or two public classes 
alongside a lot of private ones. A better structure would isolate the public 
classes into a few public packages, with minimal private classes in those 
packages.
5. Consolidate the task metric and accumulator APIs. Despite some subtle 
differences, the two are very similar but have completely different code paths 
(see the sketch after this list).
6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them 
to other package(s). They are already used beyond SQL (e.g. in ML pipelines) 
and will be used by streaming as well.
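
To make item 3 concrete, here is a small Java example against the current 1.x
API (the class name is mine, just for illustration). The left outer join forces
Guava's Optional into the result type, so every Java user compiles against
Guava whether they want to or not; the fix would be to return a Spark-owned
Optional instead (the exact replacement type is still an open question):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class GuavaOptionalLeak {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("guava-leak").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaPairRDD<String, Integer> left = sc.parallelizePairs(
              Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)));
          JavaPairRDD<String, Integer> right = sc.parallelizePairs(
              Arrays.asList(new Tuple2<>("a", 10)));
          // The result type below is what leaks Guava: in 1.x the missing
          // side of the join is wrapped in com.google.common.base.Optional,
          // making Guava part of Spark's public Java API.
          JavaPairRDD<String, Tuple2<Integer, com.google.common.base.Optional<Integer>>>
              joined = left.leftOuterJoin(right);
          joined.collect().forEach(System.out::println);
        }
      }
    }

Until that changes, the only real workarounds are shading Guava in the user
application or pinning the exact Guava version Spark ships with.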
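
For item 5, a rough sketch of the direction (the names and shapes below are
hypothetical, not an agreed-on design): one base type that both user-defined
accumulators and built-in task metrics implement, with a per-task add on the
executors and a merge on the driver, so the two no longer need separate code
paths:

    // Hypothetical consolidated accumulator/metric abstraction; the
    // class and method names here are illustrative, not Spark API.
    public abstract class MetricAccumulator<IN, OUT> {
      public abstract void add(IN value);    // called on executors, per task
      public abstract void merge(MetricAccumulator<IN, OUT> other);  // driver side
      public abstract OUT value();           // current aggregated result
    }

    // A user counter and a built-in metric such as "records read" could
    // then both be instances of the same type:
    class LongCounter extends MetricAccumulator<Long, Long> {
      private long sum = 0L;
      @Override public void add(Long v) { sum += v; }
      @Override public void merge(MetricAccumulator<Long, Long> other) {
        sum += other.value();
      }
      @Override public Long value() { return sum; }
    }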

Operation/Deployment
1. Scala 2.11 as the default build. We should still support Scala 2.10, but it 
has reached end-of-life.
2. Remove Hadoop 1 support. 
3. Assembly-free distribution of Spark: don’t require building an enormous 
assembly jar in order to run Spark.
