FWIW, all of that sounds like a good plan to me. Developing one API is certainly better than two.
On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi all,
>
> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
> been developed under the spark.ml package, while the old RDD-based API has
> been developed in parallel under the spark.mllib package. While it was
> easier to implement and experiment with new APIs under a new package, it
> became harder and harder to maintain as both packages grew bigger and
> bigger. And new users are often confused by having two sets of APIs with
> overlapped functions.
>
> We started to recommend the DataFrame-based API over the RDD-based API in
> Spark 1.5 for its versatility and flexibility, and we saw the development
> and the usage gradually shifting to the DataFrame-based API. Just counting
> the lines of Scala code, from 1.5 to the current master we added ~10000
> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> gather more resources on the development of the DataFrame-based API and to
> help users migrate over sooner, I want to propose switching RDD-based MLlib
> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>
> * We do not accept new features in the RDD-based spark.mllib package, unless
> they block implementing new features in the DataFrame-based spark.ml
> package.
> * We still accept bug fixes in the RDD-based API.
> * We will add more features to the DataFrame-based API in the 2.x series to
> reach feature parity with the RDD-based API.
> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> the RDD-based API.
> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>
> Though the RDD-based API is already in de facto maintenance mode, this
> announcement will make it clear and hence important to both MLlib developers
> and users. So we’d greatly appreciate your feedback!
>
> (As a side note, people sometimes use “Spark ML” to refer to the
> DataFrame-based API or even the entire MLlib component. This also causes
> confusion. To be clear, “Spark ML” is not an official name and there are no
> plans to rename MLlib to “Spark ML” at this time.)
>
> Best,
> Xiangrui