Yes, DB (cc'ed) is working on porting the local linear algebra library over (SPARK-13944). There are also frequent pattern mining algorithms we need to port over in order to reach feature parity.

-Xiangrui
On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series?
>
> Thanks
> Shivaram
>
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
> >> been developed under the spark.ml package, while the old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~10000
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, unless
> >>   they block implementing new features in the DataFrame-based spark.ml
> >>   package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >>   reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >>   the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui