+1

From: Matei Zaharia <matei.zaha...@gmail.com>
Date: Tuesday, April 5, 2016 at 4:58 PM
To: Xiangrui Meng <m...@databricks.com>
Cc: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>, Sean Owen <so...@cloudera.com>, Xiangrui Meng <men...@gmail.com>, dev <d...@spark.apache.org>, "user @spark" <user@spark.apache.org>, DB Tsai <dbt...@dbtsai.com>
Subject: Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

> This sounds good to me as well. The one thing we should pay attention to is
> how we update the docs so that people know to start with the spark.ml classes.
> Right now the docs list spark.mllib first and also seem more comprehensive in
> that area than in spark.ml, so maybe people naturally move towards that.
>
> Matei
>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <m...@databricks.com> wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library over
>> (SPARK-13944). There are also frequent pattern mining algorithms we need to
>> port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>>>> >> built on top of Spark SQL's DataFrames. Since then the new
>>>>> >> DataFrame-based API has been developed under the spark.ml package,
>>>>> >> while the old RDD-based API has been developed in parallel under the
>>>>> >> spark.mllib package. While it was easier to implement and experiment
>>>>> >> with new APIs under a new package, it became harder and harder to
>>>>> >> maintain as both packages grew bigger and bigger. And new users are
>>>>> >> often confused by having two sets of APIs with overlapped functions.
>>>>> >>
>>>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>>>> >> API in Spark 1.5 for its versatility and flexibility, and we saw the
>>>>> >> development and the usage gradually shifting to the DataFrame-based
>>>>> >> API. Just counting the lines of Scala code, from 1.5 to the current
>>>>> >> master we added ~10000 lines to the DataFrame-based API while ~700 to
>>>>> >> the RDD-based API. So, to gather more resources on the development of
>>>>> >> the DataFrame-based API and to help users migrate over sooner, I want
>>>>> >> to propose switching RDD-based MLlib APIs to maintenance mode in
>>>>> >> Spark 2.0. What does it mean exactly?
>>>>> >>
>>>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>>>> >> unless they block implementing new features in the DataFrame-based
>>>>> >> spark.ml package.
>>>>> >> * We still accept bug fixes in the RDD-based API.
>>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>>> >> series to reach feature parity with the RDD-based API.
>>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>>> >> deprecate the RDD-based API.
>>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>>> >> 3.0.
>>>>> >>
>>>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>>>> >> announcement will make it clear and hence important to both MLlib
>>>>> >> developers and users. So we'd greatly appreciate your feedback!
>>>>> >>
>>>>> >> (As a side note, people sometimes use "Spark ML" to refer to the
>>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>>> >> causes confusion. To be clear, "Spark ML" is not an official name and
>>>>> >> there are no plans to rename MLlib to "Spark ML" at this time.)
>>>>> >>
>>>>> >> Best,
>>>>> >> Xiangrui