Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Chris Fregly Tue, 05 Apr 2016 20:03:07 -0700

Perhaps renaming to Spark ML would actually clear up the confusion in the
code and documentation?


+1 for rename 

> On Apr 5, 2016, at 7:00 PM, Reynold Xin <r...@databricks.com> wrote:
> 
> +1
> 
> This is a no-brainer IMO.
> 
> 
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jos...@databricks.com> wrote:
>> +1. By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>> 
>>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> This sounds good to me as well. The one thing we should pay attention to is 
>>> how we update the docs so that people know to start with the spark.ml 
>>> classes. Right now the docs list spark.mllib first and also seem more 
>>> comprehensive in that area than in spark.ml, so maybe people naturally move 
>>> towards that.
>>> 
>>> Matei
>>> 
>>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <m...@databricks.com> wrote:
>>>> 
>>>> Yes, DB (cc'ed) is working on porting the local linear algebra library 
>>>> over (SPARK-13944). There are also frequent pattern mining algorithms we 
>>>> need to port over in order to reach feature parity. -Xiangrui
>>>> 
>>>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>>>> Overall this sounds good to me. One question I have is that in
>>>>> addition to the ML algorithms we have a number of linear algebra
>>>>> (various distributed matrices) and statistical methods in the
>>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>>> namespace in the 2.x series?
>>>>> 
>>>>> Thanks
>>>>> Shivaram
>>>>> 
>>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>>> > certainly better than two.
>>>>> >
>>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> More than a year ago, in Spark 1.2, we introduced the ML pipeline API
>>>>> >> built on top of Spark SQL’s DataFrames. Since then, the new
>>>>> >> DataFrame-based API has been developed under the spark.ml package,
>>>>> >> while the old RDD-based API has been developed in parallel under the
>>>>> >> spark.mllib package. While it was easier to implement and experiment
>>>>> >> with new APIs under a new package, the code became harder and harder
>>>>> >> to maintain as both packages grew. New users are also often confused
>>>>> >> by having two sets of APIs with overlapping functionality.
>>>>> >>
>>>>> >> We started to recommend the DataFrame-based API over the RDD-based API
>>>>> >> in Spark 1.5 for its versatility and flexibility, and we have seen
>>>>> >> development and usage gradually shift to the DataFrame-based API.
>>>>> >> Counting just the lines of Scala code, from 1.5 to the current master
>>>>> >> we added ~10000 lines to the DataFrame-based API but only ~700 to the
>>>>> >> RDD-based API. So, to focus more resources on the development of the
>>>>> >> DataFrame-based API and to help users migrate sooner, I want to
>>>>> >> propose switching the RDD-based MLlib APIs to maintenance mode in
>>>>> >> Spark 2.0. What does that mean exactly?
>>>>> >>
>>>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>>>> >> unless they block implementing new features in the DataFrame-based
>>>>> >> spark.ml package.
>>>>> >> * We still accept bug fixes in the RDD-based API.
>>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>>> >> series to reach feature parity with the RDD-based API.
>>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>>> >> deprecate the RDD-based API.
>>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>>> >> 3.0.
>>>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>>>> >> announcement makes that status explicit, which matters to both MLlib
>>>>> >> developers and users. So we’d greatly appreciate your feedback!
>>>>> >>
>>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>>> >> causes confusion. To be clear, “Spark ML” is not an official name and
>>>>> >> there are no plans to rename MLlib to “Spark ML” at this time.)
>>>>> >>
>>>>> >> Best,
>>>>> >> Xiangrui
