Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Shivaram Venkataraman Tue, 05 Apr 2016 12:08:32 -0700

Overall this sounds good to me. One question: in addition to the ML
algorithms, we have a number of linear algebra utilities (various
distributed matrices) and statistical methods in the spark.mllib
package. Is the plan to port or move these to the spark.ml namespace
in the 2.x series?


Thanks
Shivaram

On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
> FWIW, all of that sounds like a good plan to me. Developing one API is
> certainly better than two.
>
> On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> Hi all,
>>
>> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>> been developed under the spark.ml package, while the old RDD-based API has
>> been developed in parallel under the spark.mllib package. While it was
>> easier to implement and experiment with new APIs under a new package, it
>> became harder and harder to maintain as both packages grew bigger and
>> bigger. New users are also often confused by having two sets of APIs with
>> overlapping functionality.
>>
>> We started to recommend the DataFrame-based API over the RDD-based API in
>> Spark 1.5 for its versatility and flexibility, and we saw the development
>> and the usage gradually shifting to the DataFrame-based API. Counting only
>> the lines of Scala code, from 1.5 to the current master we added ~10000
>> lines to the DataFrame-based API but only ~700 to the RDD-based API. So, to
>> focus more resources on the development of the DataFrame-based API and to
>> help users migrate over sooner, I want to propose switching RDD-based MLlib
>> APIs to maintenance mode in Spark 2.0. What does this mean exactly?
>>
>> * We do not accept new features in the RDD-based spark.mllib package, unless
>> they block implementing new features in the DataFrame-based spark.ml
>> package.
>> * We still accept bug fixes in the RDD-based API.
>> * We will add more features to the DataFrame-based API in the 2.x series to
>> reach feature parity with the RDD-based API.
>> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>> the RDD-based API.
>> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>
>> Though the RDD-based API is already in de facto maintenance mode, this
>> announcement will make that status explicit, which matters to both MLlib
>> developers and users. So we’d greatly appreciate your feedback!
>>
>> (As a side note, people sometimes use “Spark ML” to refer to the
>> DataFrame-based API or even the entire MLlib component. This also causes
>> confusion. To be clear, “Spark ML” is not an official name and there are no
>> plans to rename MLlib to “Spark ML” at this time.)
>>
>> Best,
>> Xiangrui
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

