Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

DB Tsai Wed, 06 Apr 2016 15:59:21 -0700

+1 for renaming the jar file.

Sincerely,


DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly <ch...@fregly.com> wrote:
> perhaps renaming to Spark ML would actually clear up code and documentation
> confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin <r...@databricks.com> wrote:
>
> +1
>
> This is a no-brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <jos...@databricks.com>
> wrote:
>>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally move
>>> towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <m...@databricks.com> wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>>>
>>>> Overall this sounds good to me. One question I have is that in
>>>> addition to the ML algorithms we have a number of linear algebra
>>>> (various distributed matrices) and statistical methods in the
>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>> namespace in the 2.x series?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <men...@gmail.com>
>>>> > wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> More than a year ago, in Spark 1.2, we introduced the ML pipeline API built
>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>>>> >> been developed under the spark.ml package, while the old RDD-based API has
>>>> >> been developed in parallel under the spark.mllib package. While it was
>>>> >> easier to implement and experiment with new APIs under a new package, it
>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>> >> bigger. And new users are often confused by having two sets of APIs with
>>>> >> overlapping functions.
>>>> >>
>>>> >> We started to recommend the DataFrame-based API over the RDD-based API in
>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the development
>>>> >> and the usage gradually shifting to the DataFrame-based API. Just counting
>>>> >> the lines of Scala code, from 1.5 to the current master we added ~10000
>>>> >> lines to the DataFrame-based API but only ~700 to the RDD-based API. So, to
>>>> >> focus more resources on the development of the DataFrame-based API and to
>>>> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>> >>
>>>> >> * We do not accept new features in the RDD-based spark.mllib package, unless
>>>> >> they block implementing new features in the DataFrame-based spark.ml package.
>>>> >> * We still accept bug fixes in the RDD-based API.
>>>> >> * We will add more features to the DataFrame-based API in the 2.x series to
>>>> >> reach feature parity with the RDD-based API.
>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>>>> >> the RDD-based API.
>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>>> >>
>>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>>> >> announcement will make it explicit, which matters to both MLlib developers
>>>> >> and users. So we’d greatly appreciate your feedback!
>>>> >>
>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>> >> DataFrame-based API or even the entire MLlib component. This also causes
>>>> >> confusion. To be clear, “Spark ML” is not an official name and there are no
>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>>> >
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
