[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213274#comment-15213274
 ] 

Joseph K. Bradley commented on SPARK-14033:
-------------------------------------------

We've been using "MLlib" to refer to both spark.mllib and spark.ml.  I think 
it's clearer to users to refer to spark.ml as the "DataFrame-based API" and 
spark.mllib as the "RDD-based API."

{quote}
1) What is lacking about the current spark documentation that makes this 
transition/onboarding difficult for users coming from scikit? 
{quote}
--> The main complaints I've heard are that users (a) find the Estimator in 
the docs, such as LogisticRegression, but do not immediately realize that they 
also need to read the docs for LogisticRegressionModel, or (b) make the same 
mistake with imports: they import the Estimator but not the corresponding 
Model.  I'm not quite sure how to make this clearer in the docs.

{quote}
2) Understanding the distinction between MLlib and spark.ml is confusing at 
first; do you think this is perhaps part of the problem?
{quote}
--> It is a problem, but I think it's orthogonal.

{quote}
3) Can you give examples about what is unclear about the current semantics?
{quote}
I think the examples in the design doc about mutability & shared references 
within Pipelines are the best I've got.  [~mengxr] might have more.  As far as 
distinguishing between Estimators and Models, it really depends on the user's 
background.  I've given a lot of talks about Pipelines, and it's a bit tricky 
to explain how a Pipeline contains Estimators and Transformers, a Pipeline 
produces a PipelineModel, a PipelineModel contains only Models and 
Transformers, Models are a special type of Transformer, etc.  This proposal 
does reduce the number of concepts.
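
To make the relationships above concrete, here is a minimal plain-Python 
sketch.  These are toy stand-ins, not the real spark.ml classes: Doubler, 
MeanCenterer and MeanCentererModel are hypothetical stages invented for 
illustration, and rows are plain lists rather than DataFrames.

```python
# Toy model of the spark.ml concepts described above. These are NOT the real
# Spark classes; stage names and behavior are hypothetical.

class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        raise NotImplementedError


class Model(Transformer):
    """A Model is a Transformer that was produced by fitting an Estimator."""


class Doubler(Transformer):
    """Stateless stage: needs no fitting."""
    def transform(self, rows):
        return [2 * x for x in rows]


class MeanCentererModel(Model):
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class MeanCenterer(Estimator):
    """Fitting learns the mean and returns a separate Model object."""
    def fit(self, rows):
        return MeanCentererModel(sum(rows) / len(rows))


class PipelineModel(Model):
    """Holds only Transformers (some of which are fitted Models)."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Holds a mix of Estimators and Transformers."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # Estimator -> Model
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the next stage
        return PipelineModel(fitted)


pipeline = Pipeline([Doubler(), MeanCenterer()])
pipeline_model = pipeline.fit([1.0, 2.0, 3.0])
result = pipeline_model.transform([1.0, 2.0, 3.0])
```

Even in this toy version a user has to keep five concepts apart (Transformer, 
Estimator, Model, Pipeline, PipelineModel) and remember that fit() hands back 
an object of a different class.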

{quote}
4) Wouldn't this proposal make it more complex to maintain code going forward?
{quote}
It actually makes the code significantly easier to maintain because it 
eliminates a lot of duplicated code.  The functionality duplicated between 
Estimator & Model (setters, schema validation, etc.) leads to inconsistencies 
and bugs; I found bugs in StringIndexer and RFormula while prototyping the 
merge of StringIndexer & StringIndexerModel.
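
The kind of duplication meant here can be sketched in plain Python.  This is 
not Spark's StringIndexer code: SplitIndexer, SplitIndexerModel and 
MergedIndexer are hypothetical names, labels are ordered alphabetically for 
simplicity, and having fit() return a fitted copy is an arbitrary choice for 
the sketch rather than what the design doc necessarily proposes.

```python
import copy

# Hypothetical stand-ins for illustration; not Spark's StringIndexer code.

class SplitIndexer:
    """Estimator half of a two-class design."""
    def __init__(self, input_col="category"):
        self.input_col = input_col

    def set_input_col(self, name):
        self.input_col = name
        return self

    def validate_schema(self, columns):
        if self.input_col not in columns:
            raise ValueError("missing input column: " + self.input_col)

    def fit(self, rows):
        self.validate_schema(rows[0].keys())
        labels = sorted({row[self.input_col] for row in rows})
        return SplitIndexerModel(labels, self.input_col)


class SplitIndexerModel:
    """Model half: the setter and schema check are written a second time."""
    def __init__(self, labels, input_col):
        self.labels = labels
        self.input_col = input_col

    def set_input_col(self, name):       # duplicate of SplitIndexer's setter
        self.input_col = name
        return self

    def validate_schema(self, columns):  # duplicate check that can drift
        if self.input_col not in columns:
            raise ValueError("missing input column: " + self.input_col)

    def transform(self, rows):
        self.validate_schema(rows[0].keys())
        return [self.labels.index(row[self.input_col]) for row in rows]


class MergedIndexer:
    """Single class: the setter and schema check exist exactly once."""
    def __init__(self, input_col="category"):
        self.input_col = input_col
        self.labels = None  # stays None until fit() has been called

    def set_input_col(self, name):
        self.input_col = name
        return self

    def validate_schema(self, columns):
        if self.input_col not in columns:
            raise ValueError("missing input column: " + self.input_col)

    def fit(self, rows):
        self.validate_schema(rows[0].keys())
        fitted = copy.copy(self)  # assumption: return a fitted copy
        fitted.labels = sorted({row[self.input_col] for row in rows})
        return fitted

    def transform(self, rows):
        if self.labels is None:
            raise ValueError("call fit() before transform()")
        self.validate_schema(rows[0].keys())
        return [self.labels.index(row[self.input_col]) for row in rows]


rows = [{"category": "b"}, {"category": "a"}, {"category": "b"}]
split_out = SplitIndexer().fit(rows).transform(rows)
merged_out = MergedIndexer().fit(rows).transform(rows)
```

Both versions produce the same indices, but in the split design a fix to 
set_input_col or validate_schema has to be applied in two places, which is 
where inconsistencies can creep in.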


> Merging Estimator & Model
> -------------------------
>
>                 Key: SPARK-14033
>                 URL: https://issues.apache.org/jira/browse/SPARK-14033
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>         Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as 
> scikit-learn).
> For details, please see the linked design doc.  Comment on this JIRA to give 
> feedback on the proposed design.  Once the proposal is discussed and this 
> work is confirmed as ready to proceed, this JIRA will serve as an umbrella 
> for the merge tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

