[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213274#comment-15213274 ]
Joseph K. Bradley commented on SPARK-14033:
-------------------------------------------

We've been using "MLlib" to refer to both spark.mllib and spark.ml. I think it's clearer to users to refer to spark.ml as the "DataFrame-based API" and spark.mllib as the "RDD-based API."

{quote}
1) What is lacking about the current spark documentation that makes this transition/onboarding difficult for users coming from scikit?
{quote}
--> The main complaints I've heard have been about either (a) finding the estimator in the doc, such as LogisticRegression, but not realizing immediately that they also need to check out the docs for LogisticRegressionModel or (b) doing the same with imports. I'm not quite sure how to make this clearer in the docs.

{quote}
2) Understanding the distinction between MLLib vs spark.ml is confusing at first, do you think this is perhaps part of the problem?
{quote}
--> It is a problem, but I think it's orthogonal.

{quote}
3) Can you give examples about what is unclear about the current semantics?
{quote}
I think the examples in the design doc about mutability & shared references within Pipelines are the best I've got. [~mengxr] might have more.

As far as distinguishing between Estimators and Models, it really depends on the user's background. I've given a lot of talks about Pipelines, and it's a bit tricky to explain how a Pipeline contains Estimators and Transformers, a Pipeline produces a PipelineModel, a PipelineModel contains only Models and Transformers, Models are a special type of Transformer, etc. This proposal does reduce the number of concepts.

{quote}
4) Wouldn't this proposal make it more complex to maintain code going forward?
{quote}
It actually makes it significantly easier to maintain because it eliminates a lot of duplicated code. This duplicated functionality between Estimator & Model (in setters, schema validation, etc.) leads to inconsistencies and bugs; I actually found bugs in StringIndexer and RFormula while prototyping the merge of StringIndexer & Model.

> Merging Estimator & Model
> -------------------------
>
>                 Key: SPARK-14033
>                 URL: https://issues.apache.org/jira/browse/SPARK-14033
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>         Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as scikit-learn).
> For details, please see the linked design doc. Comment on this JIRA to give feedback on the proposed design. Once the proposal is discussed and this work is confirmed as ready to proceed, this JIRA will serve as an umbrella for the merge tasks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
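Illustrative note on the concept hierarchy from answer 3 (a Pipeline contains Estimators and Transformers and produces a PipelineModel, which contains only Models and Transformers). The sketch below is a toy in plain Python, not Spark's actual classes or API: it works on lists of floats instead of DataFrames, and the stage names Scale, Center, and CenterModel are hypothetical, chosen only for illustration.

```python
from abc import ABC, abstractmethod


class Transformer(ABC):
    """Maps a dataset (here: a list of floats) to a new dataset."""

    @abstractmethod
    def transform(self, data):
        ...


class Model(Transformer):
    """A Transformer that was produced by fitting an Estimator."""


class Estimator(ABC):
    """Learns from data and returns a fitted Model."""

    @abstractmethod
    def fit(self, data):
        ...


class Scale(Transformer):
    """Hypothetical stateless stage: multiplies every value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]


class CenterModel(Model):
    """Hypothetical fitted stage: subtracts a mean learned during fit."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class Center(Estimator):
    """Hypothetical Estimator: learns the mean and returns a CenterModel."""

    def fit(self, data):
        return CenterModel(sum(data) / len(data))


class PipelineModel(Model):
    """Holds only Transformers (including Models) and applies them in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class Pipeline(Estimator):
    """Holds Estimators and Transformers; fitting yields a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                # Replace each Estimator with the Model it produces.
                stage = stage.fit(data)
            # Feed the transformed data to the next stage.
            data = stage.transform(data)
            fitted.append(stage)
        return PipelineModel(fitted)


pipeline = Pipeline([Scale(2.0), Center()])
pipeline_model = pipeline.fit([1.0, 2.0, 3.0])
result = pipeline_model.transform([1.0, 2.0, 3.0])
```

Even in this toy there are four abstract roles (Transformer, Model, Estimator, and the Pipeline/PipelineModel pair) before any real stage is written, and every learning stage needs two classes (Center and CenterModel). That pairing is where duplicated setters and validation logic would tend to accumulate.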