[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-04-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222364#comment-15222364
 ] 

Joseph K. Bradley commented on SPARK-14033:
---

I discussed this with [~mengxr] and [~matei], and we've decided to reject this 
proposal.  While there are slight benefits to users (slightly simpler API) and 
tangible benefits to developers (less code duplication), the pain of modifying 
the API will affect users too much for this to be worthwhile.

Thanks, everyone, for the feedback!

I'll close this JIRA.

> Merging Estimator & Model
> -
>
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as 
> scikit-learn).
> For details, please see the linked design doc.  Comment on this JIRA to give 
> feedback on the proposed design.  Once the proposal is discussed and this 
> work is confirmed as ready to proceed, this JIRA will serve as an umbrella 
> for the merge tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-03-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218495#comment-15218495
 ] 

Joseph K. Bradley commented on SPARK-14033:
---

That's a great point, and it's something which should be doable in either case. 
 The solution we're working towards is moving models outside of the Spark 
package, where the models within MLlib will inherit from those outside ones & 
extend them with operations on DataFrames.  That will be doable if we keep 
Estimator and Model separate, and also if we merge them.
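
A minimal sketch of that layering, in plain Scala with no Spark dependency. 
The class names LocalLinearModel and BatchLinearModel are hypothetical, and a 
Seq of rows stands in for a DataFrame; this only illustrates the inheritance 
idea described above, not the actual design.

```scala
// "Local" model: plain prediction logic that needs no Spark classes (hypothetical name).
class LocalLinearModel(val weights: Vector[Double], val intercept: Double) {
  def predict(features: Vector[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
}

// Stand-in for the MLlib-side model: it inherits predict and adds a batch operation.
// A real version would operate on a DataFrame; a Seq of rows is used here instead.
class BatchLinearModel(w: Vector[Double], b: Double) extends LocalLinearModel(w, b) {
  def transform(rows: Seq[Vector[Double]]): Seq[Double] = rows.map(predict)
}
```

Code that only needs to score single records could depend on the local class 
alone, regardless of whether Estimator and Model are merged.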




[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-03-29 Thread Stefan Krawczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217324#comment-15217324
 ] 

Stefan Krawczyk commented on SPARK-14033:
-

{quote}It actually makes it significantly easier to maintain because it 
eliminates a lot of duplicated code. This duplicated functionality between 
Estimator & Model (in setters, schema validation, etc.) leads to 
inconsistencies and bugs; I actually found bugs in StringIndexer and RFormula 
while prototyping the merge of StringIndexer & Model.{quote}
That smells like a different abstraction issue to me. But sure, I can see where 
you're coming from.

One argument for keeping them separate is that, from a dependency standpoint, 
it'd be advantageous to separate training code (estimators) from prediction 
code (models). That way you could package your trained models and use them 
without having to bring in all the training dependencies. 




[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-03-29 Thread Michael Zieliński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216756#comment-15216756
 ] 

Michael Zieliński commented on SPARK-14033:
---

Re: ML vs MLLib, I also think about it in terms of RDDs versus DataFrames.

Re: Estimator/Model, I prefer the current version, which preserves immutability 
to a larger degree. That said, maybe merging those concepts would make it 
easier for the next stage of a Pipeline to use outputs from the previous stage. 
Currently, if you have:

val a1 = new Estimator1()
val a2 = new Estimator2().setParamAbc(a1.getParamCde)

You can only get the members of Estimator1, not those of Estimator1Model. If 
they were the same class, it would make things easier. For example, you might 
want to take the top K variables from a Random Forest model as input to 
Logistic Regression. 
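
A rough sketch of the situation with separate classes, in plain Scala. The 
names Estimator1, Estimator1Model, Estimator2, topFeatures and setInputCols are 
hypothetical stand-ins rather than Spark APIs. The point is that the fitted 
values only exist on the object returned by fit, so the second stage can be 
configured only after the first has been fitted.

```scala
// Hypothetical stand-ins for spark.ml stages; none of these names are Spark APIs.
final case class Estimator1Model(topFeatures: Seq[String])

final class Estimator1(val k: Int) {
  // "Training" here just ranks features by a given importance score and keeps the top k.
  def fit(importances: Map[String, Double]): Estimator1Model =
    Estimator1Model(importances.toSeq.sortBy { case (_, score) => -score }.take(k).map(_._1))
}

final class Estimator2 {
  private var inputCols: Seq[String] = Seq.empty
  def setInputCols(cols: Seq[String]): this.type = { inputCols = cols; this }
  def getInputCols: Seq[String] = inputCols
}

object StageWiring {
  val importances = Map("age" -> 0.5, "income" -> 0.3, "zip" -> 0.1)
  val a1 = new Estimator1(k = 2)
  // a1 itself has no topFeatures; only the fitted model m1 does.
  val m1 = a1.fit(importances)
  val a2 = new Estimator2().setInputCols(m1.topFeatures)
}
```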






[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-03-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213274#comment-15213274
 ] 

Joseph K. Bradley commented on SPARK-14033:
---

We've been using "MLlib" to refer to both spark.mllib and spark.ml.  I think 
it's clearer to users to refer to spark.ml as the "DataFrame-based API" and 
spark.mllib as the "RDD-based API."

{quote}
1) What is lacking about the current spark documentation that makes this 
transition/onboarding difficult for users coming from scikit? 
{quote}
--> The main complaints I've heard have been about either (a) finding the 
estimator in the doc, such as LogisticRegression, but not realizing immediately 
that they also need to check out the docs for LogisticRegressionModel or (b) 
doing the same with imports.  I'm not quite sure how to make this clearer in 
the docs.

{quote}
2) Understanding the distinction between MLLib vs spark.ml is confusing at 
first, do you think this is perhaps part of the problem?
{quote}
--> It is a problem, but I think it's orthogonal.

{quote}
3) Can you give examples about what is unclear about the current semantics?
{quote}
I think the examples in the design doc about mutability & shared references 
within Pipelines are the best I've got.  [~mengxr] might have more.  As far as 
distinguishing between Estimators and Models, it really depends on the user's 
background.  I've given a lot of talks about Pipelines, and it's a bit tricky 
to explain how a Pipeline contains Estimators and Transformers, a Pipeline 
produces a PipelineModel, a PipelineModel contains only Models and 
Transformers, Models are a special type of Transformer, etc.  This proposal 
does reduce the number of concepts.
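
The relationships listed above can be sketched as a small type hierarchy. This 
is a simplified stand-in written in plain Scala: the real spark.ml classes 
operate on DataFrames and carry Params, while here a Seq[Double] plays the role 
of the dataset and the example stages (Doubler, Center, CenterModel) are 
hypothetical.

```scala
// Simplified stand-ins for the spark.ml concepts; the real classes use DataFrames.
trait PipelineStage
trait Transformer extends PipelineStage {
  def transform(data: Seq[Double]): Seq[Double]
}
trait Model extends Transformer // a Model is a special kind of Transformer
trait Estimator extends PipelineStage {
  def fit(data: Seq[Double]): Model
}

// A Pipeline may hold Estimators and Transformers; it is itself an Estimator.
final class Pipeline(stages: Seq[PipelineStage]) extends Estimator {
  def fit(data: Seq[Double]): PipelineModel = {
    // Fit each Estimator on the data as transformed by the stages before it.
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((done, current), stage) =>
        val next: Transformer = stage match {
          case e: Estimator    => e.fit(current)
          case tr: Transformer => tr
        }
        (done :+ next, next.transform(current))
    }
    new PipelineModel(fitted)
  }
}

// A PipelineModel holds only Transformers (including Models) and is itself a Model.
final class PipelineModel(val stages: Seq[Transformer]) extends Model {
  def transform(data: Seq[Double]): Seq[Double] =
    stages.foldLeft(data)((current, stage) => stage.transform(current))
}

// Example stages: a stateless Transformer and an Estimator that learns the mean.
object Doubler extends Transformer {
  def transform(data: Seq[Double]): Seq[Double] = data.map(_ * 2)
}
final class CenterModel(val mean: Double) extends Model {
  def transform(data: Seq[Double]): Seq[Double] = data.map(_ - mean)
}
object Center extends Estimator {
  def fit(data: Seq[Double]): CenterModel = new CenterModel(data.sum / data.size)
}
```

Fitting new Pipeline(Seq(Doubler, Center)) yields a PipelineModel whose stages 
are Doubler and a CenterModel, which is the Estimator-to-Model replacement 
described above.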

{quote}
4) Wouldn't this proposal make it more complex to maintain code going forward?
{quote}
It actually makes it significantly easier to maintain because it eliminates a 
lot of duplicated code.  This duplicated functionality between Estimator & 
Model (in setters, schema validation, etc.) leads to inconsistencies and bugs; 
I actually found bugs in StringIndexer and RFormula while prototyping the merge 
of StringIndexer & Model.





[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-03-24 Thread Daniel Siegmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210772#comment-15210772
 ] 

Daniel Siegmann commented on SPARK-14033:
-

Another thing to consider is that, as popular as scikit is, it isn't the only 
library in use. The Java API for liblinear (http://liblinear.bwaldvogel.de/), 
for example, is much closer to the current Spark ML approach. You call the 
train method to get a model object, and then you call predict with that model.

Regarding Stefan's point #2, I myself am still unclear what the distinction is 
between Spark ML and MLlib. I think it's also very easy for a newcomer to get 
confused when you have an MLlib class named "LogisticRegression" and a Spark ML 
class named the same (just for example).




[jira] [Commented] (SPARK-14033) Merging Estimator & Model

2016-03-24 Thread Stefan Krawczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210594#comment-15210594
 ] 

Stefan Krawczyk commented on SPARK-14033:
-

Nitpick: your document mentions MLLib, but really this is about spark.ml, right?

Questions:
1) What is lacking about the current spark documentation that makes this 
transition/onboarding difficult for users coming from scikit? 
2) Understanding the distinction between MLLib vs spark.ml is confusing at 
first, do you think this is perhaps part of the problem?
3) Can you give examples about what is unclear about the current semantics? I 
would argue the main concepts 
(http://spark.apache.org/docs/latest/ml-guide.html#main-concepts-in-pipelines) 
are quite crisp. I agree with [~daniel.siegmann.aol] here that this would make 
things less clear.
4) Wouldn't this proposal make it more complex to maintain code going forward? 
Since you're more tightly coupling training with prediction code? 

I agree technology adoption is important for an open source project to survive; 
however, I don't think this proposal will make machine learning simpler to use. 
I think the pipeline concept, with separate transforms and estimators, has 
already made good progress on this very point.




[jira] [Commented] (SPARK-14033) Merging Estimator, Model, & Transformer

2016-03-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207118#comment-15207118
 ] 

Joseph K. Bradley commented on SPARK-14033:
---

Academically speaking, I agree with you that there is a distinction between an 
Estimator and a Transformer.

Practically speaking, though, in my experience that distinction is not 
significant for most users.  If a new user wants to use Logistic Regression, 
they will look for LogisticRegression (and have reported being confused by 
finding the separate Estimator and Model classes).  If an expert wants to use 
it, then they will presumably have enough background knowledge to understand 
the semantics of the merged concepts.

This should also help users coming from other popular ML libraries like 
scikit-learn, which uses these merged semantics.

As a Scala user, I like the idea of complete immutability, but that will likely 
require much more code refactoring for users who have become used to Param 
setter methods modifying instances.

It will be good to know if the proposal will disrupt users' workflows.  I 
believe it should still work for existing workflows, with some minor code 
modifications.

> Merging Estimator, Model, & Transformer
> ---
>
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as 
> scikit-learn).
> For details, please see the linked design doc.  Comment on this JIRA to give 
> feedback on the proposed design.  Once the proposal is discussed and this 
> work is confirmed as ready to proceed, this JIRA will serve as an umbrella 
> for the merge tasks.






[jira] [Commented] (SPARK-14033) Merging Estimator, Model, & Transformer

2016-03-22 Thread Daniel Siegmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206455#comment-15206455
 ] 

Daniel Siegmann commented on SPARK-14033:
-

To me, the semantics of this proposal are _less_ clear. An estimator as a thing 
which produces a transformer is clearer to me than a self-configuring 
transformer. The current design creates a distinction between code which does 
the training (the estimator) and the code which does the scoring (the model, 
which is a transformer).

I also think there's a big difference between being able to mutate the 
hyper-parameters on an estimator and having the fit method modify the model 
parameters. If anything, I'd rather see the estimator be completely immutable.
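
To make the contrast concrete, here is a minimal sketch of the two styles in 
plain Scala. All class names are hypothetical, and the merged variant is only 
one possible reading of "fit modifies the model parameters" (in the style of 
scikit-learn), not the design from the attached document.

```scala
// Style 1 (separate classes): fit returns a new model and leaves the estimator untouched.
final class MeanEstimator {
  def fit(data: Seq[Double]): MeanModel = new MeanModel(data.sum / data.size)
}
final class MeanModel(val mean: Double) {
  def transform(data: Seq[Double]): Seq[Double] = data.map(_ - mean)
}

// Style 2 (merged): fit stores the learned state on the same object and returns it.
final class MergedMean {
  private var mean: Option[Double] = None
  def fit(data: Seq[Double]): this.type = { mean = Some(data.sum / data.size); this }
  def transform(data: Seq[Double]): Seq[Double] = {
    val m = mean.getOrElse(throw new IllegalStateException("fit must be called before transform"))
    data.map(_ - m)
  }
}
```

In the first style, two calls to fit give two independent models. In the 
second, every holder of the reference sees the result of the latest fit, which 
is the shared-reference concern raised in this thread.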




[jira] [Commented] (SPARK-14033) Merging Estimator, Model, & Transformer

2016-03-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204874#comment-15204874
 ] 

Joseph K. Bradley commented on SPARK-14033:
---

The Google design doc is identical to the attached PDF.
