[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222364#comment-15222364 ]

Joseph K. Bradley commented on SPARK-14033:
---

I discussed this with [~mengxr] and [~matei], and we've decided to reject this proposal. While there are slight benefits to users (slightly simpler API) and tangible benefits to developers (less code duplication), the pain of modifying the API will affect users too much for this to be worthwhile.

Thanks everyone for feedback! I'll close this JIRA.

> Merging Estimator & Model
> -
>
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as
> scikit-learn).
> For details, please see the linked design doc. Comment on this JIRA to give
> feedback on the proposed design. Once the proposal is discussed and this
> work is confirmed as ready to proceed, this JIRA will serve as an umbrella
> for the merge tasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218495#comment-15218495 ]

Joseph K. Bradley commented on SPARK-14033:
---

That's a great point, and it's something which should be doable in either case. The solution we're working towards is moving models outside of the Spark package, where the models within MLlib will inherit from those outside ones & extend them with operations on DataFrames. That will be doable if we keep Estimator and Model separate, and also if we merge them.
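The layering described in this comment (a dependency-free model, extended inside the library with dataset-level operations) might look roughly like the sketch below. Every name here is hypothetical and none of it is Spark API; `Rows` is only a stand-in for a DataFrame so that the example runs without Spark.

```scala
// Hypothetical sketch; none of these classes exist in Spark.
object LocalModelSketch {
  // "Outside" model: prediction logic only, with no Spark dependency.
  class LocalLinearModel(val weights: Vector[Double], val intercept: Double) {
    def predict(features: Vector[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  // Stand-in for a DataFrame: a plain sequence of feature rows (assumption).
  type Rows = Seq[Vector[Double]]

  // "Inside" model: inherits the local logic and adds a batch operation.
  class DatasetLinearModel(w: Vector[Double], b: Double)
      extends LocalLinearModel(w, b) {
    def transform(rows: Rows): Seq[Double] = rows.map(predict)
  }

  def main(args: Array[String]): Unit = {
    val model = new DatasetLinearModel(Vector(2.0, -1.0), 0.5)
    println(model.predict(Vector(1.0, 1.0)))
    println(model.transform(Seq(Vector(1.0, 0.0), Vector(0.0, 1.0))))
  }
}
```

In this arrangement a trained `LocalLinearModel` could be shipped on its own, which is the packaging benefit raised earlier in the thread, and the same inheritance would work whether or not Estimator and Model are merged.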
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217324#comment-15217324 ]

Stefan Krawczyk commented on SPARK-14033:
-

{quote}It actually makes it significantly easier to maintain because it eliminates a lot of duplicated code. This duplicated functionality between Estimator & Model (in setters, schema validation, etc.) leads to inconsistencies and bugs; I actually found bugs in StringIndexer and RFormula while prototyping the merge of StringIndexer & Model.{quote}

That smells like a different abstraction issue to me. But sure, I can see where you're coming from.

One argument for keeping them separate is that, from a dependency standpoint, it'd be advantageous to separate training code (estimators) from prediction code (models). That way you could package your trained models and use them without having to bring in all the training dependencies.
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216756#comment-15216756 ]

Michael Zieliński commented on SPARK-14033:
---

Re: ML vs MLLib, I also think about it in terms of RDDs versus DataFrames.

Re: Estimator/Model, I prefer the current version, which preserves immutability to a larger degree. That said, maybe merging those concepts would make it easier for the next stage of a Pipeline to use outputs from the previous stage. Currently, if you have:

    val a1 = new Estimator1()
    val a2 = new Estimator2().setParamAbc(a1.getParamCde)

you can only get the members from Estimator1, but not from Estimator1Model. If they were the same class, it would make things easier. As an example, you might want to take the top K variables from a Random Forest model as input to Logistic Regression.
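The limitation in this comment can be illustrated with a small, Spark-free sketch. All names are hypothetical, and per-column variance is used purely as a stand-in for Random Forest feature importance. The learned ranking exists only on the fitted model, so the next stage cannot be configured until `fit` has run.

```scala
// Hypothetical sketch; none of these classes exist in Spark.
object FittedStateSketch {
  type Rows = Seq[Vector[Double]]

  // Estimator: carries only the hyper-parameter k.
  class VarianceRanker(val k: Int) {
    // Training returns a separate object that holds the learned state.
    def fit(rows: Rows): VarianceRankerModel = {
      val numFeatures = rows.head.length
      val variances = (0 until numFeatures).map { j =>
        val column = rows.map(_(j))
        val mean = column.sum / column.size
        column.map(x => (x - mean) * (x - mean)).sum / column.size
      }
      // Indices of the k highest-variance columns, in ascending index order.
      val top = variances.zipWithIndex.sortBy { case (v, _) => -v }.take(k).map(_._2).sorted
      new VarianceRankerModel(top)
    }
  }

  // Model: the only place where topFeatures is available.
  class VarianceRankerModel(val topFeatures: Seq[Int])

  // Next stage: keeps only the given columns.
  class ColumnSelector(val columns: Seq[Int]) {
    def transform(rows: Rows): Rows = rows.map(r => columns.map(i => r(i)).toVector)
  }

  def demo(): Rows = {
    val data: Rows = Seq(Vector(1.0, 10.0, 5.0), Vector(1.0, 20.0, 6.0), Vector(1.0, 30.0, 7.0))
    val ranker = new VarianceRanker(k = 2)
    // ranker itself has no topFeatures; the selector has to wait for the fitted model.
    val rankerModel = ranker.fit(data)
    new ColumnSelector(rankerModel.topFeatures).transform(data)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

With a merged class, the selector could hold a reference to the ranker before training and read the ranking from that same object afterwards, at the cost of the mutability discussed elsewhere in the thread.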
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213274#comment-15213274 ]

Joseph K. Bradley commented on SPARK-14033:
---

We've been using "MLlib" to refer to both spark.mllib and spark.ml. I think it's clearer to users to refer to spark.ml as the "DataFrame-based API" and spark.mllib as the "RDD-based API."

{quote}
1) What is lacking about the current spark documentation that makes this transition/onboarding difficult for users coming from scikit?
{quote}
--> The main complaints I've heard have been about either (a) finding the estimator in the doc, such as LogisticRegression, but not realizing immediately that they also need to check out the docs for LogisticRegressionModel or (b) doing the same with imports. I'm not quite sure how to make this clearer in the docs.

{quote}
2) Understanding the distinction between MLLib vs spark.ml is confusing at first, do you think this is perhaps part of the problem?
{quote}
--> It is a problem, but I think it's orthogonal.

{quote}
3) Can you give examples about what is unclear about the current semantics?
{quote}
I think the examples in the design doc about mutability & shared references within Pipelines are the best I've got. [~mengxr] might have more. As far as distinguishing between Estimators and Models, it really depends on the user's background. I've given a lot of talks about Pipelines, and it's a bit tricky to explain how a Pipeline contains Estimators and Transformers, a Pipeline produces a PipelineModel, a PipelineModel contains only Models and Transformers, Models are a special type of Transformer, etc. This proposal does reduce the number of concepts.

{quote}
4) Wouldn't this proposal make it more complex to maintain code going forward?
{quote}
It actually makes it significantly easier to maintain because it eliminates a lot of duplicated code. This duplicated functionality between Estimator & Model (in setters, schema validation, etc.) leads to inconsistencies and bugs; I actually found bugs in StringIndexer and RFormula while prototyping the merge of StringIndexer & Model.
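The chain of concepts listed in the answer to question 3 can be written down as a toy type hierarchy. This is only a sketch under simplifying assumptions: the traits reuse the spark.ml concept names but are not the real classes, `Rows` stands in for a DataFrame, and `Doubler` and `Centerer` are invented stages.

```scala
// Toy hierarchy mirroring the concept names; not the actual spark.ml classes.
object PipelineConceptsSketch {
  type Rows = Seq[Double]

  sealed trait Stage
  trait Transformer extends Stage { def transform(rows: Rows): Rows }
  trait Model extends Transformer // a Model is a special kind of Transformer
  trait Estimator extends Stage { def fit(rows: Rows): Model }

  // A plain Transformer: nothing to learn.
  class Doubler extends Transformer {
    def transform(rows: Rows): Rows = rows.map(_ * 2)
  }

  // An Estimator/Model pair: learns the mean, then subtracts it.
  class Centerer extends Estimator {
    def fit(rows: Rows): CentererModel = new CentererModel(rows.sum / rows.size)
  }
  class CentererModel(val mean: Double) extends Model {
    def transform(rows: Rows): Rows = rows.map(_ - mean)
  }

  // A Pipeline holds Estimators and Transformers and is itself an Estimator.
  class Pipeline(val stages: Seq[Stage]) extends Estimator {
    def fit(rows: Rows): PipelineModel = {
      var current = rows
      val fitted: Seq[Transformer] = stages.map {
        case e: Estimator =>
          val m = e.fit(current)
          current = m.transform(current)
          m
        case t: Transformer =>
          current = t.transform(current)
          t
      }
      new PipelineModel(fitted)
    }
  }

  // A PipelineModel holds only Transformers (including Models).
  class PipelineModel(val stages: Seq[Transformer]) extends Model {
    def transform(rows: Rows): Rows = stages.foldLeft(rows)((r, s) => s.transform(r))
  }

  def demo(): PipelineModel =
    new Pipeline(Seq(new Doubler, new Centerer)).fit(Seq(1.0, 2.0, 3.0))

  def main(args: Array[String]): Unit = println(demo().transform(Seq(1.0, 2.0, 3.0)))
}
```

Even in this reduced form there are five concepts to explain (Transformer, Estimator, Model, Pipeline, PipelineModel), which is the teaching burden the comment describes.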
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210772#comment-15210772 ]

Daniel Siegmann commented on SPARK-14033:
-

Another thing to consider is that, as popular as scikit is, it isn't the only library in use. The Java API for liblinear (http://liblinear.bwaldvogel.de/), for example, is much closer to the current Spark ML approach. You call the train method to get a model object, and then you call predict with that model.

Regarding Stefan's point #2, I myself am still unclear what the distinction is between Spark ML and MLlib. I think it's also very easy for a newcomer to get confused when you have an MLlib class named "LogisticRegression" and a Spark ML class named the same (just for example).
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210594#comment-15210594 ]

Stefan Krawczyk commented on SPARK-14033:
-

Nitpick: your document mentions MLLib, but really this is about spark.ml, right?

Questions:
1) What is lacking about the current spark documentation that makes this transition/onboarding difficult for users coming from scikit?
2) Understanding the distinction between MLLib vs spark.ml is confusing at first, do you think this is perhaps part of the problem?
3) Can you give examples about what is unclear about the current semantics? I would argue the main concepts (http://spark.apache.org/docs/latest/ml-guide.html#main-concepts-in-pipelines) are quite crisp. I agree with [~daniel.siegmann.aol] here that this would make things less clear.
4) Wouldn't this proposal make it more complex to maintain code going forward, since you'd be coupling training code more tightly with prediction code?

I agree technology adoption is important for an open source project to survive; however, I don't think that this proposal will make machine learning simpler to use. I think the pipeline concept, with separate transforms and estimators, has made good progress toward addressing this very point.
[jira] [Commented] (SPARK-14033) Merging Estimator, Model, & Transformer
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207118#comment-15207118 ]

Joseph K. Bradley commented on SPARK-14033:
---

Academically speaking, I agree with you that there is a distinction between an Estimator and a Transformer. Practically speaking, though, in my experience that distinction is not significant for most users. If a new user wants to use Logistic Regression, they will look for LogisticRegression (and have reported being confused by finding the separate Estimator and Model classes). If an expert wants to use it, then they will presumably have enough background knowledge to understand the semantics of the merged concepts. This should also help users coming from other popular ML libraries like scikit-learn, which uses these merged semantics.

As a Scala user, I like the idea of complete immutability, but that will likely require much more code refactoring for users who have become used to Param setter methods modifying instances.

It will be good to know if the proposal will disrupt users' workflows. I believe it should still work for existing workflows, with some minor code modifications.

> Merging Estimator, Model, & Transformer
> ---
>
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Timothy Hunter
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as
> scikit-learn).
> For details, please see the linked design doc. Comment on this JIRA to give
> feedback on the proposed design. Once the proposal is discussed and this
> work is confirmed as ready to proceed, this JIRA will serve as an umbrella
> for the merge tasks.
[jira] [Commented] (SPARK-14033) Merging Estimator, Model, & Transformer
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206455#comment-15206455 ]

Daniel Siegmann commented on SPARK-14033:
-

To me, the semantics of this proposal are _less_ clear. An estimator as a thing which produces a transformer is clearer to me than a self-configuring transformer. The current design creates a distinction between code which does the training (the estimator) and the code which does the scoring (the model, which is a transformer).

I also think there's a big difference between being able to mutate the hyper-parameters on an estimator and having the fit method modify the model parameters. If anything, I'd rather see the estimator be completely immutable.
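The two styles contrasted in this comment can be put side by side in a small sketch. The class names are hypothetical and the "scaler" is an invented example; the merged variant only approximates scikit-learn-style semantics and is not taken from the design doc.

```scala
// Hypothetical sketch; none of these classes exist in Spark.
object MutabilitySketch {
  // Merged, mutable style: fit() stores the learned parameter on the same instance.
  class MergedScaler {
    var maxAbs: Option[Double] = None // learned parameter, empty until fit() runs
    def fit(data: Seq[Double]): this.type = {
      maxAbs = Some(data.map(x => math.abs(x)).max)
      this
    }
    def transform(data: Seq[Double]): Seq[Double] = {
      val m = maxAbs.getOrElse(sys.error("call fit first"))
      data.map(_ / m)
    }
  }

  // Separate, immutable style: setters return copies and fit() returns a new model.
  final case class ScalerEstimator(clip: Boolean = false) {
    def setClip(value: Boolean): ScalerEstimator = copy(clip = value)
    def fit(data: Seq[Double]): ScalerModel =
      ScalerModel(data.map(x => math.abs(x)).max, clip)
  }
  final case class ScalerModel(maxAbs: Double, clip: Boolean) {
    def transform(data: Seq[Double]): Seq[Double] = {
      val scaled = data.map(_ / maxAbs)
      if (clip) scaled.map(x => math.max(-1.0, math.min(1.0, x))) else scaled
    }
  }

  // Refitting a shared merged instance changes what every holder of it computes.
  def sharedReferenceDemo(): (Seq[Double], Seq[Double]) = {
    val shared = new MergedScaler
    val alias = shared // e.g. the same stage referenced from two pipelines
    val before = shared.fit(Seq(2.0, -4.0)).transform(Seq(2.0))
    shared.fit(Seq(8.0))
    (before, alias.transform(Seq(2.0)))
  }

  // Refitting an immutable estimator leaves earlier models untouched.
  def immutableDemo(): (Seq[Double], Seq[Double]) = {
    val estimator = ScalerEstimator()
    val first = estimator.fit(Seq(2.0, -4.0))
    val second = estimator.fit(Seq(8.0))
    (first.transform(Seq(2.0)), second.transform(Seq(2.0)))
  }

  def main(args: Array[String]): Unit = {
    println(sharedReferenceDemo())
    println(immutableDemo())
  }
}
```

In the merged variant the second `fit` silently changes the result seen through `alias`, whereas in the immutable variant each fitted model keeps the parameters it was trained with.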
[jira] [Commented] (SPARK-14033) Merging Estimator, Model, & Transformer
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204874#comment-15204874 ]

Joseph K. Bradley commented on SPARK-14033:
---

The Google design doc is identical to the attached PDF.