[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247582#comment-14247582 ]
Joseph K. Bradley commented on SPARK-3702:
------------------------------------------

I'm canceling my WIP PR for this since I have begun breaking that PR into smaller PRs. The WIP PR branch is in [my ml-api branch | https://github.com/jkbradley/spark/tree/ml-api].

Here's the description of the WIP PR:

This is a WIP effort to standardize abstractions and the developer API for prediction tasks (classification and regression) for the new ML API (org.apache.spark.ml).

* Please comment on:
** abstractions, class hierarchy
** functionality required by each abstraction
** naming of types and methods
** ease of use for developers
** ease of use for users migrating from org.apache.spark.mllib
* Please ignore for now:
** missing tests and examples
** private/public API (I will make more things private to ml after writing tests and examples.)
** style and other details
** the many TODO items noted in the code

Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some discussion on design, and [this design doc | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] for major design decisions.

This is not intended to cover all algorithms; e.g., one big missing item is porting the GeneralizedLinearModel class to the new API. But it hopefully lays a fair amount of groundwork.

I have included a limited number of concrete classes in this WIP PR, for purposes of illustration:
* LogisticRegression (edited, to show effects of abstract classes)
* NaiveBayes (simple, to show ease of use for developers)
* AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
** (Note discussion of strong vs. weak types for ensemble methods in design doc.)
** This implementation is very incomplete but illustrates using the abstractions.
* LinearRegression (example of Regressor, for completeness)
* evaluators (to provide default evaluators in the class hierarchy)
* IterativeSolver and IterativeEstimator (to expose iterative algorithms)
* LabeledPoint (Q: Should this include an instance weight?)

Items remaining:
- [ ] helper method for simulating a distribution over weighted instances by subsampling (for algorithms which do not support instance weights)
- [ ] several TODO items noted in the code
- [ ] add tests and examples
- [ ] general cleanup
- [ ] make more of hierarchy private to ml
- [ ] split into several smaller PRs

General plan for splitting into multiple PRs, in order:
1. Simple class hierarchy
2. Evaluators
3. IterativeEstimator
4. AdaBoost
5. NaiveBayes (Any time after Evaluators)

Thanks to @epahomov and @BigCrunsh for input, including from [https://github.com/apache/spark/pull/2137] which improves upon the org.apache.spark.mllib APIs.

> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks
> of subtasks). See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy |
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
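
Illustrative note (not part of the original comment): the kind of hierarchy described above, together with the "simulate weighted instances by subsampling" helper listed under remaining items, might be sketched roughly as follows. This is a plain-Python toy under stated assumptions, not the org.apache.spark.ml API. Every name here (Estimator, Model, Classifier, Regressor, MajorityClassifier, MeanRegressor, fit_weighted, accuracy) is hypothetical and chosen only for illustration, and the (label, features) tuple merely stands in for LabeledPoint.

```python
import random
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical stand-in for LabeledPoint: (label, features). Not Spark's class.
Instance = Tuple[float, Sequence[float]]


class Model(ABC):
    """Result of fitting an Estimator; maps features to a prediction."""

    @abstractmethod
    def predict(self, features: Sequence[float]) -> float:
        ...


class Estimator(ABC):
    """A learning algorithm; fit() produces a Model."""

    @abstractmethod
    def fit(self, data: List[Instance]) -> Model:
        ...


class Classifier(Estimator):
    """Base class for estimators whose models predict class labels."""


class Regressor(Estimator):
    """Base class for estimators whose models predict real values."""


class ConstantModel(Model):
    """Toy model that ignores the features and always returns one value."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, features: Sequence[float]) -> float:
        return self.value


class MajorityClassifier(Classifier):
    """Toy classifier: predicts the most frequent training label."""

    def fit(self, data: List[Instance]) -> Model:
        counts = Counter(label for label, _ in data)
        return ConstantModel(counts.most_common(1)[0][0])


class MeanRegressor(Regressor):
    """Toy regressor: predicts the mean training label."""

    def fit(self, data: List[Instance]) -> Model:
        return ConstantModel(sum(label for label, _ in data) / len(data))


def fit_weighted(estimator: Estimator, data: List[Instance],
                 weights: Sequence[float], seed: int = 0) -> Model:
    """Approximate instance weights for an estimator that ignores them.

    Draws len(data) instances with replacement, each with probability
    proportional to its weight, then fits on the resampled data.
    """
    rng = random.Random(seed)
    resampled = rng.choices(data, weights=weights, k=len(data))
    return estimator.fit(resampled)


def accuracy(model: Model, data: List[Instance]) -> float:
    """Generic evaluator: depends only on the Model interface."""
    correct = sum(1 for label, x in data if model.predict(x) == label)
    return correct / len(data)
```

Because fit_weighted and accuracy depend only on the abstract Estimator and Model interfaces, a meta-algorithm such as boosting could in principle reuse them with any concrete learner, which is the sort of reuse the goals above describe.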