[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247582#comment-14247582
 ] 

Joseph K. Bradley commented on SPARK-3702:
------------------------------------------

I'm canceling my WIP PR for this since I have begun breaking that PR into 
smaller PRs.
The WIP PR branch is in [my ml-api branch | 
https://github.com/jkbradley/spark/tree/ml-api].

Here's the description of the WIP PR:

This is a WIP effort to standardize the abstractions and developer API for 
prediction tasks (classification and regression) in the new ML API 
(org.apache.spark.ml).
* Please comment on:
** abstractions, class hierarchy
** functionality required by each abstraction
** naming of types and methods
** ease of use for developers
** ease of use for users migrating from org.apache.spark.mllib
* Please ignore for now:
** missing tests and examples
** private/public API (I will make more things private to ml after writing 
tests and examples.)
** style and other details
** the many TODO items noted in the code

Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some 
discussion on design, and [this design doc | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]
 for major design decisions.

This is not intended to cover all algorithms; e.g., one big missing item is 
porting the GeneralizedLinearModel class to the new API.  But it hopefully lays 
a fair amount of groundwork.

I have included a limited number of concrete classes in this WIP PR, for 
purposes of illustration:
* LogisticRegression (edited, to show effects of abstract classes)
* NaiveBayes (a simple algorithm, to show ease of use for developers)
* AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
** (Note the discussion of strong vs. weak types for ensemble methods in the 
design doc.)
** This implementation is very incomplete, but it illustrates how to use the 
abstractions.
* LinearRegression (example of Regressor, for completeness)
* evaluators (to provide default evaluators in the class hierarchy)
* IterativeSolver and IterativeEstimator (to expose iterative algorithms)
* LabeledPoint (Q: Should this include an instance weight?)
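
To make the shape of such a hierarchy concrete, here is a rough, Spark-free sketch in plain Scala. It is not the code in the ml-api branch: every trait, class, and method name below (PredictionModel, Predictor, ConstantModel, WeightedMeanRegressor, AveragingEnsemble, Evaluators) is a hypothetical stand-in, and LabeledPoint is redefined locally with an optional weight only to illustrate the open question above.

```scala
// Hypothetical, Spark-free sketch: every name and signature here is an
// assumption for illustration, not the API in the ml-api branch.

// Labeled example; the optional instance weight mirrors the open question above.
final case class LabeledPoint(label: Double, features: Vector[Double], weight: Double = 1.0)

// A fitted model that maps a feature vector to a prediction.
trait PredictionModel {
  def predict(features: Vector[Double]): Double
}

// A learning algorithm that produces a model of type M.
// Covariance lets a Predictor[ConstantModel] be used as a Predictor[PredictionModel].
trait Predictor[+M <: PredictionModel] {
  def fit(data: Seq[LabeledPoint]): M
}

// Trivial regressor: always predicts the weighted mean of the training labels.
final class ConstantModel(val value: Double) extends PredictionModel {
  def predict(features: Vector[Double]): Double = value
}

object WeightedMeanRegressor extends Predictor[ConstantModel] {
  def fit(data: Seq[LabeledPoint]): ConstantModel = {
    val totalWeight = data.map(_.weight).sum
    require(totalWeight > 0.0, "total instance weight must be positive")
    new ConstantModel(data.map(p => p.label * p.weight).sum / totalWeight)
  }
}

// Meta-algorithm: fits every base predictor and averages their predictions.
// It depends only on the abstractions, not on any concrete algorithm.
final class AveragingModel(members: Seq[PredictionModel]) extends PredictionModel {
  def predict(features: Vector[Double]): Double =
    members.map(_.predict(features)).sum / members.size
}

final class AveragingEnsemble(bases: Seq[Predictor[PredictionModel]])
    extends Predictor[AveragingModel] {
  require(bases.nonEmpty, "need at least one base predictor")
  def fit(data: Seq[LabeledPoint]): AveragingModel =
    new AveragingModel(bases.map(_.fit(data)))
}

// Generic functionality: an evaluator that works with any PredictionModel.
object Evaluators {
  // Unweighted mean squared error over the given data.
  def meanSquaredError(model: PredictionModel, data: Seq[LabeledPoint]): Double = {
    require(data.nonEmpty, "cannot evaluate on empty data")
    data.map { p =>
      val error = model.predict(p.features) - p.label
      error * error
    }.sum / data.size
  }
}
```

The point of the sketch is only that the ensemble and the evaluator are written once against the abstract types, so any new algorithm that implements the two traits can be plugged into them without extra code.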

Items remaining:
- [ ] helper method for simulating a distribution over weighted instances by 
subsampling (for algorithms which do not support instance weights)
- [ ] several TODO items noted in the code
- [ ] add tests and examples
- [ ] general cleanup
- [ ] make more of hierarchy private to ml
- [ ] split into several smaller PRs
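
As an illustration of the first remaining item, one possible way to simulate a distribution over weighted instances is to resample with replacement, drawing each instance with probability proportional to its weight. The sketch below works on plain Scala collections rather than RDDs; the object name WeightedSubsampling, the resample method, and its parameters are all hypothetical and are not taken from the WIP PR.

```scala
import scala.util.Random

// Hypothetical helper, not from the WIP PR: simulates instance weights for
// algorithms that only accept unweighted data.
object WeightedSubsampling {

  /** Draws `n` items with replacement; each draw picks an item with
    * probability proportional to its (non-negative) weight.
    */
  def resample[A](items: IndexedSeq[(A, Double)], n: Int, rng: Random): IndexedSeq[A] = {
    require(n >= 0, "sample size must be non-negative")
    require(items.nonEmpty, "cannot resample from an empty collection")
    require(items.forall(_._2 >= 0.0), "weights must be non-negative")

    // Running totals of the weights: cumulative(i) = w(0) + ... + w(i).
    val cumulative = items.scanLeft(0.0)(_ + _._2).tail
    val total = cumulative.last
    require(total > 0.0, "at least one weight must be positive")

    IndexedSeq.fill(n) {
      val u = rng.nextDouble() * total
      // First index whose running total exceeds u; zero-weight items are never hit.
      val idx = cumulative.indexWhere(_ > u)
      // Fall back to the last item if rounding pushes u up to the total.
      items(if (idx >= 0) idx else items.size - 1)._1
    }
  }
}
```

The linear scan per draw keeps the sketch short; a binary search over the running totals would be the obvious refinement for large inputs, and a distributed version would need a different approach altogether.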

General plan for splitting into multiple PRs, in order:
1. Simple class hierarchy
2. Evaluators
3. IterativeEstimator
4. AdaBoost
5. NaiveBayes (can go in any time after Evaluators)

Thanks to @epahomov and @BigCrunsh for input, including from 
[https://github.com/apache/spark/pull/2137] which improves upon the 
org.apache.spark.mllib APIs.


> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
>                 Key: SPARK-3702
>                 URL: https://issues.apache.org/jira/browse/SPARK-3702
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



