[ 
https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5995:
-------------------------------------
    Description: 
Previously, some Developer APIs were added to spark.ml for classification and 
regression to make it easier to add new algorithms and models: [SPARK-4789].  
There are ongoing discussions about the best design of the API.  This JIRA is 
to continue that discussion and try to finalize those Developer APIs so that 
they can be made public.

Please see [this design doc from SPARK-4789 | 
https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
 for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
** Proposal: No
* Should the strongly typed API for transform() be public (vs. protected)?
** Proposal: Protected for now
* What transformation methods should the API make developers implement for 
classification?
** Proposal: See design doc
* Should there be a way to transform a single Row (instead of only DataFrames)?
** Proposal: Not for now
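
To make the proposals above concrete, here is a minimal sketch in plain Scala. It is not the spark.ml code: Dataset, SketchModel, and ThresholdModel are hypothetical stand-ins, and a "dataset" is modeled as a sequence of untyped rows rather than a DataFrame. It only illustrates the proposed shape: the public transform() is untyped, while the strongly typed per-instance hook stays protected, so there is no public single-row API.

```scala
object ApiShapeSketch {
  // Hypothetical stand-in for a DataFrame: rows of named, untyped columns.
  type Dataset = Seq[Map[String, Any]]

  abstract class SketchModel[F] {
    // Public entry point is untyped, like a DataFrame-based transform().
    def transform(dataset: Dataset): Dataset =
      dataset.map { row =>
        val features = row("features").asInstanceOf[F]
        row + ("prediction" -> predict(features))
      }

    // Strongly typed hook: developers implement it, but because it is
    // protected, callers cannot use it to predict on a single row.
    protected def predict(features: F): Double
  }

  // Toy model (an assumption for illustration): predicts 1.0 when the
  // feature sum exceeds the threshold, else 0.0.
  class ThresholdModel(threshold: Double) extends SketchModel[Vector[Double]] {
    override protected def predict(features: Vector[Double]): Double =
      if (features.sum > threshold) 1.0 else 0.0
  }

  val input: Dataset = Seq(
    Map("features" -> Vector(0.2, 0.1)),
    Map("features" -> Vector(0.9, 0.8)))

  val output: Dataset = new ThresholdModel(1.0).transform(input)
}
```

Making predict() public later would be a compatible change, whereas narrowing a public method to protected would not, which is one reason to start with "protected for now".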

  was:
Previously, some Developer APIs were added to spark.ml for classification and 
regression to make it easier to add new algorithms and models: [SPARK-4789].  
There are ongoing discussions about the best design of the API.  This JIRA is 
to continue that discussion and try to finalize those Developer APIs so that 
they can be made public.

Please see [this design doc from SPARK-4789 | 
https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
 for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
* Should the strongly typed API for transform() be public (vs. protected)?
* What transformation methods should the API make developers implement for 
classification?  (See details below.)
* Should there be a way to transform a single Row (instead of only DataFrames)?

More on "What transformation methods should the API make developers implement 
for classification?":
* Goals:
** Optimize transform: Make it fast, and make it output only the desired 
columns.
** Easy development
** Support Classifier, Regressor, and ProbabilisticClassifier
* Option 1 (current design): Developers implement a predictX method for each 
output column X.  They may override transform() to optimize speed.
** Pros: predictX is easy to understand.
** Cons: An optimized transform() is tedious to write.
* Option 2: Developers implement more basic transformation methods, such as 
features2raw, raw2pred, and raw2prob.
** Pros: Abstract classes can implement an optimized transform().
** Cons: Different types of predictors require different methods:
*** Predictor and Regressor: features2pred
*** Classifier: features2raw, raw2pred
*** ProbabilisticClassifier: raw2prob
* Option 3: Developers implement a single predict() method that takes 
parameters specifying which columns to output (returning a tuple or some type 
with None for missing values).  Abstract classes take the outputs they want 
and put them into columns.
** Pros: Developers write only one method and can optimize it as much as they 
want.  It could be more optimized than the previous two options; e.g., if 
LogisticRegressionModel only wants the prediction, then it never has to 
construct intermediate results such as the vector of raw predictions.
** Cons: predict() will have a different signature for each abstraction, 
depending on its possible output columns.
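
As a rough illustration of the trade-off, the sketch below contrasts the building-block approach (features2raw, raw2pred, raw2prob) with the single predict() approach for a binary logistic model. It is plain Scala with no Spark dependency. The method names features2raw, raw2pred, raw2prob, and predict come from the list above; everything else (Vec, ComposedClassifier, ComposedLogistic, SinglePredictLogistic, Outputs, transformRow, the want* flags, and the default argmax in raw2pred) is a hypothetical name or assumption chosen for the example.

```scala
object TransformOptionsSketch {
  type Vec = Vector[Double]

  private def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  private def sigmoid(m: Double): Double = 1.0 / (1.0 + math.exp(-m))

  // Building-block approach: developers supply small pieces and the
  // abstract class composes them into one shared transform step.
  abstract class ComposedClassifier {
    def features2raw(features: Vec): Vec
    def raw2prob(raw: Vec): Vec
    // Assumed default: predict the index of the largest raw score.
    def raw2pred(raw: Vec): Double = raw.indexOf(raw.max).toDouble

    // The raw vector is computed once, then reused for each output.
    final def transformRow(features: Vec, wantProb: Boolean): (Double, Option[Vec]) = {
      val raw = features2raw(features)
      (raw2pred(raw), if (wantProb) Some(raw2prob(raw)) else None)
    }
  }

  class ComposedLogistic(weights: Vec) extends ComposedClassifier {
    def features2raw(features: Vec): Vec = {
      val m = dot(weights, features)
      Vector(-m, m)
    }
    def raw2prob(raw: Vec): Vec = {
      val p = sigmoid(raw(1))
      Vector(1.0 - p, p)
    }
  }

  // Single-method approach: predict() is told which outputs are wanted
  // and returns None for the rest.
  final case class Outputs(
      raw: Option[Vec],
      probability: Option[Vec],
      prediction: Option[Double])

  class SinglePredictLogistic(weights: Vec) {
    def predict(features: Vec, wantRaw: Boolean, wantProb: Boolean,
                wantPred: Boolean): Outputs = {
      val m = dot(weights, features)
      val prob =
        if (wantProb) { val p = sigmoid(m); Some(Vector(1.0 - p, p)) } else None
      Outputs(
        raw = if (wantRaw) Some(Vector(-m, m)) else None,
        probability = prob,
        // The prediction needs only the margin, not the raw vector.
        prediction = if (wantPred) Some(if (m > 0.0) 1.0 else 0.0) else None)
    }
  }

  val weights: Vec = Vector(1.0, -1.0)
  val features: Vec = Vector(2.0, 0.5)

  val composed = new ComposedLogistic(weights).transformRow(features, wantProb = false)
  val single = new SinglePredictLogistic(weights)
    .predict(features, wantRaw = false, wantProb = false, wantPred = true)
}
```

In this sketch the composed version still allocates the raw vector when only the prediction is requested, while the single predict() skips it. The cost is that the parameter list and the Outputs type would have to differ between Regressor, Classifier, and ProbabilisticClassifier.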



> Make ML Prediction Developer APIs public
> ----------------------------------------
>
>                 Key: SPARK-5995
>                 URL: https://issues.apache.org/jira/browse/SPARK-5995
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
