[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887421#comment-15887421 ]

yuhao yang edited comment on SPARK-19747 at 2/28/17 9:58 PM:
-------------------------------------------------------------

I did notice the code duplication while implementing LinearSVC. Glad to see 
you've already started on this. 

While the "loss" can clearly be extracted, we could perhaps also make the 
framework more generic and support interchangeable penalty, learning_rate, or 
even optimizer (sketch after the list below).

* "penalty" (‘none’, ‘l2’, ‘l1’, or ‘elasticnet’),

*  "learning_rate"
**  ‘constant’: eta = eta0
**  ‘optimal’: eta = 1.0 / (alpha * (t + t0)) 
**  ‘invscaling’: eta = eta0 / pow(t, power_t)

* optimizer
** SGD
** LBFGS
** OWLQN etc.
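
Roughly, the pluggable pieces might look like this minimal sketch (plain Scala; StepSize, Penalty and the concrete classes are placeholder names for illustration, not an existing API):

{code:scala}
// Hypothetical, simplified interfaces -- StepSize, Penalty and the concrete
// case classes below are placeholder names, not an existing Spark API.
trait StepSize extends Serializable {
  /** Step size eta at iteration t (t is assumed to start at 1). */
  def eta(t: Long): Double
}

/** 'constant': eta = eta0 */
case class ConstantStep(eta0: Double) extends StepSize {
  override def eta(t: Long): Double = eta0
}

/** 'optimal': eta = 1.0 / (alpha * (t + t0)) */
case class OptimalStep(alpha: Double, t0: Double) extends StepSize {
  override def eta(t: Long): Double = 1.0 / (alpha * (t + t0))
}

/** 'invscaling': eta = eta0 / pow(t, power_t) */
case class InvScalingStep(eta0: Double, powerT: Double) extends StepSize {
  override def eta(t: Long): Double = eta0 / math.pow(t.toDouble, powerT)
}

/** Pluggable penalty: contributes the regularization gradient for the coefficients. */
trait Penalty extends Serializable {
  def gradient(coefficients: Array[Double]): Array[Double]
}

/** 'l2': the gradient of 0.5 * lambda * ||w||^2 is lambda * w. */
case class L2Penalty(lambda: Double) extends Penalty {
  override def gradient(coefficients: Array[Double]): Array[Double] =
    coefficients.map(_ * lambda)
}
{code}

A generic SGD-style trainer could then accept any combination of loss, Penalty and StepSize, which mirrors the loss / penalty / learning_rate knobs that scikit-learn's SGDClassifier exposes.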

Once the generic framework is developed, we can gradually migrate the existing 
implementations. I was working on a generic SGDClassifier, but there are some 
tricky issues around feature standardization, the intercept, and multi-class 
support (one of them is illustrated below).
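
To show one of those issues concretely, here is a minimal sketch of the coefficient/intercept conversion back from a standardized feature space, assuming plain (x - mu) / sigma standardization; the helper name is made up:

{code:scala}
// Illustration of one tricky point: a model fit on standardized features
// x' = (x - mu) / sigma has to have its coefficients and intercept mapped
// back to the original feature space. Plain mean/std standardization is
// assumed here for simplicity; the helper name is hypothetical.
def unstandardize(
    coefStd: Array[Double],   // coefficients learned in the standardized space
    interceptStd: Double,
    mu: Array[Double],        // per-feature means
    sigma: Array[Double]      // per-feature standard deviations
  ): (Array[Double], Double) = {
  // w_j = w'_j / sigma_j (constant features get coefficient 0)
  val coef = coefStd.zip(sigma).map { case (w, s) => if (s != 0.0) w / s else 0.0 }
  // b = b' - sum_j w_j * mu_j
  val intercept = interceptStd - coef.zip(mu).map { case (w, m) => w * m }.sum
  (coef, intercept)
}
{code}

Doing this uniformly across algorithms is part of what makes a fully generic SGDClassifier tricky, especially since centering sparse features would destroy sparsity, so implementations often scale without centering.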


> Consolidate code in ML aggregators
> ----------------------------------
>
>                 Key: SPARK-19747
>                 URL: https://issues.apache.org/jira/browse/SPARK-19747
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Seth Hendrickson
>            Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and 
> bug-free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.
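
For illustration, here is a rough, self-contained sketch of the shape the two points above could take. It is only a sketch of the idea, not Spark's actual API: the names (Instance, LossAggregator, LeastSquaresAggregator, GenericCostFun) are made up, and plain Arrays stand in for ML Vectors and RDDs.

{code:scala}
// Illustrative sketch only -- these are placeholder names, not Spark classes.
case class Instance(label: Double, weight: Double, features: Array[Double])

/**
 * Shared parent holding everything the current aggregators duplicate:
 * the running loss, weight sum, gradient buffer and the combOp (merge).
 * Concrete aggregators only implement add() (the seqOp).
 */
abstract class LossAggregator[Agg <: LossAggregator[Agg]](numCoefficients: Int)
  extends Serializable { self: Agg =>

  protected val gradientSum: Array[Double] = Array.ofDim[Double](numCoefficients)
  protected var lossSum: Double = 0.0
  protected var weightSum: Double = 0.0

  /** seqOp: fold a single instance into the running loss and gradient. */
  def add(instance: Instance): Agg

  /** combOp: merge another partition's partial results into this one. */
  def merge(other: Agg): Agg = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    var i = 0
    while (i < gradientSum.length) { gradientSum(i) += other.gradientSum(i); i += 1 }
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}

/** Example concrete aggregator: weighted least-squares loss. */
class LeastSquaresAggregator(coefficients: Array[Double])
  extends LossAggregator[LeastSquaresAggregator](coefficients.length) {

  override def add(instance: Instance): LeastSquaresAggregator = {
    val margin = instance.features.zip(coefficients).map { case (x, w) => x * w }.sum
    val error = margin - instance.label
    lossSum += instance.weight * 0.5 * error * error
    var i = 0
    while (i < gradientSum.length) {
      gradientSum(i) += instance.weight * error * instance.features(i); i += 1
    }
    weightSum += instance.weight
    this
  }
}

/**
 * One generic cost function parameterized by the aggregator type, so the
 * aggregation boilerplate lives in a single place instead of per algorithm.
 * A local foldLeft stands in for RDD.treeAggregate(zero)(seqOp, combOp).
 */
class GenericCostFun[Agg <: LossAggregator[Agg]](
    instances: Seq[Instance],
    newAggregator: Array[Double] => Agg) {

  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val agg = instances.foldLeft(newAggregator(coefficients))((a, inst) => a.add(inst))
    (agg.loss, agg.gradient)
  }
}
{code}

With this shape, adding a new algorithm means writing one small add() method; the loss/gradient bookkeeping and the aggregation plumbing are written once.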


