[ 
https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395131#comment-14395131
 ] 

DB Tsai edited comment on SPARK-6683 at 4/3/15 9:44 PM:
--------------------------------------------------------

Squared error will take slightly more work than logistic regression, since 
scaling the objective function changes the behavior of the regularization, 
whereas logistic regression is unchanged under scaling. I will update the 
LiR-with-elastic-net PR with proper scaling to support sparse input by this 
weekend, and we can discuss from there. The API should look like R's, and 
the behavior should give the same solution, which is pretty simple to test 
and verify. 
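
One way to see the first point (a rough sketch of the argument, not taken 
from the thread): multiplying the loss term by a constant is equivalent to 
dividing the regularization parameter by that constant,

\[
\min_w \; c\,L(w) + \lambda\,R(w)
\quad\Longleftrightarrow\quad
\min_w \; L(w) + \tfrac{\lambda}{c}\,R(w),
\]

so any constant factor c that rescaling the squared-error objective picks up 
(for example from standardizing the labels) effectively changes the 
regularization strength unless \lambda is rescaled as well, while the 
logistic loss picks up no such factor.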


> Handling feature scaling properly for GLMs
> ------------------------------------------
>
>                 Key: SPARK-6683
>                 URL: https://issues.apache.org/jira/browse/SPARK-6683
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
> * improves optimization behavior (essentially always improves behavior in 
> practice)
> * changes the optimal solution (often for the better in terms of 
> standardizing feature importance)
> Current problems:
> * Inefficient implementation: We make a rescaled copy of the data.
> * Surprising API: For algorithms which use feature scaling, users may get 
> different solutions than with R or other libraries.  (Note: Feature scaling 
> could be handled without changing the solution.)
> * Inconsistent API: Not all algorithms have the same default for feature 
> scaling, and not all expose the option.
> This is a proposal discussed with [~mengxr] for an "ideal" solution.  This 
> solution will require some breaking API changes, but I'd argue these are 
> necessary for the long-term since it's the best API we have thought of.
> Proposal:
> * Implementation: Change to avoid making a rescaled copy of the data 
> (described below).  No API issues here.
> * API:
> ** Hide featureScaling from API. (breaking change)
> ** Internally, handle feature scaling to improve optimization, but modify it 
> so it does not change the optimal solution. (breaking change, in terms of 
> algorithm behavior)
> ** Externally, users who want to rescale features (to change the solution) 
> should do that scaling as a preprocessing step.
> Details on implementation:
> * GradientDescent could instead scale the step size separately for each 
> feature (and adjust regularization as needed; see the PR linked above).  This 
> would require storing a vector of length numFeatures, rather than making a 
> full copy of the data.
> * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in 
> here.
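
A rough sketch of the per-feature step-size idea from the GradientDescent 
bullet above (illustrative only: ScaledGD, featureStd, and fit are made-up 
names, this is not Spark's actual implementation, and the "adjust 
regularization as needed" part is omitted):

object ScaledGD {

  // Squared-error gradient for one example, computed on the ORIGINAL
  // (unscaled) features.
  private def gradient(x: Array[Double], y: Double,
                       w: Array[Double]): Array[Double] = {
    val pred = x.zip(w).map { case (xi, wi) => xi * wi }.sum
    x.map(_ * (pred - y))
  }

  // Gradient descent where each coordinate's step is divided by that
  // feature's variance.  This mimics optimizing on standardized features
  // while storing only one vector of length numFeatures (featureStd)
  // instead of a rescaled copy of the data set.
  def fit(data: Seq[(Array[Double], Double)],
          featureStd: Array[Double],
          stepSize: Double,
          numIters: Int): Array[Double] = {
    val n = data.size.toDouble
    val w = Array.fill(featureStd.length)(0.0)
    for (_ <- 0 until numIters) {
      // Average gradient over the data, still in the original feature space.
      val grad = Array.fill(w.length)(0.0)
      data.foreach { case (x, y) =>
        val g = gradient(x, y, w)
        for (j <- w.indices) grad(j) += g(j) / n
      }
      // Dividing coordinate j's step by featureStd(j)^2 is equivalent to
      // running plain gradient descent on features divided by featureStd
      // and then mapping the solution back to the original scale.
      for (j <- w.indices) {
        val s = featureStd(j)
        if (s != 0.0) w(j) -= stepSize * grad(j) / (s * s)
      }
    }
    w
  }
}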


