[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

dbtsai Tue, 12 Aug 2014 18:27:06 -0700

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1897#discussion_r16153527
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
    @@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
           throw new SparkException("Input validation failed.")
         }
     
    +    /**
    +     * Scaling to minimize the condition number:
    +     *
    +     * During the optimization process, the convergence (rate) depends on the condition number of
    +     * the training dataset. Scaling the variables often reduces this condition number, thus
    +     * improving the convergence rate dramatically. Without reducing the condition number,
    +     * some training datasets mixing the columns with different scales may not be able to converge.
    +     *
    +     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
    +     * the weights in the original scale.
    +     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    +     *
    +     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
    +     * the variance of each column (without subtracting the mean), and train the model in the
    +     * scaled space. Then we transform the coefficients from the scaled space to the original scale
    +     * as GLMNET and LIBSVM do.
    +     *
    +     * Currently, it's only enabled in LogisticRegressionWithLBFGS
    +     */
    +    val scaler = if (useFeatureScaling) {
    +      (new StandardScaler).fit(input.map(x => x.features))
    +    } else {
    +      null
    +    }
    +
         // Prepend an extra variable consisting of all 1.0's for the intercept.
         val data = if (addIntercept) {
    -      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      if(useFeatureScaling) {
    +        input.map(labeledPoint =>
    +          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
    +      }
         } else {
    -      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    +      if (useFeatureScaling) {
    +        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
    +      } else {
    +        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
    --- End diff --
    
    It's not an identity map. It converts each labeledPoint into a tuple of the response (label) and the feature vector, which is the form the optimizer expects.



