spark git commit: [SPARK-5802][MLLIB] cache transformed data in glm

meng Mon, 16 Feb 2015 22:10:02 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-1.3 d0701d9bf -> dfe0fa01c



[SPARK-5802][MLLIB] cache transformed data in glm

If we need to transform the input data, we should cache the output to avoid 
re-computing feature vectors every iteration. dbtsai

Author: Xiangrui Meng <m...@databricks.com>

Closes #4593 from mengxr/SPARK-5802 and squashes the following commits:

ae3be84 [Xiangrui Meng] cache transformed data in glm

(cherry picked from commit fd84229e2aeb6a03760703c9dccd2db853779400)
Signed-off-by: Xiangrui Meng <m...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dfe0fa01
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dfe0fa01
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dfe0fa01

Branch: refs/heads/branch-1.3
Commit: dfe0fa01cce2fefc272c0f05f7d63216be553e03
Parents: d0701d9
Author: Xiangrui Meng <m...@databricks.com>
Authored: Mon Feb 16 22:09:04 2015 -0800
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Feb 16 22:09:12 2015 -0800

----------------------------------------------------------------------
 .../regression/GeneralizedLinearAlgorithm.scala | 29 ++++++++++----------
 1 file changed, 15 insertions(+), 14 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/dfe0fa01/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 17de215..2b71453 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -205,7 +205,7 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
-    /**
+    /*
      * Scaling columns to unit variance as a heuristic to reduce the condition number:
      *
      * During the optimization process, the convergence (rate) depends on the condition number of
@@ -225,26 +225,27 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
      * Currently, it's only enabled in LogisticRegressionWithLBFGS
      */
     val scaler = if (useFeatureScaling) {
-      (new StandardScaler(withStd = true, withMean = false)).fit(input.map(x => x.features))
+      new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
     } else {
       null
     }
 
     // Prepend an extra variable consisting of all 1.0's for the intercept.
-    val data = if (addIntercept) {
-      if (useFeatureScaling) {
-        input.map(labeledPoint =>
-          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
-      } else {
-        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
-      }
-    } else {
-      if (useFeatureScaling) {
-        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+    // TODO: Apply feature scaling to the weight vector instead of input data.
+    val data =
+      if (addIntercept) {
+        if (useFeatureScaling) {
+          input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
+        } else {
+          input.map(lp => (lp.label, appendBias(lp.features))).cache()
+        }
       } else {
-        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+        if (useFeatureScaling) {
+          input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
+        } else {
+          input.map(lp => (lp.label, lp.features))
+        }
       }
-    }
 
     /**
      * TODO: For better convergence, in logistic regression, the intercepts should be computed
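
Illustration (not part of the commit): the motivation stated in the commit message, avoiding re-computing feature vectors on every iteration, can be sketched without Spark. The snippet below is only an analogy under stated assumptions: a lazy Scala collection view stands in for an uncached RDD whose map is re-run on every pass, and a strict collection stands in for the result of .cache(). LabeledPoint, appendBias, and countTransformCalls here are hypothetical stand-ins written for this sketch, not the MLlib definitions.

```scala
object CacheTransformedSketch {
  // Hypothetical stand-in for MLlib's LabeledPoint; not the Spark class.
  final case class LabeledPoint(label: Double, features: Vector[Double])

  // Hypothetical stand-in for appendBias: appends a constant 1.0 for the intercept.
  def appendBias(features: Vector[Double]): Vector[Double] = features :+ 1.0

  private val input: Vector[LabeledPoint] = Vector(
    LabeledPoint(1.0, Vector(0.5, 2.0)),
    LabeledPoint(0.0, Vector(1.5, -1.0)),
    LabeledPoint(1.0, Vector(3.0, 0.25)))

  /** Counts how often the transformation runs over `iterations` passes of the data. */
  def countTransformCalls(iterations: Int, materialize: Boolean): Int = {
    var calls = 0
    def transform(lp: LabeledPoint): (Double, Vector[Double]) = {
      calls += 1
      (lp.label, appendBias(lp.features))
    }

    if (materialize) {
      // Strict map: the transformation runs once per point, up front
      // (the role .cache() plays in the diff above).
      val data = input.map(transform)
      (1 to iterations).foreach(_ => data.foreach(_ => ()))
    } else {
      // Lazy view: the transformation is re-applied on every traversal
      // (analogous to an uncached, transformed RDD).
      val data = input.view.map(transform)
      (1 to iterations).foreach(_ => data.foreach(_ => ()))
    }
    calls
  }

  def main(args: Array[String]): Unit = {
    // 3 points x 5 passes = 15 calls without materializing, 3 calls with.
    println(s"lazy view:    ${countTransformCalls(5, materialize = false)} transform calls")
    println(s"materialized: ${countTransformCalls(5, materialize = true)} transform calls")
  }
}
```

This matches the shape of the change: the three branches that actually transform the input gain .cache(), while the branch that only pairs the label with the untouched features is left uncached.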

