[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52148128
  
QA results for PR 1897:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149162
  
It seems that Jenkins is not stable; the build is failing on issues related to Akka.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149464
  
QA tests have started for PR 1897. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52152780
  
QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1897





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52226905
  
LGTM. Merged into both master and branch-1.1. Thanks!!





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16221810
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
+    /**
+     * Scaling to minimize the condition number:
+     *
+     * During the optimization process, the convergence (rate) depends on the condition number of
+     * the training dataset. Scaling the variables often reduces this condition number, thus
+     * improving the convergence rate dramatically. Without reducing the condition number,
+     * some training datasets mixing the columns with different scales may not be able to converge.
+     *
+     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
+     * the weights in the original scale.
+     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+     *
+     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
+     * the variance of each column (without subtracting the mean), and train the model in the
+     * scaled space. Then we transform the coefficients from the scaled space to the original scale
+     * as GLMNET and LIBSVM do.
+     *
+     * Currently, it's only enabled in LogisticRegressionWithLBFGS
+     */
+    val scaler = if (useFeatureScaling) {
+      (new StandardScaler).fit(input.map(x => x.features))
+    } else {
+      null
+    }
+
     // Prepend an extra variable consisting of all 1.0's for the intercept.
     val data = if (addIntercept) {
-      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      if(useFeatureScaling) {
+        input.map(labeledPoint =>
+          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      }
     } else {
-      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+      if (useFeatureScaling) {
+        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
--- End diff --

Sorry, I didn't realize that.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52145394
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52145716
  
QA tests have started for PR 1897. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-12 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16099170
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
@@ -185,6 +185,58 @@ class LogisticRegressionSuite extends FunSuite with LocalSparkContext with Match
     // Test prediction on Array.
     validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
   }
+
+  test("numerical stability of scaling features using logistic regression with LBFGS") {
+    /**
+     * If we rescale the features, the condition number will be changed so the convergence rate
+     * and the solution will not equal to the original solution multiple by the scaling factor
+     * which it should be.
+     *
+     * However, since in the LogisticRegressionWithLBFGS, we standardize the training dataset first,
+     * no matter how we multiple a scaling factor into the dataset, the convergence rate should be
+     * the same, and the solution should equal to the original solution multiple by the scaling
+     * factor.
+     */
+
+    val nPoints = 1
+    val A = 2.0
+    val B = -1.5
+
+    val testData = LogisticRegressionSuite.generateLogisticInput(A, B, nPoints, 42)
+
+    val initialWeights = Vectors.dense(0.0)
+
+    val testRDD1 = sc.parallelize(testData, 2)
+
+    val testRDD2 = sc.parallelize(
+      testData.map(x => LabeledPoint(x.label, Vectors.fromBreeze(x.features.toBreeze * 1.0E3))), 2)
+
+    val testRDD3 = sc.parallelize(
+      testData.map(x => LabeledPoint(x.label, Vectors.fromBreeze(x.features.toBreeze * 1.0E6))), 2)
+
+    testRDD1.cache()
+    testRDD2.cache()
+    testRDD3.cache()
+
+    val lrA = new LogisticRegressionWithLBFGS().setIntercept(true)
+    val lrB = new LogisticRegressionWithLBFGS().setIntercept(true).setFeatureScaling(false)
+
+    val modelA1 = lrA.run(testRDD1, initialWeights)
+    val modelA2 = lrA.run(testRDD2, initialWeights)
+    val modelA3 = lrA.run(testRDD3, initialWeights)
+
+    val modelB1 = lrB.run(testRDD1, initialWeights)
+    val modelB2 = lrB.run(testRDD2, initialWeights)
+    val modelB3 = lrB.run(testRDD3, initialWeights)
+
+    // Test the weights
+    assert(modelA1.weights(0) ~== modelA2.weights(0) * 1.0E3 absTol 0.01)
+    assert(modelA1.weights(0) ~== modelA3.weights(0) * 1.0E6 absTol 0.01)
+
+    assert(modelB1.weights(0) !~== modelB2.weights(0) * 1.0E3 absTol 0.1)
--- End diff --

This needs a comment explaining the purpose of the tests here.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-12 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16099253
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
+    /**
+     * Scaling to minimize the condition number:
--- End diff --

`minimize the condition number` is not accurate. We can say `scaling 
columns to unit variance as a heuristic to reduce the condition number`.
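
To illustrate why this is a heuristic rather than a minimization, here is a minimal sketch in plain Scala (not code from the pull request). It assumes the condition number of interest is that of the 2x2 Gram matrix X^T X for a two-column dataset, and it scales each column by its sample standard deviation without centering. All names (`ConditionNumberSketch`, `gram`, `conditionNumber`, `scaleColumns`) are hypothetical.

```scala
// Illustrative sketch only; every name here is hypothetical.
object ConditionNumberSketch {
  // Entries (a, b, c) of the Gram matrix X^T X = [[a, b], [b, c]]
  // for a dataset with two feature columns.
  def gram(rows: Seq[(Double, Double)]): (Double, Double, Double) = {
    val a = rows.map { case (x, _) => x * x }.sum
    val b = rows.map { case (x, y) => x * y }.sum
    val c = rows.map { case (_, y) => y * y }.sum
    (a, b, c)
  }

  // Condition number (largest eigenvalue over smallest eigenvalue) of a
  // symmetric positive definite 2x2 matrix, via the closed-form eigenvalues.
  def conditionNumber(g: (Double, Double, Double)): Double = {
    val (a, b, c) = g
    val mid = (a + c) / 2.0
    val radius = math.sqrt((a - c) * (a - c) / 4.0 + b * b)
    (mid + radius) / (mid - radius)
  }

  // Divide each column by its sample standard deviation, without centering.
  // Assumes at least two rows and non-constant columns.
  def scaleColumns(rows: Seq[(Double, Double)]): Seq[(Double, Double)] = {
    def std(xs: Seq[Double]): Double = {
      val m = xs.sum / xs.length
      math.sqrt(xs.map(v => (v - m) * (v - m)).sum / (xs.length - 1))
    }
    val sx = std(rows.map(_._1))
    val sy = std(rows.map(_._2))
    rows.map { case (x, y) => (x / sx, y / sy) }
  }
}
```

Comparing `conditionNumber(gram(rows))` with `conditionNumber(gram(scaleColumns(rows)))` on columns of very different magnitude should show a large drop, but nothing in the scaling step guarantees that the result is the smallest achievable condition number.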





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16153527
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
+    /**
+     * Scaling to minimize the condition number:
+     *
+     * During the optimization process, the convergence (rate) depends on the condition number of
+     * the training dataset. Scaling the variables often reduces this condition number, thus
+     * improving the convergence rate dramatically. Without reducing the condition number,
+     * some training datasets mixing the columns with different scales may not be able to converge.
+     *
+     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
+     * the weights in the original scale.
+     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+     *
+     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
+     * the variance of each column (without subtracting the mean), and train the model in the
+     * scaled space. Then we transform the coefficients from the scaled space to the original scale
+     * as GLMNET and LIBSVM do.
+     *
+     * Currently, it's only enabled in LogisticRegressionWithLBFGS
+     */
+    val scaler = if (useFeatureScaling) {
+      (new StandardScaler).fit(input.map(x => x.features))
+    } else {
+      null
+    }
+
     // Prepend an extra variable consisting of all 1.0's for the intercept.
     val data = if (addIntercept) {
-      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      if(useFeatureScaling) {
+        input.map(labeledPoint =>
+          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      }
     } else {
-      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+      if (useFeatureScaling) {
+        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
--- End diff --

It's not an identity map. It converts each labeledPoint into a tuple of
response and feature vector for the optimizer.
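
As an illustration of that conversion (a sketch in plain Scala, not code from the pull request), the four branches in the diff all produce a `(label, features)` tuple and differ only in whether the features are scaled and whether a bias term is added. `Point`, `withBias`, and `toOptimizerInput` are hypothetical stand-ins for `LabeledPoint`, `appendBias`, and the mapping step; plain arrays replace MLlib vectors, and the sketch assumes the bias is a trailing 1.0.

```scala
// Illustrative sketch only; every name here is hypothetical.
object TupleMappingSketch {
  // Stand-in for a labeled training example.
  final case class Point(label: Double, features: Array[Double])

  // Stand-in for a bias helper; this sketch appends a constant 1.0.
  def withBias(features: Array[Double]): Array[Double] = features :+ 1.0

  // Convert each point into the (response, feature vector) tuple that an
  // optimizer would consume, optionally scaling and adding the bias term.
  def toOptimizerInput(
      input: Seq[Point],
      addIntercept: Boolean,
      scale: Option[Array[Double] => Array[Double]]): Seq[(Double, Array[Double])] =
    input.map { p =>
      val scaled = scale.fold(p.features)(f => f(p.features))
      (p.label, if (addIntercept) withBias(scaled) else scaled)
    }
}
```

Even with `addIntercept = false` and `scale = None`, the result is a sequence of tuples rather than the original points, which is why the last branch is not a no-op.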





[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

2014-08-11 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1897

[SPARK-2979][MLlib ]Improve the convergence rate by minimize the condition 
number

Scaling to minimize the condition number:
During the optimization process, the convergence (rate) depends on the 
condition number of the training dataset. Scaling the variables often reduces 
this condition number, thus improving the convergence rate dramatically. Without 
reducing the condition number, some training datasets mixing the columns with 
different scales may not be able to converge.
GLMNET and LIBSVM packages perform the scaling to reduce the condition 
number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
Here, if useFeatureScaling is enabled, we will standardize the training 
features by dividing the variance of each column (without subtracting the 
mean), and train the model in the scaled space. Then we transform the 
coefficients from the scaled space to the original scale as GLMNET and LIBSVM 
do.
Currently, it's only enabled in LogisticRegressionWithLBFGS
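
The scale-then-transform-back step described above can be sketched in plain Scala as follows. This is a minimal illustration, not the code in this pull request: it assumes each column is divided by its sample standard deviation (which is what scaling to unit variance amounts to), it maps zero-variance columns to 0.0, and it uses plain arrays instead of MLlib vectors. All names (`ScalingSketch`, `columnStd`, `scale`, `toOriginalScale`, `dot`) are hypothetical.

```scala
// Illustrative sketch only; every name here is hypothetical.
object ScalingSketch {
  // Per-column sample standard deviation (n - 1 in the denominator).
  def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.length > 1, "need at least two rows")
    val n = rows.length
    val dim = rows.head.length
    Array.tabulate(dim) { j =>
      val mean = rows.map(_(j)).sum / n
      val sumSq = rows.map(r => (r(j) - mean) * (r(j) - mean)).sum
      math.sqrt(sumSq / (n - 1))
    }
  }

  // Divide each feature by its column's standard deviation, without centering.
  // A zero-variance column is mapped to 0.0 (an assumption of this sketch).
  def scale(x: Array[Double], std: Array[Double]): Array[Double] =
    x.zip(std).map { case (v, s) => if (s != 0.0) v / s else 0.0 }

  // Map weights learned on scaled features back to the original scale,
  // using the identity wScaled . (x / std) == (wScaled / std) . x
  def toOriginalScale(wScaled: Array[Double], std: Array[Double]): Array[Double] =
    wScaled.zip(std).map { case (w, s) => if (s != 0.0) w / s else 0.0 }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum
}
```

Because the margin `dot(w, x)` is the same in both spaces, a model trained on scaled features and converted with `toOriginalScale` makes the same predictions on unscaled inputs.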


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1897


commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-08-08T23:23:21Z

Improve the convergence rate by minimize the condition number in LOR with 
LBFGS







[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-51862223
  
QA tests have started for PR 1897. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-51865332
  
QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-51870344
  
QA tests have started for PR 1897. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-51871303
  
QA tests have started for PR 1897. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-51872737
  
QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-51873603
  
QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull

