[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52148128

QA results for PR 1897:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52149162

Seems that Jenkins is not stable. Failing on issues related to akka.
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52149464

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52152780

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18527/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1897
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52226905

LGTM. Merged into both master and branch-1.1. Thanks!!
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1897#discussion_r16221810

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
+    /**
+     * Scaling to minimize the condition number:
+     *
+     * During the optimization process, the convergence (rate) depends on the condition number of
+     * the training dataset. Scaling the variables often reduces this condition number, thus
+     * improving the convergence rate dramatically. Without reducing the condition number,
+     * some training datasets mixing the columns with different scales may not be able to converge.
+     *
+     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
+     * the weights in the original scale.
+     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+     *
+     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
+     * the variance of each column (without subtracting the mean), and train the model in the
+     * scaled space. Then we transform the coefficients from the scaled space to the original scale
+     * as GLMNET and LIBSVM do.
+     *
+     * Currently, it's only enabled in LogisticRegressionWithLBFGS
+     */
+    val scaler = if (useFeatureScaling) {
+      (new StandardScaler).fit(input.map(x => x.features))
+    } else {
+      null
+    }
+
     // Prepend an extra variable consisting of all 1.0's for the intercept.
     val data = if (addIntercept) {
-      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      if(useFeatureScaling) {
+        input.map(labeledPoint =>
+          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      }
     } else {
-      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+      if (useFeatureScaling) {
+        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
--- End diff --

Sorry, I didn't realize that.
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52145394

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52145716

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18521/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1897#discussion_r16099170

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
@@ -185,6 +185,58 @@ class LogisticRegressionSuite extends FunSuite with LocalSparkContext with Match
     // Test prediction on Array.
     validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
   }
+
+  test("numerical stability of scaling features using logistic regression with LBFGS") {
+    /**
+     * If we rescale the features, the condition number will be changed so the convergence rate
+     * and the solution will not equal to the original solution multiple by the scaling factor
+     * which it should be.
+     *
+     * However, since in the LogisticRegressionWithLBFGS, we standardize the training dataset first,
+     * no matter how we multiple a scaling factor into the dataset, the convergence rate should be
+     * the same, and the solution should equal to the original solution multiple by the scaling
+     * factor.
+     */
+
+    val nPoints = 1
+    val A = 2.0
+    val B = -1.5
+
+    val testData = LogisticRegressionSuite.generateLogisticInput(A, B, nPoints, 42)
+
+    val initialWeights = Vectors.dense(0.0)
+
+    val testRDD1 = sc.parallelize(testData, 2)
+
+    val testRDD2 = sc.parallelize(
+      testData.map(x => LabeledPoint(x.label, Vectors.fromBreeze(x.features.toBreeze * 1.0E3))), 2)
+
+    val testRDD3 = sc.parallelize(
+      testData.map(x => LabeledPoint(x.label, Vectors.fromBreeze(x.features.toBreeze * 1.0E6))), 2)
+
+    testRDD1.cache()
+    testRDD2.cache()
+    testRDD3.cache()
+
+    val lrA = new LogisticRegressionWithLBFGS().setIntercept(true)
+    val lrB = new LogisticRegressionWithLBFGS().setIntercept(true).setFeatureScaling(false)
+
+    val modelA1 = lrA.run(testRDD1, initialWeights)
+    val modelA2 = lrA.run(testRDD2, initialWeights)
+    val modelA3 = lrA.run(testRDD3, initialWeights)
+
+    val modelB1 = lrB.run(testRDD1, initialWeights)
+    val modelB2 = lrB.run(testRDD2, initialWeights)
+    val modelB3 = lrB.run(testRDD3, initialWeights)
+
+    // Test the weights
+    assert(modelA1.weights(0) ~== modelA2.weights(0) * 1.0E3 absTol 0.01)
+    assert(modelA1.weights(0) ~== modelA3.weights(0) * 1.0E6 absTol 0.01)
+
+    assert(modelB1.weights(0) !~== modelB2.weights(0) * 1.0E3 absTol 0.1)
--- End diff --

need a comment about the purpose of the tests here
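Editorial note: the test above relies on the fact that multiplying a feature by a constant c divides the exact logistic-regression optimum by c, so `w1` should match `w2 * c`. The following is a minimal sketch of that property only, assuming a one-feature model without intercept fitted by Newton's method rather than the L-BFGS optimizer used in the pull request; the object name, the toy data, and the iteration count are all hypothetical.

```scala
object ScaleInvarianceSketch {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Fit a one-feature, no-intercept logistic model with Newton's method (not L-BFGS). */
  def fit(xs: Seq[Double], ys: Seq[Double], iterations: Int = 50): Double = {
    var w = 0.0
    for (_ <- 0 until iterations) {
      // First and second derivative of the negative log-likelihood at w.
      val grad = xs.zip(ys).map { case (x, y) => (sigmoid(w * x) - y) * x }.sum
      val hess = xs.map { x =>
        val p = sigmoid(w * x)
        p * (1.0 - p) * x * x
      }.sum
      w -= grad / hess
    }
    w
  }

  // Hypothetical toy data; the labels are not separable, so a finite optimum exists.
  val xs: Seq[Double] = Seq(-2.0, -1.0, 1.0, 2.0, 0.5, -0.5)
  val ys: Seq[Double] = Seq(0.0, 0.0, 1.0, 1.0, 0.0, 1.0)

  def main(args: Array[String]): Unit = {
    val w1 = fit(xs, ys)
    val w2 = fit(xs.map(_ * 1.0e3), ys) // same data with the feature scaled by 1e3
    println(s"w1 = $w1, w2 * 1e3 = ${w2 * 1.0e3}")
  }
}
```

Newton's method is insensitive to this rescaling, which is why the two printed values agree here; a first-order or quasi-Newton optimizer on unscaled data may not reach the same point in a fixed number of iterations, which appears to be what the `!~==` assertion in the test is checking.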
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1897#discussion_r16099253

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
+    /**
+     * Scaling to minimize the condition number:
--- End diff --

`minimize the condition number` is not accurate. We can say `scaling columns to unit variance as a heuristic to reduce the condition number`.
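Editorial note: to make the "heuristic to reduce the condition number" wording concrete, here is a small sketch that compares the condition number of the Gram matrix X^T X before and after dividing each column by its standard deviation. Using the Gram matrix as a stand-in for the optimizer's curvature is an assumption made for illustration, and the object name and the three-row dataset are hypothetical.

```scala
object ConditionNumberSketch {
  /** Entries (a, b, d) of the Gram matrix X^T X = [[a, b], [b, d]] for two-column data. */
  def gram(rows: Seq[(Double, Double)]): (Double, Double, Double) = {
    val a = rows.map { case (x1, _) => x1 * x1 }.sum
    val b = rows.map { case (x1, x2) => x1 * x2 }.sum
    val d = rows.map { case (_, x2) => x2 * x2 }.sum
    (a, b, d)
  }

  /** Ratio of largest to smallest eigenvalue of a symmetric positive definite 2x2 matrix. */
  def conditionNumber(a: Double, b: Double, d: Double): Double = {
    val mean = (a + d) / 2.0
    val radius = math.sqrt(math.pow((a - d) / 2.0, 2) + b * b)
    val lambdaMax = mean + radius
    // det / lambdaMax gives the smaller eigenvalue without subtracting nearly equal numbers.
    val lambdaMin = (a * d - b * b) / lambdaMax
    lambdaMax / lambdaMin
  }

  // Hypothetical data: the second column is about 1000 times larger than the first.
  val raw: Seq[(Double, Double)] = Seq((1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0))
  // The sample standard deviations of the two columns are 1 and 1000.
  val scaled: Seq[(Double, Double)] = raw.map { case (x1, x2) => (x1 / 1.0, x2 / 1000.0) }

  def main(args: Array[String]): Unit = {
    val (a1, b1, d1) = gram(raw)
    val (a2, b2, d2) = gram(scaled)
    println(s"raw:    ${conditionNumber(a1, b1, d1)}")
    println(s"scaled: ${conditionNumber(a2, b2, d2)}")
  }
}
```

For this dataset the scaled Gram matrix is [[14, 13], [13, 14]] with eigenvalues 27 and 1, so its condition number is 27, while the unscaled one is in the millions. The scaling reduces the condition number but does not minimize it, which matches the reviewer's point.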
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1897#discussion_r16153527

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
+    /**
+     * Scaling to minimize the condition number:
+     *
+     * During the optimization process, the convergence (rate) depends on the condition number of
+     * the training dataset. Scaling the variables often reduces this condition number, thus
+     * improving the convergence rate dramatically. Without reducing the condition number,
+     * some training datasets mixing the columns with different scales may not be able to converge.
+     *
+     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
+     * the weights in the original scale.
+     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+     *
+     * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
+     * the variance of each column (without subtracting the mean), and train the model in the
+     * scaled space. Then we transform the coefficients from the scaled space to the original scale
+     * as GLMNET and LIBSVM do.
+     *
+     * Currently, it's only enabled in LogisticRegressionWithLBFGS
+     */
+    val scaler = if (useFeatureScaling) {
+      (new StandardScaler).fit(input.map(x => x.features))
+    } else {
+      null
+    }
+
     // Prepend an extra variable consisting of all 1.0's for the intercept.
     val data = if (addIntercept) {
-      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      if(useFeatureScaling) {
+        input.map(labeledPoint =>
+          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      }
     } else {
-      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+      if (useFeatureScaling) {
+        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
--- End diff --

It's not an identity map. It's converting labeledPoint to a tuple of response and feature vector for the optimizer.
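Editorial note: the conversion described in this reply can be sketched without Spark as follows. This is an illustration under stated assumptions, not the code from the pull request: `Point` is a hypothetical stand-in for `LabeledPoint`, a plain `Seq` replaces the RDD, `appendBias` here is a local helper that appends a trailing 1.0, and the optional `scale` function stands in for `scaler.transform`.

```scala
object PrepareDataSketch {
  /** Hypothetical stand-in for Spark's LabeledPoint. */
  final case class Point(label: Double, features: Vector[Double])

  /** Local helper: add a constant 1.0 feature for the intercept term. */
  def appendBias(v: Vector[Double]): Vector[Double] = v :+ 1.0

  /**
   * Convert each point to the (response, feature vector) tuple consumed by the optimizer,
   * optionally scaling the features first and optionally appending the bias term.
   */
  def prepare(
      input: Seq[Point],
      addIntercept: Boolean,
      scale: Option[Vector[Double] => Vector[Double]]): Seq[(Double, Vector[Double])] =
    input.map { p =>
      val features = scale.map(f => f(p.features)).getOrElse(p.features)
      (p.label, if (addIntercept) appendBias(features) else features)
    }

  def main(args: Array[String]): Unit = {
    val halve = (v: Vector[Double]) => v.map(_ / 2.0) // toy scaling function
    val out = prepare(Seq(Point(1.0, Vector(2.0, 4.0))), addIntercept = true, scale = Some(halve))
    println(out)
  }
}
```

Even when neither scaling nor the intercept is requested, the output type differs from the input type, which is why the last branch in the diff is a conversion and not a no-op.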
[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1897

[SPARK-2979][MLlib ]Improve the convergence rate by minimize the condition number

Scaling to minimize the condition number:

During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing the columns with different scales may not be able to converge.

GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we will standardize the training features by dividing the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space to the original scale as GLMNET and LIBSVM do.

Currently, it's only enabled in LogisticRegressionWithLBFGS

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1897.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1897

commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-08-08T23:23:21Z

    Improve the convergence rate by minimize the condition number in LOR with LBFGS
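Editorial note: the "train in the scaled space, then return weights in the original scale" step from the description can be sketched as below. This is a minimal illustration, not the pull request's implementation. It assumes the divisor is the per-column sample standard deviation (which is what scaling to unit variance requires), skips the optimizer entirely, and uses made-up scaled-space weights; all names and numbers are hypothetical.

```scala
object FeatureScalingSketch {
  /** Sample standard deviation of each column (n - 1 denominator). */
  def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
    val n = rows.length
    Array.tabulate(rows.head.length) { j =>
      val mean = rows.map(_(j)).sum / n
      val sumSq = rows.map(r => math.pow(r(j) - mean, 2)).sum
      math.sqrt(sumSq / (n - 1))
    }
  }

  /** Divide each feature by its column std; the mean is deliberately not subtracted. */
  def scale(x: Array[Double], std: Array[Double]): Array[Double] =
    x.zip(std).map { case (v, s) => if (s != 0.0) v / s else 0.0 }

  /** Map weights learned in the scaled space back to the original feature scale. */
  def toOriginalScale(wScaled: Array[Double], std: Array[Double]): Array[Double] =
    wScaled.zip(std).map { case (w, s) => if (s != 0.0) w / s else 0.0 }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // Hypothetical training features with very different column scales.
  val rows: Seq[Array[Double]] = Seq(Array(1.0, 1000.0), Array(2.0, 3000.0), Array(3.0, 2000.0))

  def main(args: Array[String]): Unit = {
    val std = columnStd(rows)
    val wScaled = Array(0.5, -0.25) // pretend the optimizer returned these in the scaled space
    val wOrig = toOriginalScale(wScaled, std)
    // The margin w . x is the same whether computed in the scaled or the original space.
    rows.foreach { x =>
      println(s"${dot(wScaled, scale(x, std))} vs ${dot(wOrig, x)}")
    }
  }
}
```

Because each scaled feature is x_j / s_j, dividing the scaled-space weight by the same s_j leaves every margin unchanged, so predictions made with the returned weights on unscaled inputs agree with those of the model trained in the scaled space.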
[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-51862223

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-51865332

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18347/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-51870344

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-51871303

QA tests have started for PR 1897. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-51872737

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18356/consoleFull
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-51873603

QA results for PR 1897:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18358/consoleFull