[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172811882 @dbtsai validating coefficients with R will be harder than expected, `glmnet` requires feature dimension >= 2 and `glm` doesn't yield +/- Infinity intercepts when given all 0/1 datasets... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
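For context on the infinite intercepts mentioned in the comment above, here is a minimal sketch under a simplifying assumption: an unregularized model with only an intercept term, which is not what the PR itself implements. In that setting the maximum-likelihood intercept is the log-odds of the label counts, log(n1 / n0), and it diverges when one class is absent. The names `InterceptOnlyLogOdds` and `intercept` are hypothetical.

```scala
object InterceptOnlyLogOdds {
  // Maximum-likelihood intercept of an unregularized, intercept-only
  // logistic model: the log-odds of the observed label counts.
  // IEEE-754 double arithmetic yields +/-Infinity when one class is absent.
  def intercept(numOnes: Long, numZeros: Long): Double = {
    require(numOnes >= 0 && numZeros >= 0, "counts must be non-negative")
    require(numOnes + numZeros > 0, "need at least one label")
    math.log(numOnes.toDouble / numZeros.toDouble)
  }
}
```

With balanced counts the intercept is 0.0; with only ones it is `Double.PositiveInfinity`, and with only zeros it is `Double.NegativeInfinity`. An iterative solver run for a finite number of steps can only approach these limits, so it reports a large finite value instead.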
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172816195 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49684/ Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172815986
**[Test build #49684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49684/consoleFull)** for PR 10743 at commit [`95816d4`](https://github.com/apache/spark/commit/95816d4c158d7e11cab8ac7bf28b9bb84026d33a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172816192 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172802711 **[Test build #49684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49684/consoleFull)** for PR 10743 at commit [`95816d4`](https://github.com/apache/spark/commit/95816d4c158d7e11cab8ac7bf28b9bb84026d33a).
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10743
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172954547 Merged into master. Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172965625 A follow-up JIRA has been created: https://issues.apache.org/jira/browse/SPARK-12908
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172954097 Ideally, we would like to see the test without an intercept as well, to ensure that nothing gets messed up during the refactoring. I'll merge now, and let's add the test for all-same labels without intercept. Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/10743#discussion_r50006902
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
     val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
-    if (numInvalid != 0) {
-      val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-        s"Found $numInvalid invalid labels."
-      logError(msg)
-      throw new SparkException(msg)
-    }
-
-    if (numClasses > 2) {
-      val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-        s"binary classification. Found $numClasses in the input dataset."
-      logError(msg)
-      throw new SparkException(msg)
-    }
+    val (coefficients, intercept, objectiveHistory) = {
+      if (numInvalid != 0) {
+        val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+          s"Found $numInvalid invalid labels."
+        logError(msg)
+        throw new SparkException(msg)
+      }
-    val featuresMean = summarizer.mean.toArray
-    val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+      if (numClasses > 2) {
+        val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+          s"binary classification. Found $numClasses in the input dataset."
+        logError(msg)
+        throw new SparkException(msg)
+      } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be positive infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+      } else if ($(fitIntercept) && numClasses == 1) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be negative infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+      } else {
+        val featuresMean = summarizer.mean.toArray
+        val featuresStd = summarizer.variance.toArray.map(math.sqrt)
-    val regParamL1 = $(elasticNetParam) * $(regParam)
-    val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+        val regParamL1 = $(elasticNetParam) * $(regParam)
+        val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization),
-      featuresStd, featuresMean, regParamL2)
+        val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
+          $(standardization), featuresStd, featuresMean, regParamL2)
-    val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-      new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-    } else {
-      def regParamL1Fun = (index: Int) => {
-        // Remove the L1 penalization on the intercept
-        if (index == numFeatures) {
-          0.0
+        val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
+          new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
         } else {
-          if ($(standardization)) {
-            regParamL1
-          } else {
-            // If `standardization` is false, we still standardize the data
-            // to improve the rate of convergence; as a result, we have to
-            // perform this reverse standardization by penalizing each component
-            // differently to get effectively the same objective function when
-            // the training dataset is not standardized.
-            if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+          def regParamL1Fun = (index: Int) => {
+            // Remove the L1 penalization on the intercept
+            if (index == numFeatures) {
+              0.0
+            } else {
+              if ($(standardization)) {
+                regParamL1
+              } else {
+                // If `standardization` is false, we still standardize the data
+                // to improve the rate of convergence; as a result, we have to
+                // perform this reverse standardization by penalizing each component
+                // differently to get effectively the same objective function when
+
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/10743#discussion_r50006858
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
     val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
-    if (numInvalid != 0) {
-      val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-        s"Found $numInvalid invalid labels."
-      logError(msg)
-      throw new SparkException(msg)
-    }
-
-    if (numClasses > 2) {
-      val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-        s"binary classification. Found $numClasses in the input dataset."
-      logError(msg)
-      throw new SparkException(msg)
-    }
+    val (coefficients, intercept, objectiveHistory) = {
+      if (numInvalid != 0) {
+        val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+          s"Found $numInvalid invalid labels."
+        logError(msg)
+        throw new SparkException(msg)
+      }
-    val featuresMean = summarizer.mean.toArray
-    val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+      if (numClasses > 2) {
+        val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+          s"binary classification. Found $numClasses in the input dataset."
+        logError(msg)
+        throw new SparkException(msg)
+      } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be positive infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+      } else if ($(fitIntercept) && numClasses == 1) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
--- End diff --

OK
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/10743#discussion_r50007441
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val allZeroModel = lr
+      .setLabelCol("zeroLabel")
+      .fit(sameLabels)
+    assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+    val allOneModel = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+    assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

OK
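The special-case branches in the `LogisticRegression.scala` diff quoted in this thread can be sketched in plain Scala without Spark. This is an illustration only, not the PR's code: `ConstantLabelShortcut` and `shortcut` are hypothetical names, a plain `Array[Double]` of zeros stands in for `Vectors.sparse(numFeatures, Seq())`, and `histogram(i)` is assumed to hold the count of label `i`.

```scala
object ConstantLabelShortcut {
  /**
   * Returns Some((coefficients, intercept)) when every label is the same and
   * an intercept is fitted, so that training could be skipped; otherwise None.
   * Assumes the caller has already rejected invalid labels and numClasses > 2.
   */
  def shortcut(
      histogram: Array[Double],
      numFeatures: Int,
      fitIntercept: Boolean): Option[(Array[Double], Double)] = {
    val numClasses = histogram.length
    val zeros = Array.fill(numFeatures)(0.0)
    if (fitIntercept && numClasses == 2 && histogram(0) == 0.0) {
      // Only label 1 was observed: the intercept diverges to +Infinity.
      Some((zeros, Double.PositiveInfinity))
    } else if (fitIntercept && numClasses == 1) {
      // Only label 0 was observed: the intercept diverges to -Infinity.
      Some((zeros, Double.NegativeInfinity))
    } else {
      // Mixed labels, or no intercept term: fall through to the optimizer.
      None
    }
  }
}
```

With `fitIntercept = false` the helper returns `None`, so the regular optimizer path would still run. That is the case the thread asks to cover with a separate test.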
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172556919 @dbtsai Added `fitIntercept=false` tests and fixed comments/`logWarning` messages.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172561357 **[Test build #49597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49597/consoleFull)** for PR 10743 at commit [`0f4824d`](https://github.com/apache/spark/commit/0f4824d9f358f451968aa2a6ab2b31afd85cda64).
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172561751 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172561755 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49597/ Test FAILed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172561744
**[Test build #49597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49597/consoleFull)** for PR 10743 at commit [`0f4824d`](https://github.com/apache/spark/commit/0f4824d9f358f451968aa2a6ab2b31afd85cda64).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172594732 **[Test build #49600 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49600/consoleFull)** for PR 10743 at commit [`c8c1586`](https://github.com/apache/spark/commit/c8c1586ffd5e5076c6aed564b41d99314099a5a0).
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172606483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49600/ Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172606348
**[Test build #49600 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49600/consoleFull)** for PR 10743 at commit [`c8c1586`](https://github.com/apache/spark/commit/c8c1586ffd5e5076c6aed564b41d99314099a5a0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172606481 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/10743#discussion_r50032693
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
     val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
-    if (numInvalid != 0) {
-      val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-        s"Found $numInvalid invalid labels."
-      logError(msg)
-      throw new SparkException(msg)
-    }
-
-    if (numClasses > 2) {
-      val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-        s"binary classification. Found $numClasses in the input dataset."
-      logError(msg)
-      throw new SparkException(msg)
-    }
+    val (coefficients, intercept, objectiveHistory) = {
+      if (numInvalid != 0) {
+        val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+          s"Found $numInvalid invalid labels."
+        logError(msg)
+        throw new SparkException(msg)
+      }
-    val featuresMean = summarizer.mean.toArray
-    val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+      if (numClasses > 2) {
+        val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+          s"binary classification. Found $numClasses in the input dataset."
+        logError(msg)
+        throw new SparkException(msg)
+      } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be positive infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+      } else if ($(fitIntercept) && numClasses == 1) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be negative infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+      } else {
+        val featuresMean = summarizer.mean.toArray
+        val featuresStd = summarizer.variance.toArray.map(math.sqrt)
-    val regParamL1 = $(elasticNetParam) * $(regParam)
-    val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+        val regParamL1 = $(elasticNetParam) * $(regParam)
+        val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization),
-      featuresStd, featuresMean, regParamL2)
+        val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
+          $(standardization), featuresStd, featuresMean, regParamL2)
-    val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-      new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-    } else {
-      def regParamL1Fun = (index: Int) => {
-        // Remove the L1 penalization on the intercept
-        if (index == numFeatures) {
-          0.0
+        val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
+          new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
         } else {
-          if ($(standardization)) {
-            regParamL1
-          } else {
-            // If `standardization` is false, we still standardize the data
-            // to improve the rate of convergence; as a result, we have to
-            // perform this reverse standardization by penalizing each component
-            // differently to get effectively the same objective function when
-            // the training dataset is not standardized.
-            if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+          def regParamL1Fun = (index: Int) => {
+            // Remove the L1 penalization on the intercept
+            if (index == numFeatures) {
+              0.0
+            } else {
+              if ($(standardization)) {
+                regParamL1
+              } else {
+                // If `standardization` is false, we still standardize the data
+                // to improve the rate of convergence; as a result, we have to
+                // perform this reverse standardization by penalizing each component
+                // differently to get effectively the same objective function when
+                // the
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/10743#discussion_r50032721
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
     val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
-    if (numInvalid != 0) {
-      val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-        s"Found $numInvalid invalid labels."
-      logError(msg)
-      throw new SparkException(msg)
-    }
-
-    if (numClasses > 2) {
-      val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-        s"binary classification. Found $numClasses in the input dataset."
-      logError(msg)
-      throw new SparkException(msg)
-    }
+    val (coefficients, intercept, objectiveHistory) = {
+      if (numInvalid != 0) {
+        val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+          s"Found $numInvalid invalid labels."
+        logError(msg)
+        throw new SparkException(msg)
+      }
-    val featuresMean = summarizer.mean.toArray
-    val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+      if (numClasses > 2) {
+        val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+          s"binary classification. Found $numClasses in the input dataset."
+        logError(msg)
+        throw new SparkException(msg)
+      } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be positive infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+      } else if ($(fitIntercept) && numClasses == 1) {
+        logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be negative infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+      } else {
+        val featuresMean = summarizer.mean.toArray
+        val featuresStd = summarizer.variance.toArray.map(math.sqrt)
-    val regParamL1 = $(elasticNetParam) * $(regParam)
-    val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+        val regParamL1 = $(elasticNetParam) * $(regParam)
+        val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization),
-      featuresStd, featuresMean, regParamL2)
+        val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
+          $(standardization), featuresStd, featuresMean, regParamL2)
-    val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-      new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-    } else {
-      def regParamL1Fun = (index: Int) => {
-        // Remove the L1 penalization on the intercept
-        if (index == numFeatures) {
-          0.0
+        val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
+          new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
         } else {
-          if ($(standardization)) {
-            regParamL1
-          } else {
-            // If `standardization` is false, we still standardize the data
-            // to improve the rate of convergence; as a result, we have to
-            // perform this reverse standardization by penalizing each component
-            // differently to get effectively the same objective function when
-            // the training dataset is not standardized.
-            if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+          def regParamL1Fun = (index: Int) => {
+            // Remove the L1 penalization on the intercept
+            if (index == numFeatures) {
+              0.0
+            } else {
+              if ($(standardization)) {
+                regParamL1
+              } else {
+                // If `standardization` is false, we still standardize the data
+                // to improve the rate of convergence; as a result, we have to
+                // perform this reverse standardization by penalizing each component
+                // differently to get effectively the same objective function when
+                // the
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r50041066 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." 
+logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be negative infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + } else { +val featuresMean = summarizer.mean.toArray +val featuresStd = summarizer.variance.toArray.map(math.sqrt) -val regParamL1 = $(elasticNetParam) * $(regParam) -val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) +val regParamL1 = $(elasticNetParam) * $(regParam) +val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization), - featuresStd, featuresMean, regParamL2) +val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), + $(standardization), featuresStd, featuresMean, regParamL2) -val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { - new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) -} else { - def regParamL1Fun = (index: Int) => { -// Remove the L1 penalization on the intercept -if (index == numFeatures) { - 0.0 +val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { + new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { - if ($(standardization)) { -regParamL1 - } else { -// If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get 
effectively the same objective function when -// the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 + def regParamL1Fun = (index: Int) => { +// Remove the L1 penalization on the intercept +if (index == numFeatures) { + 0.0 +} else { + if ($(standardization)) { +regParamL1 + } else { +// If `standardization` is false, we still standardize the data +// to improve the rate of convergence; as a result, we have to +// perform this reverse standardization by penalizing each component +// differently to get effectively the same objective function when +
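The per-coordinate L1 penalty quoted in the diff above can be sketched in isolation. This is a minimal sketch, not the code in the PR: the object name `L1PenaltySketch` and the function `l1PenaltyFor` are hypothetical, and plain arrays stand in for the summarizer output. It assumes, as the diff comments state, that index `numFeatures` is the intercept slot and that the data is always standardized internally.

```scala
object L1PenaltySketch {
  /** Builds a per-index L1 penalty function.
    * Hypothetical helper: index `featuresStd.length` is assumed to be the intercept slot.
    */
  def l1PenaltyFor(
      regParam: Double,
      elasticNetParam: Double,
      standardization: Boolean,
      featuresStd: Array[Double]): Int => Double = {
    val numFeatures = featuresStd.length
    val regParamL1 = elasticNetParam * regParam
    (index: Int) =>
      if (index == numFeatures) {
        0.0 // never penalize the intercept
      } else if (standardization) {
        regParamL1
      } else if (featuresStd(index) != 0.0) {
        // Undo the internal standardization so the objective matches unscaled data.
        regParamL1 / featuresStd(index)
      } else {
        0.0 // constant feature: nothing to penalize
      }
  }
}
```

With `regParam = 0.5`, `elasticNetParam = 0.5`, and `standardization = false`, a feature with standard deviation 2.0 gets a penalty of 0.125, while a constant feature and the intercept get 0.0.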
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r50041068 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." 
+logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be negative infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + } else { +val featuresMean = summarizer.mean.toArray +val featuresStd = summarizer.variance.toArray.map(math.sqrt) -val regParamL1 = $(elasticNetParam) * $(regParam) -val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) +val regParamL1 = $(elasticNetParam) * $(regParam) +val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization), - featuresStd, featuresMean, regParamL2) +val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), + $(standardization), featuresStd, featuresMean, regParamL2) -val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { - new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) -} else { - def regParamL1Fun = (index: Int) => { -// Remove the L1 penalization on the intercept -if (index == numFeatures) { - 0.0 +val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { + new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { - if ($(standardization)) { -regParamL1 - } else { -// If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get 
effectively the same objective function when -// the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 + def regParamL1Fun = (index: Int) => { +// Remove the L1 penalization on the intercept +if (index == numFeatures) { + 0.0 +} else { + if ($(standardization)) { +regParamL1 + } else { +// If `standardization` is false, we still standardize the data +// to improve the rate of convergence; as a result, we have to +// perform this reverse standardization by penalizing each component +// differently to get effectively the same objective function when +
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49955619

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
+  test("logistic regression with fitIntercept=true and all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
--- End diff --

add extra line.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172401612 LGTM except for a couple of minor issues. Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49955940

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
+  test("logistic regression with fitIntercept=true and all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val allZeroModel = lr
+      .setLabelCol("zeroLabel")
+      .fit(sameLabels)
+    assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+    val allOneModel = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+    assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

Also, check that `objectiveHistory` has length zero.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49955755

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
     val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
-    if (numInvalid != 0) {
-      val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-        s"Found $numInvalid invalid labels."
-      logError(msg)
-      throw new SparkException(msg)
-    }
-
-    if (numClasses > 2) {
-      val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-        s"binary classification. Found $numClasses in the input dataset."
-      logError(msg)
-      throw new SparkException(msg)
-    }
+    val (coefficients, intercept, objectiveHistory) = {
+      if (numInvalid != 0) {
+        val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+          s"Found $numInvalid invalid labels."
+        logError(msg)
+        throw new SparkException(msg)
+      }
-    val featuresMean = summarizer.mean.toArray
-    val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+      if (numClasses > 2) {
+        val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+          s"binary classification. Found $numClasses in the input dataset."
+        logError(msg)
+        throw new SparkException(msg)
+      } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+          s"zeros and the intercept will be positive infinity; as a result, " +
+          s"training is not needed.")
+        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+      } else if ($(fitIntercept) && numClasses == 1) {
+        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
--- End diff --

"All labels are zero"
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49955925 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." 
+logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be negative infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + } else { +val featuresMean = summarizer.mean.toArray +val featuresStd = summarizer.variance.toArray.map(math.sqrt) -val regParamL1 = $(elasticNetParam) * $(regParam) -val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) +val regParamL1 = $(elasticNetParam) * $(regParam) +val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization), - featuresStd, featuresMean, regParamL2) +val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), + $(standardization), featuresStd, featuresMean, regParamL2) -val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { - new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) -} else { - def regParamL1Fun = (index: Int) => { -// Remove the L1 penalization on the intercept -if (index == numFeatures) { - 0.0 +val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { + new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { - if ($(standardization)) { -regParamL1 - } else { -// If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get 
effectively the same objective function when -// the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 + def regParamL1Fun = (index: Int) => { +// Remove the L1 penalization on the intercept +if (index == numFeatures) { + 0.0 +} else { + if ($(standardization)) { +regParamL1 + } else { +// If `standardization` is false, we still standardize the data +// to improve the rate of convergence; as a result, we have to +// perform this reverse standardization by penalizing each component +// differently to get effectively the same objective function when +// the
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49955604

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
+  test("logistic regression with fitIntercept=true and all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val allZeroModel = lr
+      .setLabelCol("zeroLabel")
+      .fit(sameLabels)
+    assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+    val allOneModel = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+    assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

Can you add one test here where all labels are the same but `fitIntercept=false`, to avoid the issue in LiR? Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49946338

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
+  test("logistic regression with all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val model = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+
+    assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(model.intercept === Double.PositiveInfinity)
--- End diff --

Thanks for pointing that out!
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49946694

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") (
            b = \log{P(1) / P(0)} = \log{count_1 / count_0}
          }}}
        */
-      initialCoefficientsWithIntercept.toArray(numFeatures)
-        = math.log(histogram(1) / histogram(0))
-    }
+      if (histogram.length >= 2) { // check to make sure indexing into histogram(1) is safe
+        initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
+          histogram(1) / histogram(0))
--- End diff --

Thank you!
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172328753 @dbtsai @jkbradley ready for second review. The big diff is because I grouped the same-label cases with the normal case, so that `coefficients`, `intercept`, and `objectiveHistory` are all generated in the same block. This avoids repeated code when generating the model summary.
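The grouping described in this comment can be sketched as a single expression that yields all three values, whichever branch is taken. This is a minimal sketch under stated assumptions: `SingleBlockFitSketch`, `Fit`, and the `train` callback are hypothetical names, plain arrays stand in for Spark's `Vectors`, and an `IllegalArgumentException` stands in for `SparkException`.

```scala
object SingleBlockFitSketch {
  // (coefficients, intercept, objectiveHistory); plain arrays stand in for Spark vectors.
  type Fit = (Array[Double], Double, Array[Double])

  /** Hypothetical stand-in for the body of `train`: every branch yields the same triple. */
  def fit(histogram: Array[Double], numFeatures: Int, fitIntercept: Boolean)(
      train: () => Fit): Fit = {
    val (coefficients, intercept, objectiveHistory) = {
      if (histogram.length > 2) {
        throw new IllegalArgumentException(
          s"only binary classification is supported, found ${histogram.length} classes")
      } else if (fitIntercept && histogram.length == 2 && histogram(0) == 0.0) {
        // All labels are one: no training needed.
        (Array.fill(numFeatures)(0.0), Double.PositiveInfinity, Array.empty[Double])
      } else if (fitIntercept && histogram.length == 1) {
        // All labels are zero: no training needed.
        (Array.fill(numFeatures)(0.0), Double.NegativeInfinity, Array.empty[Double])
      } else {
        train() // normal path, supplied by the caller in this sketch
      }
    }
    // A model summary would be built here, once, from the three values.
    (coefficients, intercept, objectiveHistory)
  }
}
```

Because the degenerate branches and the normal branch produce the same tuple shape, whatever follows the block (model construction, summary) is written once rather than per branch.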
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172329504 **[Test build #49554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49554/consoleFull)** for PR 10743 at commit [`d676f62`](https://github.com/apache/spark/commit/d676f6245622cedf61df3875ca0873d77f72a857).
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172334066 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49554/ Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172333931 **[Test build #49554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49554/consoleFull)** for PR 10743 at commit [`d676f62`](https://github.com/apache/spark/commit/d676f6245622cedf61df3875ca0873d77f72a857).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172334065 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-171409858 **[Test build #49325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49325/consoleFull)** for PR 10743 at commit [`caf7a1b`](https://github.com/apache/spark/commit/caf7a1b2cd4336134d1f29e0fb4432a67d44288e).
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-171421491 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49325/ Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
GitHub user feynmanliang opened a pull request:

    https://github.com/apache/spark/pull/10743

    [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/feynmanliang/spark SPARK-12804

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10743.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10743

commit fbf6b5cab51544c9230567e9479528a9bd8960c5
Author: Feynman Liang
Date: 2016-01-13T17:52:56Z

    Initial fix and println unit test

commit e4c13d4a89abc8160f1c2fa906cb3e3d1affd473
Author: Feynman Liang
Date: 2016-01-13T19:23:49Z

    Cleans up test
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49652270

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
+  test("logistic regression with all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val model = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+
+    assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(model.intercept === Double.PositiveInfinity)
--- End diff --

BTW, this bug should not happen when all the labels are one, since the histogram should still have size two.
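The point about histogram size can be illustrated with a small sketch. It assumes the label histogram is an array indexed by label value, so its length is the largest label seen plus one; `LabelHistogramSketch` and its `histogram` function are hypothetical stand-ins, not the summarizer used in the PR.

```scala
object LabelHistogramSketch {
  /** Counts labels into an array indexed by label value.
    * Assumes labels are non-negative integers stored as Double.
    */
  def histogram(labels: Seq[Double]): Array[Double] = {
    require(labels.nonEmpty, "need at least one label")
    val indices = labels.map(_.toInt)
    val counts = Array.fill(indices.max + 1)(0.0)
    indices.foreach(i => counts(i) += 1.0)
    counts
  }
}
```

Under that assumption, all-zero labels give a histogram of length one (so `histogram(1)` is out of bounds), while all-one labels give length two with a zero count in slot 0.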
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49649843

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") (
            b = \log{P(1) / P(0)} = \log{count_1 / count_0}
          }}}
        */
-      initialCoefficientsWithIntercept.toArray(numFeatures)
-        = math.log(histogram(1) / histogram(0))
-    }
+      if (histogram.length >= 2) { // check to make sure indexing into histogram(1) is safe
+        initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
+          histogram(1) / histogram(0))
--- End diff --

In this case, the whole training step can be skipped. Currently, we only support binary LoR, so the maximum `histogram.length` is two. In LiR, when `yStd == 0.0`, the model is returned immediately without training; see https://github.com/feynmanliang/spark/blob/SPARK-12804/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L226

We can do something similar here, like:

```scala
if (histogram.length == 2) {
  if (histogram(0) == 0.0) {
    // All labels are one: return without training.
    val model = new LogisticRegressionModel(uid, Vectors.sparse(numFeatures, Seq()),
      Double.PositiveInfinity)
    return model
  } else {
    initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
      histogram(1) / histogram(0))
  }
} else if (histogram.length == 1) {
  // All labels are zero: return without training.
  val model = new LogisticRegressionModel(uid, Vectors.sparse(numFeatures, Seq()),
    Double.NegativeInfinity)
  return model
} else {
  // throw some exception
}
```
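The infinite intercepts follow from the log-odds prior `b = \log{count_1 / count_0}` quoted in the diff. A minimal sketch of that arithmetic, assuming a histogram indexed by label value; `LogOddsSketch` and `logOddsIntercept` are hypothetical names used only for illustration:

```scala
object LogOddsSketch {
  /** b = log(count_1 / count_0).
    * Returns None when the histogram has no slot for label 1.
    */
  def logOddsIntercept(histogram: Array[Double]): Option[Double] =
    if (histogram.length >= 2) Some(math.log(histogram(1) / histogram(0)))
    else None
}
```

When `count_0` is zero the ratio is infinite and the log-odds is positive infinity, which matches the intercept returned for all-one labels. When only label 0 is seen and the histogram has length one, there is no `count_1` to index, which is why that case needs its own branch returning negative infinity.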
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-171421277 **[Test build #49325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49325/consoleFull)** for PR 10743 at commit [`caf7a1b`](https://github.com/apache/spark/commit/caf7a1b2cd4336134d1f29e0fb4432a67d44288e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-171421490 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10743#discussion_r49650452

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
     assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
+  test("logistic regression with all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val model = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+
+    assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(model.intercept === Double.PositiveInfinity)
--- End diff --

Can you add another test showing that fitting on the all-zero `zeroLabel` column returns an intercept of `Double.NegativeInfinity`? Thanks.