[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-175381442 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50158/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-175381440 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-175380873 **[Test build #50158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50158/consoleFull)** for PR 10788 at commit [`8016ad8`](https://github.com/apache/spark/commit/8016ad814a359e2e8d300c84b52a1a021f13b9dc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10788
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-175341628 Thanks. Merged into master.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-175341287 **[Test build #50158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50158/consoleFull)** for PR 10788 at commit [`8016ad8`](https://github.com/apache/spark/commit/8016ad814a359e2e8d300c84b52a1a021f13b9dc).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931598

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
     val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    -if ($(fitIntercept)) {
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    + val vec = optInitialModel.get.coefficients
    + logWarning(
    +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +}
    +
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    + optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +initialCoefficientsWithInterceptArray(index) = value
    + }
    + if ($(fitIntercept)) {
    +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    + }
    +} else if ($(fitIntercept)) {
     /*
     For binary logistic regression, when we initialize the coefficients as zeros,
     it will converge faster if we initialize the intercept such that
     it follows the distribution of the labels.
     {{{
    - P(0) = 1 / (1 + \exp(b)), and
    - P(1) = \exp(b) / (1 + \exp(b))
    + P(0) = 1 / (1 + \exp(b)), and
    + P(1) = \exp(b) / (1 + \exp(b))
     }}}, hence
     {{{
    - b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    + b = \log{P(1) / P(0)} = \log{count_1 / count_0}
     }}}
     */
    - initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
    -histogram(1) / histogram(0))
    + initialCoefficientsWithIntercept.toArray(numFeatures)
    += math.log(histogram(1) / histogram(0))
    --- End diff --

    revert this change.
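The code comment quoted in the diff above argues that, with all coefficients initialized to zero, an intercept of b = log(count_1 / count_0) reproduces the label distribution. A minimal sketch of that arithmetic in plain Scala follows; `InterceptInit`, `initialIntercept`, and `probabilityOfOne` are hypothetical names used only for illustration and are not part of Spark.

```scala
// Hypothetical helper, not Spark API: illustrates the intercept
// initialization described in the quoted code comment.
object InterceptInit {
  // b = log(P(1) / P(0)) = log(count_1 / count_0)
  def initialIntercept(count0: Double, count1: Double): Double = {
    require(count0 > 0.0 && count1 > 0.0, "both classes must be present")
    math.log(count1 / count0)
  }

  // P(1) under an intercept-only model: exp(b) / (1 + exp(b))
  def probabilityOfOne(b: Double): Double =
    math.exp(b) / (1.0 + math.exp(b))
}
```

With 75 negative and 25 positive labels, the intercept is log(25 / 75) and the implied P(1) is 0.25, which is the observed fraction of positive labels.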
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931620

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
     val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    -if ($(fitIntercept)) {
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    + val vec = optInitialModel.get.coefficients
    + logWarning(
    +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +}
    +
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    + optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +initialCoefficientsWithInterceptArray(index) = value
    + }
    + if ($(fitIntercept)) {
    +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    + }
    +} else if ($(fitIntercept)) {
     /*
     For binary logistic regression, when we initialize the coefficients as zeros,
     it will converge faster if we initialize the intercept such that
     it follows the distribution of the labels.
     {{{
    - P(0) = 1 / (1 + \exp(b)), and
    - P(1) = \exp(b) / (1 + \exp(b))
    + P(0) = 1 / (1 + \exp(b)), and
    --- End diff --

    put two spaces back.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931610

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
     val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    -if ($(fitIntercept)) {
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    + val vec = optInitialModel.get.coefficients
    + logWarning(
    +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +}
    +
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    + optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +initialCoefficientsWithInterceptArray(index) = value
    + }
    + if ($(fitIntercept)) {
    +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    + }
    +} else if ($(fitIntercept)) {
     /*
     For binary logistic regression, when we initialize the coefficients as zeros,
     it will converge faster if we initialize the intercept such that
     it follows the distribution of the labels.
     {{{
    - P(0) = 1 / (1 + \exp(b)), and
    - P(1) = \exp(b) / (1 + \exp(b))
    + P(0) = 1 / (1 + \exp(b)), and
    + P(1) = \exp(b) / (1 + \exp(b))
     }}}, hence
     {{{
    - b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    + b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    --- End diff --

    put two spaces back.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50931631

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
     val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    -if ($(fitIntercept)) {
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
    + val vec = optInitialModel.get.coefficients
    + logWarning(
    +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +}
    +
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
    + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    + optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +initialCoefficientsWithInterceptArray(index) = value
    + }
    + if ($(fitIntercept)) {
    +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    + }
    +} else if ($(fitIntercept)) {
     /*
     For binary logistic regression, when we initialize the coefficients as zeros,
     it will converge faster if we initialize the intercept such that
     it follows the distribution of the labels.
     {{{
    - P(0) = 1 / (1 + \exp(b)), and
    - P(1) = \exp(b) / (1 + \exp(b))
    + P(0) = 1 / (1 + \exp(b)), and
    + P(1) = \exp(b) / (1 + \exp(b))
    --- End diff --

    ditto
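The hunk under review seeds the optimizer from a user-supplied initial model when its coefficient count matches `numFeatures`, and otherwise warns and falls back to the default initialization. A simplified sketch of that control flow is shown here, assuming plain arrays in place of Spark's `Vector` and `LogisticRegressionModel`; `InitialCoefficients`, `InitialModel`, `build`, and `defaultIntercept` are hypothetical names, not Spark API.

```scala
// Hypothetical, simplified stand-in for the logic in the quoted hunk.
object InitialCoefficients {
  final case class InitialModel(coefficients: Array[Double], intercept: Double)

  def build(
      numFeatures: Int,
      fitIntercept: Boolean,
      initialModel: Option[InitialModel],
      defaultIntercept: => Double): Array[Double] = {
    // One extra slot at the end holds the intercept when it is fitted.
    val result = Array.fill(if (fitIntercept) numFeatures + 1 else numFeatures)(0.0)
    initialModel match {
      case Some(m) if m.coefficients.length == numFeatures =>
        Array.copy(m.coefficients, 0, result, 0, numFeatures)
        // Assignment with `=`, so the supplied intercept is stored.
        if (fitIntercept) result(numFeatures) = m.intercept
      case other =>
        // Size mismatch: warn, then use the default initialization.
        other.foreach { m =>
          Console.err.println(
            s"Initial coefficients of size ${m.coefficients.length} " +
              s"did not match the expected size $numFeatures")
        }
        if (fitIntercept) result(numFeatures) = defaultIntercept
    }
    result
  }
}
```

One detail worth noting: as quoted, the intercept line in the hunk uses `==`, which in Scala is a comparison whose result is discarded, so that statement would leave the intercept slot unchanged. The sketch uses `=` instead.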
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174715010 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174715017 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50024/ Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174714408 **[Test build #50024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50024/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174690199 **[Test build #50024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50024/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174688059 Jenkins retest this please
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174682293 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174682295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50011/ Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174682157 **[Test build #50011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50011/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174665118 **[Test build #50011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50011/consoleFull)** for PR 10788 at commit [`e6b797a`](https://github.com/apache/spark/commit/e6b797a51696238c3b7b369c77be9763e7d70b52).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174662125 I'm going through the caching logic now. Will let you know soon. Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174661379 @dbtsai should have addressed the style concerns, let me know if anything else shows up :)
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-173519471 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-173519472 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49871/ Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-173519318 **[Test build #49871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49871/consoleFull)** for PR 10788 at commit [`46ae406`](https://github.com/apache/spark/commit/46ae406e7d9935ba2d75a092e98622578fb4ce15).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-173506673 **[Test build #49871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49871/consoleFull)** for PR 10788 at commit [`46ae406`](https://github.com/apache/spark/commit/46ae406e7d9935ba2d75a092e98622578fb4ce15).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50372852

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
     new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
     }
     }
    +
    + /**
    + * Run Logistic Regression with the configured parameters on an input RDD
    + * of LabeledPoint entries.
    + *
    + * If a known updater is used calls the ml implementation, to avoid
    + * applying a regularization penalty to the intercept, otherwise
    + * defaults to the mllib implementation. If more than two classes
    + * or feature scaling is disabled, always uses mllib implementation.
    + * If using ml implementation, uses ml code to generate initial weights.
    + */
    + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +run(input, generateInitialWeights(input), userSuppliedWeights = false)
    + }
    +
    + /**
    + * Run Logistic Regression with the configured parameters on an input RDD
    + * of LabeledPoint entries starting from the initial weights provided.
    + *
    + * If a known updater is used calls the ml implementation, to avoid
    + * applying a regularization penalty to the intercept, otherwise
    + * defaults to the mllib implementation. If more than two classes
    + * or feature scaling is disabled, always uses mllib implementation.
    + * Uses user provided weights.
    + */
    + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +run(input, initialWeights, userSuppliedWeights = true)
    + }
    +
    + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    + LogisticRegressionModel = {
    +// ml's Logisitic regression only supports binary classifcation currently.
    +if (numOfLinearPredictor == 1) {
    + def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +// Prepare the ml LogisticRegression based on our settings
    +val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +lr.setRegParam(optimizer.getRegParam())
    +lr.setElasticNetParam(elasticNetParam)
    +lr.setStandardization(useFeatureScaling)
    +if (userSuppliedWeights) {
    + val uid = Identifiable.randomUID("logreg-static")
    + lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
    +uid, initialWeights, 1.0))
    +}
    +lr.setFitIntercept(addIntercept)
    +lr.setMaxIter(optimizer.getNumIterations())
    +lr.setTol(optimizer.getConvergenceTol())
    +// Convert our input into a DataFrame
    +val sqlContext = new SQLContext(input.context)
    +import sqlContext.implicits._
    +val df = input.toDF()
    +// Determine if we should cache the DF
    +val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    --- End diff --

    Good point, in a previous version of the code we passed handlePersistence down through to avoid this. I've updated it to do the same here.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50372566

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
     val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    -if ($(fitIntercept)) {
    - /*
    - For binary logistic regression, when we initialize the coefficients as zeros,
    - it will converge faster if we initialize the intercept such that
    - it follows the distribution of the labels.
    -
    - {{{
    - P(0) = 1 / (1 + \exp(b)), and
    - P(1) = \exp(b) / (1 + \exp(b))
    - }}}, hence
    - {{{
    - b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    - }}}
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    + val vec = optInitialModel.get.coefficients
    + logWarning(
    +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
    +}
    +
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {
    + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
    + optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    +initialCoefficientsWithInterceptArray(index) = value
    + }
    + if ($(fitIntercept)) {
    +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
    + }
    +} else if ($(fitIntercept)) {
    + /**
    + * For binary logistic regression, when we initialize the coefficients as zeros,
    + * it will converge faster if we initialize the intercept such that
    + * it follows the distribution of the labels.
    +
    --- End diff --

    Ok, looking at the rest of the comments in the file & the style guide it seems to mostly have the `*` but I'll put them back in (it also breaks auto-indent to not have them, but that's an Emacs bug).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50372397

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
     val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    -if ($(fitIntercept)) {
    - /*
    - For binary logistic regression, when we initialize the coefficients as zeros,
    - it will converge faster if we initialize the intercept such that
    - it follows the distribution of the labels.
    -
    - {{{
    - P(0) = 1 / (1 + \exp(b)), and
    - P(1) = \exp(b) / (1 + \exp(b))
    - }}}, hence
    - {{{
    - b = \log{P(1) / P(0)} = \log{count_1 / count_0}
    - }}}
    +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {
    + val vec = optInitialModel.get.coefficients
    --- End diff --

    It's used on L348 in the log warning.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-173493541 LGTM except for some styling issues and a concern about caching twice. Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50371017 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, userSuppliedWeights = true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +lr.setStandardization(useFeatureScaling) +if (userSuppliedWeights) { + val uid = Identifiable.randomUID("logreg-static") + lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel( +uid, initialWeights, 1.0)) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} +// convert the model +val weights = mlLogisticRegresionModel.weights match { + case x: DenseVector => x + case y: Vector => Vectors.dense(y.toArray) +} +createModel(weights, mlLogisticRegresionModel.intercept) + } + optimizer.getUpdater() match { --- End diff -- Okay, this will make the test harder to write. I don't care about this one for now.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50370414 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, userSuppliedWeights = true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +lr.setStandardization(useFeatureScaling) +if (userSuppliedWeights) { + val uid = Identifiable.randomUID("logreg-static") + lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel( +uid, initialWeights, 1.0)) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} +// convert the model +val weights = mlLogisticRegresionModel.weights match { + case x: DenseVector => x + case y: Vector => Vectors.dense(y.toArray) +} +createModel(weights, mlLogisticRegresionModel.intercept) + } + optimizer.getUpdater() match { --- End diff -- When `optimizer.getRegParam() == 0.0`, run the old version.
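To make the dispatch being discussed concrete, here is a minimal sketch of how the choice between the two code paths might look once the `regParam == 0.0` case is routed to the old implementation. The object, the `Backend` type, and the `knownUpdater` flag are hypothetical names for illustration only and are not taken from the patch; the conditions are assumed from the quoted doc comment and the review remark.

```scala
// Hypothetical sketch: none of these names come from the PR itself.
object BackendChoice {
  sealed trait Backend
  case object MlImplementation extends Backend
  case object MllibImplementation extends Backend

  // Assumed rule: use the ml path only for binary problems with a known
  // updater and a non-zero regularization parameter; otherwise fall back
  // to the old mllib path.
  def choose(numOfLinearPredictor: Int, regParam: Double, knownUpdater: Boolean): Backend =
    if (numOfLinearPredictor == 1 && regParam != 0.0 && knownUpdater) MlImplementation
    else MllibImplementation
}
```

With no regularization there is presumably no penalty on the intercept to avoid, which would explain why the old version is acceptable in that case.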
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50370273 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, userSuppliedWeights = true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +lr.setStandardization(useFeatureScaling) +if (userSuppliedWeights) { + val uid = Identifiable.randomUID("logreg-static") + lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel( +uid, initialWeights, 1.0)) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} +// convert the model +val weights = mlLogisticRegresionModel.weights match { --- End diff -- ```scala val weights = Vectors.dense(mlLogisticRegresionModel.coefficients.toArray) ```
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50370169 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, userSuppliedWeights = true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +lr.setStandardization(useFeatureScaling) +if (userSuppliedWeights) { + val uid = Identifiable.randomUID("logreg-static") + lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel( +uid, initialWeights, 1.0)) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE --- End diff -- Will this cause double caching? Say the input RDD is already cached, so `handlePersistence` will be false and `df` is not persisted here. As a result, the `df == StorageLevel.NONE` check in ml's LOR code will be true, and that will cause caching twice.
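The double-caching concern can be traced with a toy model that uses no Spark classes at all. The sketch below assumes, as the comment above describes, that the wrapper persists the DataFrame only when the input RDD is uncached, and that the inner ml estimator persists the DataFrame whenever it finds it unpersisted. `DoubleCachingSketch` and `cachedCopies` are made-up names for illustration.

```scala
// Toy model of the persistence decisions; not Spark code.
object DoubleCachingSketch {
  // Returns how many cached copies of the data exist while training runs.
  def cachedCopies(inputCached: Boolean): Int = {
    // Wrapper logic from the quoted diff: persist df only if the input RDD is not cached.
    val handlePersistence = !inputCached
    val dfPersistedByWrapper = handlePersistence
    // Assumed inner behaviour: the ml estimator persists df when df is not yet persisted.
    val dfPersistedByMl = !dfPersistedByWrapper
    val rddCopies = if (inputCached) 1 else 0
    val dfCopies = if (dfPersistedByWrapper || dfPersistedByMl) 1 else 0
    rddCopies + dfCopies
  }
}
```

Under these assumptions a cached input ends up with two cached copies (the RDD and the DataFrame), while an uncached input ends up with one.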
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369730 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -374,11 +395,11 @@ class LogisticRegression @Since("1.2.0") ( throw new SparkException(msg) } -/* - The coefficients are trained in the scaled space; we're converting them back to - the original space. - Note that the intercept in scaled space and original space is the same; - as a result, no scaling is needed. +/** + * The coefficients are trained in the scaled space; we're converting them back to --- End diff -- ditto
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369722 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { + /** + * For binary logistic regression, when we initialize the coefficients as zeros, + * it will converge faster if we initialize the intercept such that + * it follows the distribution of the labels. 
+ + * {{{ + * P(0) = 1 / (1 + \exp(b)), and + * P(1) = \exp(b) / (1 + \exp(b)) + * }}}, hence + * {{{ + * b = \log{P(1) / P(0)} = \log{count_1 / count_0} + * }}} */ - initialCoefficientsWithIntercept.toArray(numFeatures) = math.log( -histogram(1) / histogram(0)) + initialCoefficientsWithIntercept.toArray(numFeatures) + = math.log(histogram(1) / histogram(0)) } val states = optimizer.iterations(new CachedDiffFunction(costFun), initialCoefficientsWithIntercept.toBreeze.toDenseVector) -/* - Note that in Logistic Regression, the objective history (loss + regularization) - is log-likelihood which is invariance under feature standardization. As a result, - the objective history from optimizer is the same as the one in the original space. +/** + * Note that in Logistic Regression, the objective history (loss + regularization) --- End diff -- Revert the style change.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369668 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { + /** + * For binary logistic regression, when we initialize the coefficients as zeros, + * it will converge faster if we initialize the intercept such that + * it follows the distribution of the labels. + --- End diff -- I think you have to remove all the `*`. I think we decided to write comments like ``` /* Start the sentence. */ ```
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369552 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { + /** + * For binary logistic regression, when we initialize the coefficients as zeros, + * it will converge faster if we initialize the intercept such that + * it follows the distribution of the labels. + --- End diff -- remove the extra line
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369516 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { + /** + * For binary logistic regression, when we initialize the coefficients as zeros, + * it will converge faster if we initialize the intercept such that + * it follows the distribution of the labels. 
+ + * {{{ + * P(0) = 1 / (1 + \exp(b)), and + * P(1) = \exp(b) / (1 + \exp(b)) + * }}}, hence + * {{{ + * b = \log{P(1) / P(0)} = \log{count_1 / count_0} + * }}} */ - initialCoefficientsWithIntercept.toArray(numFeatures) = math.log( -histogram(1) / histogram(0)) + initialCoefficientsWithIntercept.toArray(numFeatures) + = math.log(histogram(1) / histogram(0)) --- End diff -- add two spaces.
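For readers following the line under review, the intercept initialization described in the quoted doc comment can be checked in isolation. The sketch below is a standalone illustration, not the patch's code: `InterceptInit`, `initialIntercept`, and `probabilityOfOne` are hypothetical names, and `histogram` is assumed to hold the counts of label 0 and label 1 at indices 0 and 1.

```scala
// Standalone check of b = log(count_1 / count_0); names are illustrative.
object InterceptInit {
  // histogram(0) and histogram(1) are the counts of label 0 and label 1.
  def initialIntercept(histogram: Array[Double]): Double =
    math.log(histogram(1) / histogram(0))

  // P(1) implied by intercept b when all coefficients are zero.
  def probabilityOfOne(b: Double): Double =
    math.exp(b) / (1.0 + math.exp(b))
}
```

With 30 examples of label 0 and 10 of label 1, the intercept is `log(1/3)` and the implied `P(1)` is 0.25, which matches the label distribution as the doc comment intends.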
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369207 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { --- End diff -- ditto
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369141 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients --- End diff -- `vec` is not used.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369078 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { --- End diff -- How can this compile? It should be `optInitialModel.get.coefficients.size != numFeatures`.
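As background to the question above: in Scala 2, `==` and `!=` are defined on `Any`, so comparing a vector to an `Int` generally type-checks (often with only a compiler warning) even though the result is meaningless. A minimal sketch of the size-based check the reviewer asks for follows; `Coefficients` is a hypothetical stand-in for the ml vector type and `sizeMatches` is an illustrative helper, neither taken from the patch.

```scala
// Illustrative size check; `Coefficients` stands in for an ml Vector.
object SizeCheck {
  // Hypothetical stand-in: only `size` matters for this comparison.
  final case class Coefficients(values: Array[Double]) {
    def size: Int = values.length
  }

  // True only when an initial model is present and its coefficient
  // count equals the expected number of features.
  def sizeMatches(initial: Option[Coefficients], numFeatures: Int): Boolean =
    initial.exists(_.size == numFeatures)
}
```

Using `Option.exists` also avoids the repeated `isDefined` plus `get` pattern, though that is a stylistic choice rather than something requested in the review.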
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172979039 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49697/ Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172979036 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172978875 **[Test build #49697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49697/consoleFull)** for PR 10788 at commit [`7501b4b`](https://github.com/apache/spark/commit/7501b4b29d0d08d1363cb1f16be1397887a569b1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172973109 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172973112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49692/ Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172972956 **[Test build #49692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49692/consoleFull)** for PR 10788 at commit [`e1b0389`](https://github.com/apache/spark/commit/e1b038926b7506cfa240883ae177785a24cc9870). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172968912 **[Test build #49697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49697/consoleFull)** for PR 10788 at commit [`7501b4b`](https://github.com/apache/spark/commit/7501b4b29d0d08d1363cb1f16be1397887a569b1).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172968528 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172968530 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49696/ Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172957321 **[Test build #49692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49692/consoleFull)** for PR 10788 at commit [`e1b0389`](https://github.com/apache/spark/commit/e1b038926b7506cfa240883ae177785a24cc9870).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041495

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
       new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
     }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+    run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
+    run(input, initialWeights, userSuppliedWeights = true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
+      LogisticRegressionModel = {
+    // ml's Logisitic regression only supports binary classifcation currently.
+    if (numOfLinearPredictor == 1) {
+      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+        // Prepare the ml LogisticRegression based on our settings
+        val lr = new org.apache.spark.ml.classification.LogisticRegression()
+        lr.setRegParam(optimizer.getRegParam())
+        lr.setElasticNetParam(elasticNetParam)
+        lr.setStandardization(useFeatureScaling)
+        if (userSuppliedWeights) {
+          val initialWeightsWithIntercept = if (addIntercept) {
--- End diff --

This is not used anymore.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041320

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
       new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
     }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
--- End diff --

Remove `starting from the initial weights provided.` and add an extra new line here for readability.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041364

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
       new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
     }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+    run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
--- End diff --

Add an extra new line before `If a known updater is...`
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50041152

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -343,8 +365,8 @@ class LogisticRegression @Since("1.2.0") (
           = math.log(histogram(1) / histogram(0))
     }

-    val states = optimizer.iterations(new CachedDiffFunction(costFun),
-      initialCoefficientsWithIntercept.toBreeze.toDenseVector)
+      val states = optimizer.iterations(new CachedDiffFunction(costFun),
+        initialCoefficientsWithIntercept.toBreeze.toDenseVector)
--- End diff --

Wrong indentation. Remove two spaces.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r50040773

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -322,10 +329,25 @@ class LogisticRegression @Since("1.2.0") (
       new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
     }

+    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
--- End diff --

Is this used?
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172468869 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49586/ Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172468868 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172468731 **[Test build #49586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49586/consoleFull)** for PR 10788 at commit [`43a3a32`](https://github.com/apache/spark/commit/43a3a3246f793d467751f40b4dceba6ccaed394b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172459633 **[Test build #49586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49586/consoleFull)** for PR 10788 at commit [`43a3a32`](https://github.com/apache/spark/commit/43a3a3246f793d467751f40b4dceba6ccaed394b).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172442232 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49571/ Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172442231 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172442212 **[Test build #49571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49571/consoleFull)** for PR 10788 at commit [`67f`](https://github.com/apache/spark/commit/67f2b9d22ddb0e8c8391d5c744b8895e91e4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172437627 **[Test build #49571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49571/consoleFull)** for PR 10788 at commit [`67f`](https://github.com/apache/spark/commit/67f2b9d22ddb0e8c8391d5c744b8895e91e4).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172434715 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49570/ Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172434712 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172434615 **[Test build #49570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49570/consoleFull)** for PR 10788 at commit [`0e2ea49`](https://github.com/apache/spark/commit/0e2ea495ad0020f89df9e70653ff380673d3563e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49964184

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") (
       new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
     }

-    val initialCoefficientsWithIntercept =
-      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
+    val initialCoefficientsWithIntercept = optInitialCoefficients.getOrElse(
+      Vectors.zeros(numFeaturesWithIntercept))
--- End diff --

btw, we may want to log the mismatch case: `if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures)`, let's log it.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963912

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") (
       new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
     }

-    val initialCoefficientsWithIntercept =
-      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
+    val initialCoefficientsWithIntercept = optInitialCoefficients.getOrElse(
+      Vectors.zeros(numFeaturesWithIntercept))
--- End diff --

here,

```scala
val initialCoefficientsWithIntercept =
  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)

if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
  // Warm start: copy the supplied model's coefficients (and intercept, if fitted).
  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    initialCoefficientsWithInterceptArray(index) = value
  }
  if ($(fitIntercept)) {
    initialCoefficientsWithInterceptArray(numFeatures) = optInitialModel.get.intercept
  }
} else if ($(fitIntercept)) {
  /*
     For binary logistic regression, when we initialize the coefficients as zeros,
     it will converge faster if we initialize the intercept such that
     it follows the distribution of the labels.

     {{{
     P(0) = 1 / (1 + \exp(b)), and
     P(1) = \exp(b) / (1 + \exp(b))
     }}}, hence
     {{{
     b = \log{P(1) / P(0)} = \log{count_1 / count_0}
     }}}
   */
  initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(histogram(1) / histogram(0))
}
```
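The initialization logic suggested in this comment can be sketched without Spark as follows. This is only an illustration under stated assumptions: `InitialModel`, `WarmStart`, `initialCoefficients`, and the plain `Array[Double]` representation are hypothetical stand-ins for `LogisticRegressionModel` and MLlib vectors, and `histogram` is assumed to hold the label counts for classes 0 and 1. When a size-compatible initial model is present, its coefficients and intercept are copied; otherwise the intercept falls back to the log-odds of the labels, b = log(count_1 / count_0).

```scala
// Hypothetical stand-in for a fitted model; not Spark's LogisticRegressionModel.
final case class InitialModel(coefficients: Array[Double], intercept: Double)

object WarmStart {
  // Returns the starting point for the optimizer: numFeatures coefficients,
  // followed by one intercept slot when fitIntercept is true.
  def initialCoefficients(
      numFeatures: Int,
      fitIntercept: Boolean,
      optInitialModel: Option[InitialModel],
      histogram: Array[Double]): Array[Double] = {
    val init = Array.fill(if (fitIntercept) numFeatures + 1 else numFeatures)(0.0)
    optInitialModel match {
      case Some(m) if m.coefficients.length == numFeatures =>
        // Warm start: keep both the coefficients and the intercept.
        Array.copy(m.coefficients, 0, init, 0, numFeatures)
        if (fitIntercept) init(numFeatures) = m.intercept
      case other =>
        // A model with the wrong size is ignored, but the mismatch is reported.
        other.foreach { m =>
          Console.err.println(
            s"Ignoring initial model: size ${m.coefficients.length} != $numFeatures")
        }
        // Cold start: intercept is the log-odds of the label distribution.
        if (fitIntercept) init(numFeatures) = math.log(histogram(1) / histogram(0))
    }
    init
  }

  def main(args: Array[String]): Unit = {
    val counts = Array(25.0, 75.0) // 25 negatives, 75 positives
    println(initialCoefficients(2, fitIntercept = true, None, counts).mkString(", "))
    val model = InitialModel(Array(0.5, -0.5), 0.1)
    println(initialCoefficients(2, fitIntercept = true, Some(model), counts).mkString(", "))
  }
}
```

Passing the whole model rather than only its coefficient vector is what lets the warm-start branch restore the intercept as well.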
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172431216 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172431217 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49569/ Test FAILed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963455

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds

-  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
-    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
--- End diff --

Keep the reference to `InitialModel` here.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49963463

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds

-  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
-    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
+
+  /** @group setParam */
+  private[spark] def setInitialModel(model: LogisticRegressionModel): this.type = {
+    this.optInitialCoefficients = Some(model.coefficients)
--- End diff --

You don't want to lose the model's intercept information here.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172430774 **[Test build #49570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49570/consoleFull)** for PR 10788 at commit [`0e2ea49`](https://github.com/apache/spark/commit/0e2ea495ad0020f89df9e70653ff380673d3563e).
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49963357 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} --- End diff -- That makes sense.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49963160 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept) --- End diff -- Ah, then there is no validation step; we just assume that if they set the initial model, they set a valid initial model. Ok :)
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962990 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} --- End diff -- same
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962987 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} --- End diff -- The ML code checks the storage level of the DataFrame, which will never be cached. So we check the user-supplied input instead: if it is not persisted we handle persistence ourselves, and if it is already persisted we don't.
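The persistence rule described in this comment can be sketched without Spark. The `Dataset` class below is a hypothetical stand-in that only tracks a cached flag, and `trainWithPersistence` is an illustrative helper name, not something from the patch. The decision is taken from the user's input, while persist and unpersist are applied to the derived dataset. The `try`/`finally` is a choice made for this sketch; the quoted diff unpersists with a plain `if` after training.

```scala
object PersistenceSketch {
  // Hypothetical stand-in for an RDD or DataFrame; it only tracks a cached flag.
  final class Dataset(var cached: Boolean) {
    def persist(): Unit = { cached = true }
    def unpersist(): Unit = { cached = false }
  }

  // Decide from the user's input, not from the derived dataset (which starts
  // out uncached): cache the derived dataset only if the input was not cached.
  def trainWithPersistence[A](input: Dataset, derived: Dataset)(train: Dataset => A): A = {
    val handlePersistence = !input.cached
    if (handlePersistence) derived.persist()
    try train(derived)
    finally if (handlePersistence) derived.unpersist()
  }
}
```

With an uncached input the derived dataset is cached for the duration of training and released afterwards; with a cached input it is left alone.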
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962964 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) --- End diff -- haha... yes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962944 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept) --- End diff -- You will know the number of features from the size of the coefficients set by `setInitialModel`. There is no ambiguity here since it's binary, and the intercept has a separate variable.
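A small sketch of why the model form is unambiguous, using a hypothetical `BinaryModel` case class and helper names that are not part of Spark: with a model, the feature count is just the number of coefficients, whereas a flat weights vector needs extra knowledge about whether its last entry is an intercept.

```scala
object FeatureCount {
  // Hypothetical stand-in for a binary logistic regression model.
  final case class BinaryModel(coefficients: Vector[Double], intercept: Double)

  // The intercept lives in its own field, so the feature count is
  // simply the number of coefficients.
  def fromModel(model: BinaryModel): Int = model.coefficients.size

  // With a flat vector, the caller must say whether the last entry
  // is an intercept before the feature count can be derived.
  def fromFlatWeights(weights: Vector[Double], hasIntercept: Boolean): Int =
    if (hasIntercept) weights.size - 1 else weights.size
}
```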
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962879 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) --- End diff -- I'm assuming you meant `run(input, input, userSuppliedWeights = true)`
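The style point behind the `userSuppliedWeights = true` suggestion can be shown with a minimal, Spark-free sketch. The `describe` function and its strings are hypothetical and exist only to contrast a positional boolean with a named one; both calls behave identically, but the named form documents the flag at the call site.

```scala
object NamedArgs {
  // Hypothetical helper: reports where the starting weights came from.
  def describe(weights: Vector[Double], userSuppliedWeights: Boolean): String =
    if (userSuppliedWeights) s"user-supplied weights of size ${weights.size}"
    else s"generated weights of size ${weights.size}"

  // Positional boolean: the reader has to look up what `true` means.
  val positional: String = describe(Vector(0.0, 0.0), true)

  // Named boolean: same call, but the intent is visible at the call site.
  val named: String = describe(Vector(0.0, 0.0), userSuppliedWeights = true)
}
```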
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962737 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept) --- End diff -- I don't think we know the number of features at that point.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962716 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} --- End diff -- Why do we need to do this? I thought this check was already in the ML code.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962722 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} --- End diff -- ditto?
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172427339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49568/ Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962541 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. +if (numOfLinearPredictor == 1 && useFeatureScaling) { --- End diff -- You can remove `useFeatureScaling`, and pass it as setStandardization in ML implementation. 
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172427338 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172427291 **[Test build #49568 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49568/consoleFull)** for PR 10788 at commit [`4caab8c`](https://github.com/apache/spark/commit/4caab8ca2ac23f24fe84cf741ab2c013e319752d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962512 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD --- End diff -- Replace `algorithm` by `Logistic Regression`, and add a new line.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962448

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    --- End diff --

    `run(input, generateInitialWeights(input), userSuppliedWeights = true)`
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962416

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    --- End diff --

    `run(input, generateInitialWeights(input), userSuppliedWeights = false)`
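A minimal sketch of the named-argument suggestion in the two comments above, written without Spark so it can run on its own. `Seq[Double]` and `Vector[Double]` are stand-ins for `RDD[LabeledPoint]` and the MLlib `Vector`, and the method bodies are hypothetical placeholders that only report which path was taken. In this sketch the two-argument overload forwards the caller's `initialWeights`, as in the quoted diff.

```scala
object NamedFlagSketch {
  // Hypothetical stand-in for the private three-argument overload in the diff.
  // It only reports whether the weights came from the caller or were generated.
  private def run(
      input: Seq[Double],
      initialWeights: Vector[Double],
      userSuppliedWeights: Boolean): String =
    if (userSuppliedWeights) s"user:${initialWeights.size}"
    else s"generated:${initialWeights.size}"

  // Hypothetical default: one zero weight per input element.
  private def generateInitialWeights(input: Seq[Double]): Vector[Double] =
    Vector.fill(input.size)(0.0)

  // Passing the flag by name makes the call site self-describing.
  def run(input: Seq[Double]): String =
    run(input, generateInitialWeights(input), userSuppliedWeights = false)

  def run(input: Seq[Double], initialWeights: Vector[Double]): String =
    run(input, initialWeights, userSuppliedWeights = true)
}
```

A bare `true` or `false` at the call site gives the reader no hint about what is being switched; the named form costs nothing at runtime and survives later reordering of parameters.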
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962408

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    --- End diff --

    Replace `algorithm` with `Logistic Regression`, and remove `starting from the initial weights provided`. Add a new line between `of LabeledPoint entries` and `If a known updater is used`.

    Actually, the ml version now supports disabling feature scaling, so please call the ml implementation in this case.
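The dispatch rule under discussion can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in the pull request: the updater objects below are local stand-ins named after MLlib's updaters, the assumption that the "known" updaters are the L2 and L1 ones is mine, and `Implementation`, `UseMl`, `UseMllib` and `choose` are hypothetical names. It follows the review comment in that feature scaling plays no part in the decision. The mapping of L2 to `elasticNetParam = 0.0` and L1 to `1.0` matches how the ml elastic-net parameter is defined.

```scala
object DispatchSketch {
  // Local stand-ins for the updaters; not the real MLlib classes.
  sealed trait Updater
  case object SquaredL2Updater extends Updater
  case object L1Updater extends Updater
  case object OtherUpdater extends Updater

  // Hypothetical result type describing which implementation would run.
  sealed trait Implementation
  final case class UseMl(elasticNetParam: Double) extends Implementation
  case object UseMllib extends Implementation

  // Binary problems with a known updater go to the ml implementation;
  // everything else stays on the mllib implementation.
  def choose(numOfLinearPredictor: Int, updater: Updater): Implementation =
    if (numOfLinearPredictor != 1) UseMllib
    else updater match {
      case SquaredL2Updater => UseMl(0.0) // pure L2 penalty
      case L1Updater        => UseMl(1.0) // pure L1 penalty
      case OtherUpdater     => UseMllib
    }
}
```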
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49962157

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
             new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
         }

    -    val initialCoefficientsWithIntercept =
    -      Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
    +    val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures
    +    val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept)
    --- End diff --

    Let's handle it through `setInitialModel`, and have another PR to make it public. Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961964

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds

    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialCoefficients = Some(value)
    +    this
    +  }
    +
    +  /**
    +   * Validate the initial weights, return an Option, if not the expected size return None
    +   * and log a warning.
    +   */
    +  private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = {
    +    vectorOpt.flatMap(vec =>
    +      if (vec.size == numFeatures) {
    +        Some(vec)
    +      } else {
    +        logWarning(
    +          s"""Initial weights provided (${vec})did not match the expected size ${numFeatures}""")
    +        None
    +      })
    +  }
    +
    +  override protected[spark] def train(dataset: DataFrame): LogisticRegressionModel = {
         val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
    +    val instances = dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
    --- End diff --

    Why was this line changed?
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961906

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds

    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
    +  private var optInitialWeights: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialWeights = Some(value)
    +    this
    +  }
    --- End diff --

    So we have `setInitialWeights` on `StreamingLogisticRegressionWithSGD`; would it be better to have it match `StreamingLogisticRegressionWithSGD`?
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961903

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds

    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialCoefficients = Some(value)
    +    this
    +  }
    +
    +  /**
    +   * Validate the initial weights, return an Option, if not the expected size return None
    +   * and log a warning.
    +   */
    +  private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = {
    +    vectorOpt.flatMap(vec =>
    +      if (vec.size == numFeatures) {
    +        Some(vec)
    +      } else {
    +        logWarning(
    +          s"""Initial weights provided (${vec})did not match the expected size ${numFeatures}""")
    --- End diff --

    BTW, why `s"""`? Also, change `weights` to `coefficients`.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961912

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds

    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
    +  private var optInitialCoefficients: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialCoefficients = Some(value)
    +    this
    +  }
    +
    +  /**
    +   * Validate the initial weights, return an Option, if not the expected size return None
    +   * and log a warning.
    +   */
    +  private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = {
    --- End diff --

    validateCoefficients
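Taken together, the comments on the validation helper (rename to `validateCoefficients`, say "coefficients" in the message, and use a plain interpolated string) could look roughly like the sketch below. It is a standalone approximation: `Vector[Double]` stands in for the MLlib `Vector`, and the `warn` parameter is a hypothetical substitute for Spark's `logWarning` so the snippet runs without Spark.

```scala
object ValidateSketch {
  // Returns the coefficients only when their size matches; otherwise warns
  // and returns None so the caller can fall back to default initial values.
  def validateCoefficients(
      vectorOpt: Option[Vector[Double]],
      expectedSize: Int,
      warn: String => Unit = (msg: String) => Console.err.println(msg)
  ): Option[Vector[Double]] =
    vectorOpt.flatMap { vec =>
      if (vec.size == expectedSize) {
        Some(vec)
      } else {
        // A single-quoted interpolated string is enough here: the message
        // contains no embedded quotes or line breaks.
        warn(s"Initial coefficients provided ($vec) did not match the expected size $expectedSize")
        None
      }
    }
}
```

Using `flatMap` keeps the three cases (nothing supplied, supplied and valid, supplied and invalid) in one expression without a nested `match`.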
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961836

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
    @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
           new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)
         }
       }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * If using ml implementation, uses ml code to generate initial weights.
    +   */
    +  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    +    run(input, generateInitialWeights(input), false)
    +  }
    +
    +  /**
    +   * Run the algorithm with the configured parameters on an input RDD
    +   * of LabeledPoint entries starting from the initial weights provided.
    +   * If a known updater is used calls the ml implementation, to avoid
    +   * applying a regularization penalty to the intercept, otherwise
    +   * defaults to the mllib implementation. If more than two classes
    +   * or feature scaling is disabled, always uses mllib implementation.
    +   * Uses user provided weights.
    +   */
    +  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    +    run(input, initialWeights, true)
    +  }
    +
    +  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
    +      LogisticRegressionModel = {
    +    // ml's Logisitic regression only supports binary classifcation currently.
    +    if (numOfLinearPredictor == 1 && useFeatureScaling) {
    +      def runWithMlLogisitcRegression(elasticNetParam: Double) = {
    +        // Prepare the ml LogisticRegression based on our settings
    +        val lr = new org.apache.spark.ml.classification.LogisticRegression()
    +        lr.setRegParam(optimizer.getRegParam())
    +        lr.setElasticNetParam(elasticNetParam)
    +        if (userSuppliedWeights) {
    +          val initialWeightsWithIntercept = if (addIntercept) {
    +            appendBias(initialWeights)
    +          } else {
    +            initialWeights
    +          }
    +          lr.setInitialWeights(initialWeightsWithIntercept)
    --- End diff --

    Here it will be:
    ```scala
    lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(uid, initialWeights, 1))
    ```
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10788#discussion_r49961714

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
       @Since("1.5.0")
       override def getThresholds: Array[Double] = super.getThresholds

    -  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    -    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
    +  private var optInitialWeights: Option[Vector] = None
    +  /** @group setParam */
    +  private[spark] def setInitialWeights(value: Vector): this.type = {
    +    this.optInitialWeights = Some(value)
    +    this
    +  }
    --- End diff --

    How about we follow https://github.com/apache/spark/pull/8972 and use the following code? We can create a separate JIRA for making `setInitialModel` public with a sharedParam.

    ```scala
    private var initialModel: Option[LogisticRegressionModel] = None
    private def setInitialModel(model: LogisticRegressionModel): this.type = {
      ...
      ...
      this
    }
    ```
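For illustration only, the `Option`-backed fluent setter proposed above can be exercised in a standalone form. Everything here is hypothetical: `Model` and `Estimator` are toy stand-ins for `LogisticRegressionModel` and `LogisticRegression`, the setter is left public so it can be called from outside, and the setter body and `startingPoint` are assumptions about how an initial model might be consumed, not what the elided `...` in the comment contains.

```scala
object InitialModelSketch {
  // Toy stand-in for a fitted model: coefficients plus an intercept.
  final case class Model(coefficients: Vector[Double], intercept: Double)

  class Estimator {
    // No initial model unless the caller provides one.
    private var initialModel: Option[Model] = None

    // Returning this.type keeps setter calls chainable, as in the ml API style.
    def setInitialModel(model: Model): this.type = {
      initialModel = Some(model)
      this
    }

    // Hypothetical consumer: use the supplied model when its size matches,
    // otherwise start from zeros (coefficients followed by the intercept).
    def startingPoint(numFeatures: Int): Vector[Double] =
      initialModel match {
        case Some(m) if m.coefficients.size == numFeatures =>
          m.coefficients :+ m.intercept
        case _ =>
          Vector.fill(numFeatures + 1)(0.0)
      }
  }
}
```

Carrying a whole model rather than a bare weight vector keeps coefficients and intercept together, so the caller does not need to know how the optimizer lays them out internally.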
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-172421474 **[Test build #49568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49568/consoleFull)** for PR 10788 at commit [`4caab8c`](https://github.com/apache/spark/commit/4caab8ca2ac23f24fe84cf741ab2c013e319752d).