[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172811882
  
@dbtsai validating coefficients with R will be harder than expected: 
`glmnet` requires a feature dimension >= 2, and `glm` doesn't yield +/- Infinity 
intercepts when given all-0 or all-1 datasets...
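
For context on the +/- Infinity intercepts mentioned above, here is a minimal sketch (not code from this PR; `DivergingIntercept`, `lossAllOnes`, `gradAllOnes`, and `descend` are hypothetical names) of an intercept-only logistic model fit by plain gradient descent when every label is 1. Under that assumption the loss is log(1 + exp(-b)), which keeps decreasing as b grows, so the fitted intercept never settles on a finite value:

```scala
object DivergingIntercept {
  // Negative log-likelihood per example of an intercept-only logistic model
  // when every label equals 1: loss(b) = log(1 + exp(-b)).
  def lossAllOnes(b: Double): Double = math.log1p(math.exp(-b))

  // Derivative of the loss above with respect to b: -1 / (1 + exp(b)).
  // It is negative for every finite b, so there is no finite minimizer.
  def gradAllOnes(b: Double): Double = -1.0 / (1.0 + math.exp(b))

  // Plain gradient descent on the intercept, starting from 0.
  def descend(steps: Int, stepSize: Double): Double = {
    var b = 0.0
    for (_ <- 0 until steps) b -= stepSize * gradAllOnes(b)
    b
  }

  def main(args: Array[String]): Unit = {
    // The intercept keeps growing and the loss keeps shrinking with more steps.
    Seq(10, 100, 1000, 10000).foreach { n =>
      val b = descend(n, 1.0)
      println(f"steps=$n%5d intercept=$b%.4f loss=${lossAllOnes(b)}%.6f")
    }
  }
}
```

An iterative solver stopped after a fixed number of iterations would report some large finite intercept rather than the limiting value, which is one plausible reason a reference implementation would not return +/- Infinity.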


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172816195
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49684/
Test PASSed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172815986
  
**[Test build #49684 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49684/consoleFull)**
 for PR 10743 at commit 
[`95816d4`](https://github.com/apache/spark/commit/95816d4c158d7e11cab8ac7bf28b9bb84026d33a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172816192
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172802711
  
**[Test build #49684 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49684/consoleFull)**
 for PR 10743 at commit 
[`95816d4`](https://github.com/apache/spark/commit/95816d4c158d7e11cab8ac7bf28b9bb84026d33a).





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10743





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172954547
  
Merged into master. Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172965625
  
A follow-up JIRA has been created: https://issues.apache.org/jira/browse/SPARK-12908





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172954097
  
Ideally, we would like to see the test case without intercept to ensure that 
nothing is messed up during the refactoring. I'll merge now, and let's 
add the test for all-same labels without intercept. Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50006902
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each component
-// differently to get effectively the same objective function when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the data
+// to improve the rate of convergence; as a result, we have to
+// perform this reverse standardization by penalizing each component
+// differently to get effectively the same objective function when
+

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50006858
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
--- End diff --

OK
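
The short-circuit in the diff above can be sketched in isolation. This is an illustrative sketch, not the PR's code: `DegenerateLabels`, `shortcut`, and the `Fit` alias are hypothetical names, and plain arrays stand in for Spark's `Vectors.sparse`. It assumes, as the diff does, that `histogram(i)` holds the count of label `i`:

```scala
object DegenerateLabels {
  // Hypothetical result type: (coefficients, intercept). The PR also returns
  // an objective history, which is omitted here.
  type Fit = (Array[Double], Double)

  // Returns Some(fit) when training can be skipped because every label is the
  // same and an intercept is fitted; returns None when the optimizer must run.
  def shortcut(histogram: Array[Double], numFeatures: Int, fitIntercept: Boolean): Option[Fit] = {
    val zeros = Array.fill(numFeatures)(0.0)
    if (fitIntercept && histogram.length == 2 && histogram(0) == 0.0) {
      // Two classes were seen, but label 0 has a zero count: every label is 1.
      Some((zeros, Double.PositiveInfinity))
    } else if (fitIntercept && histogram.length == 1) {
      // Only one class was seen, so every label is 0.
      Some((zeros, Double.NegativeInfinity))
    } else {
      None
    }
  }
}
```

For example, `DegenerateLabels.shortcut(Array(0.0, 5.0), 3, fitIntercept = true)` yields zero coefficients with a positive-infinity intercept, while a histogram with both labels present, or `fitIntercept = false`, yields `None`.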





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50007441
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val allZeroModel = lr
+  .setLabelCol("zeroLabel")
+  .fit(sameLabels)
+assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+val allOneModel = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

OK





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172556919
  
@dbtsai Added `fitIntercept=false` tests and fixed comments/`logWarning` 
messages.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172561357
  
**[Test build #49597 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49597/consoleFull)**
 for PR 10743 at commit 
[`0f4824d`](https://github.com/apache/spark/commit/0f4824d9f358f451968aa2a6ab2b31afd85cda64).





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172561751
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172561755
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49597/
Test FAILed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172561744
  
**[Test build #49597 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49597/consoleFull)**
 for PR 10743 at commit 
[`0f4824d`](https://github.com/apache/spark/commit/0f4824d9f358f451968aa2a6ab2b31afd85cda64).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172594732
  
**[Test build #49600 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49600/consoleFull)**
 for PR 10743 at commit 
[`c8c1586`](https://github.com/apache/spark/commit/c8c1586ffd5e5076c6aed564b41d99314099a5a0).





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172606483
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49600/
Test PASSed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172606348
  
**[Test build #49600 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49600/consoleFull)**
 for PR 10743 at commit 
[`c8c1586`](https://github.com/apache/spark/commit/c8c1586ffd5e5076c6aed564b41d99314099a5a0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172606481
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50032693
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each component
-// differently to get effectively the same objective function when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the data
+// to improve the rate of convergence; as a result, we have to
+// perform this reverse standardization by penalizing each component
+// differently to get effectively the same objective function when
+// the 

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50032721
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each component
-// differently to get effectively the same objective function when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the data
+// to improve the rate of convergence; as a result, we have to
+// perform this reverse standardization by penalizing each component
+// differently to get effectively the same objective function when
+// the 

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50041066
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
have to
+// perform this reverse standardization by penalizing each 
component
+// differently to get effectively the same objective 
function when
+

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50041068
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are zero and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
have to
+// perform this reverse standardization by penalizing each 
component
+// differently to get effectively the same objective 
function when
+

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955619
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
--- End diff --

Add an extra line.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172401612
  
LGTM except for a couple of minor issues. Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955940
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val allZeroModel = lr
+  .setLabelCol("zeroLabel")
+  .fit(sameLabels)
+assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+val allOneModel = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

Also, check that `objectiveHistory` has length zero.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955755
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses - 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
--- End diff --

"All labels are zero"





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955925
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
have to
+// perform this reverse standardization by penalizing each 
component
+// differently to get effectively the same objective 
function when
+// the 

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955604
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val allZeroModel = lr
+  .setLabelCol("zeroLabel")
+  .fit(sameLabels)
+assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+val allOneModel = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

Can you add a test here where all labels are the same but `fitIntercept=false`, 
to avoid the issue seen in LiR (LinearRegression)? Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49946338
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val model = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+
+assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(model.intercept === Double.PositiveInfinity)
--- End diff --

Thanks for pointing that out!





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49946694
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") (
  b = \log{P(1) / P(0)} = \log{count_1 / count_0}
  }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures)
-= math.log(histogram(1) / histogram(0))
-}
+   if (histogram.length >= 2) { // check to make sure indexing into histogram(1) is safe
+ initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
+   histogram(1) / histogram(0))
--- End diff --

Thank you!





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172328753
  
@dbtsai @jkbradley ready for second review.

The big diff is because I grouped the same-label cases with the normal case 
to generate `coefficients`, `intercept`, and `objectiveHistory` all in the same 
block. This avoids repeating code when generating the model summary.
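
As a rough illustration of that grouping, the sketch below returns all three values from one expression, so the degenerate cases and the trained case share whatever code follows. It is a simplified stand-in, not the code in the PR: `SameLabelShortCircuitSketch`, `fitOrShortCircuit`, and the `train` parameter are hypothetical names, and plain arrays replace Spark's `Vector`.

```scala
object SameLabelShortCircuitSketch {
  // (coefficients, intercept, objectiveHistory); arrays stand in for Spark's Vector.
  type Fit = (Array[Double], Double, Array[Double])

  // `train` is passed by name, so it is only evaluated when training is actually needed.
  def fitOrShortCircuit(
      histogram: Array[Double],
      numFeatures: Int,
      fitIntercept: Boolean)(train: => Fit): Fit = {
    if (fitIntercept && histogram.length == 2 && histogram(0) == 0.0) {
      // All labels are one: zero coefficients, +Infinity intercept, no iterations.
      (Array.fill(numFeatures)(0.0), Double.PositiveInfinity, Array.empty[Double])
    } else if (fitIntercept && histogram.length == 1) {
      // All labels are zero: zero coefficients, -Infinity intercept, no iterations.
      (Array.fill(numFeatures)(0.0), Double.NegativeInfinity, Array.empty[Double])
    } else {
      train
    }
  }
}
```

Because the block always yields the same tuple shape, the caller can destructure it once with `val (coefficients, intercept, objectiveHistory) = ...` and build the model and summary in a single place.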





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172329504
  
**[Test build #49554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49554/consoleFull)** for PR 10743 at commit [`d676f62`](https://github.com/apache/spark/commit/d676f6245622cedf61df3875ca0873d77f72a857).





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172334066
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49554/





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172333931
  
**[Test build #49554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49554/consoleFull)** for PR 10743 at commit [`d676f62`](https://github.com/apache/spark/commit/d676f6245622cedf61df3875ca0873d77f72a857).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172334065
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-171409858
  
**[Test build #49325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49325/consoleFull)** for PR 10743 at commit [`caf7a1b`](https://github.com/apache/spark/commit/caf7a1b2cd4336134d1f29e0fb4432a67d44288e).





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-171421491
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49325/





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread feynmanliang
GitHub user feynmanliang opened a pull request:

https://github.com/apache/spark/pull/10743

[SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/feynmanliang/spark SPARK-12804

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10743.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10743


commit fbf6b5cab51544c9230567e9479528a9bd8960c5
Author: Feynman Liang 
Date:   2016-01-13T17:52:56Z

Initial fix and println unit test

commit e4c13d4a89abc8160f1c2fa906cb3e3d1affd473
Author: Feynman Liang 
Date:   2016-01-13T19:23:49Z

Cleans up test







[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49652270
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val model = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+
+assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(model.intercept === Double.PositiveInfinity)
--- End diff --

BTW, this bug should not happen when all the labels are one, since the 
histogram should still have size two. 
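
To illustrate why the two cases differ, here is a minimal sketch that assumes the histogram is indexed by label value and sized to the largest label seen plus one. `LabelHistogramSketch` and `labelHistogram` are hypothetical names for illustration, not Spark's summarizer.

```scala
object LabelHistogramSketch {
  // Assumption: bucket i counts the instances whose label equals i,
  // and the array length is (largest label seen) + 1.
  def labelHistogram(labels: Seq[Double]): Array[Double] = {
    require(labels.nonEmpty, "need at least one label")
    val numClasses = labels.max.toInt + 1
    val histogram = Array.fill(numClasses)(0.0)
    labels.foreach { label => histogram(label.toInt) += 1.0 }
    histogram
  }

  // All labels one: two buckets, the first of which is empty.
  val allOnes: Array[Double] = labelHistogram(Seq(1.0, 1.0, 1.0))

  // All labels zero: a single bucket, so indexing histogram(1) would fail.
  val allZeros: Array[Double] = labelHistogram(Seq(0.0, 0.0, 0.0))
}
```

Under that assumption, only the all-zero case produces a histogram of length one, which is the case where indexing `histogram(1)` is unsafe.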





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49649843
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") (
  b = \log{P(1) / P(0)} = \log{count_1 / count_0}
  }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures)
-= math.log(histogram(1) / histogram(0))
-}
+   if (histogram.length >= 2) { // check to make sure indexing into histogram(1) is safe
+ initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
+   histogram(1) / histogram(0))
--- End diff --

In this case, the whole training step can be skipped. We currently support 
only binary LoR, so `histogram.length` is at most two. In LiR, when 
`yStd == 0.0`, the model is returned immediately without training; see 
https://github.com/feynmanliang/spark/blob/SPARK-12804/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L226

We can do something similar here:

```scala
if (histogram.length == 2) {
  if (histogram(0) == 0.0) {
    // Every label is 1: return the model without training.
    return new LogisticRegressionModel(
      uid, Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity)
  } else {
    initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
      histogram(1) / histogram(0))
  }
} else if (histogram.length == 1) {
  // Every label is 0: return the model without training.
  return new LogisticRegressionModel(
    uid, Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity)
} else {
  // Placeholder for "some exception"; the exact exception type is an open choice.
  throw new IllegalStateException(s"Unexpected histogram length: ${histogram.length}")
}
```
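
The infinite intercepts follow from the formula in the doc comment, `b = \log{count_1 / count_0}`. The sketch below is plain Scala with no Spark dependency and shows the arithmetic for each case. `priorIntercept` is a hypothetical name used only here, and the `Array[Double]` histogram type is an assumption.

```scala
// Hypothetical helper: the intercept implied by the label counts alone.
def priorIntercept(histogram: Array[Double]): Double = histogram.length match {
  // Only label 0 was seen, so count_1 is implicitly zero.
  case 1 => Double.NegativeInfinity
  // x / 0.0 is Infinity for x > 0 in IEEE 754 arithmetic, and log(Infinity) is Infinity.
  case 2 => math.log(histogram(1) / histogram(0))
  case n => throw new IllegalArgumentException(s"expected 1 or 2 classes, got $n")
}

val allOnesIntercept = priorIntercept(Array(0.0, 3.0))  // count_0 == 0
val allZerosIntercept = priorIntercept(Array(3.0))      // no entry for label 1
val balancedIntercept = priorIntercept(Array(2.0, 2.0)) // log(1.0)
```

Because the all-ones case already evaluates to positive infinity through the ordinary formula, the explicit `histogram(0) == 0.0` branch in the suggestion above mainly serves to skip training rather than to change the value.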





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-171421277
  
**[Test build #49325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49325/consoleFull)** for PR 10743 at commit [`caf7a1b`](https://github.com/apache/spark/commit/caf7a1b2cd4336134d1f29e0fb4432a67d44288e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-171421490
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49650452
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val model = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+
+assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(model.intercept === Double.PositiveInfinity)
--- End diff --

Can you add another test showing that fitting with `zeroLabel` as the label 
column returns an intercept of `Double.NegativeInfinity`?

Thanks.

