[GitHub] spark pull request #18896: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18896


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18896: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/18896#discussion_r134076580
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1392,6 +1415,61 @@ class LogisticRegressionSuite
 assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps)
   }
 
+  test("multinomial logistic regression with zero variance (SPARK-21681)") {
+    val sqlContext = multinomialDatasetWithZeroVar.sqlContext
+    import sqlContext.implicits._
+    val mlr = new LogisticRegression().setFamily("multinomial").setFitIntercept(true)
+      .setElasticNetParam(0.0).setRegParam(0.0).setStandardization(true).setWeightCol("weight")
+
+    val model = mlr.fit(multinomialDatasetWithZeroVar)
+
+    /*
+     Use the following R code to load the data and train the model using glmnet package.
+
+     library("glmnet")
+     data <- read.csv("path", header=FALSE)
+     label = as.factor(data$V1)
+     w = data$V2
+     features = as.matrix(data.frame(data$V3, data$V4))
+     coefficients = coef(glmnet(features, label, weights=w, family="multinomial",
+     alpha = 0, lambda = 0))
+     coefficients
+     $`0`
+     3 x 1 sparse Matrix of class "dgCMatrix"
+                    s0
+             0.2658824
+     data.V3 0.1881871
+     data.V4 .
+
+     $`1`
+     3 x 1 sparse Matrix of class "dgCMatrix"
+                      s0
+              0.53604701
+     data.V3 -0.02412645
+     data.V4  .
+
+     $`2`
+     3 x 1 sparse Matrix of class "dgCMatrix"
+                     s0
+             -0.8019294
+     data.V3 -0.1640607
+     data.V4  .
+    */
+
+    val coefficientsR = new DenseMatrix(3, 2, Array(
+      0.1881871, -0.0,
--- End diff --

Why `-0.0`?
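
For context on the question, here is a minimal sketch in plain Scala (no Spark; the object name `NegativeZeroDemo` is illustrative and not part of the pull request) of how IEEE 754 negative zero behaves on the JVM. It compares equal to positive zero as a primitive, but boxed equality and division can tell the two apart. Whether this matters for the tolerance check used in the suite is not shown in this thread.

```scala
object NegativeZeroDemo {
  val negZero: Double = -0.0
  val posZero: Double = 0.0

  // Primitive comparison follows IEEE 754: the two zeros are equal.
  val primitiveEqual: Boolean = negZero == posZero

  // Boxed comparison distinguishes them, because java.lang.Double.equals
  // compares the underlying bit patterns.
  val boxedEqual: Boolean =
    java.lang.Double.valueOf(negZero).equals(java.lang.Double.valueOf(posZero))

  // The sign survives division: 1.0 / -0.0 is negative infinity.
  val reciprocal: Double = 1.0 / negZero

  def main(args: Array[String]): Unit = {
    println(s"primitiveEqual=$primitiveEqual boxedEqual=$boxedEqual reciprocal=$reciprocal")
  }
}
```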





[GitHub] spark pull request #18896: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/18896#discussion_r134076552
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
 ---
@@ -238,8 +238,17 @@ class LogisticAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
     val aggConstantFeature = getNewAggregator(instancesConstantFeature,
       Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true)
     instances.foreach(aggConstantFeature.add)
+
     // constant features should not affect gradient
-    assert(aggConstantFeature.gradient(0) === 0.0)
+    def validateGradient(grad: Vector): Unit = {
+      assert(grad(0) === 0.0)
+      grad.toArray.foreach { gradientValue =>
--- End diff --

The problem with this test was that it checked that part of the gradient
was zero, but didn't check that the rest of the gradient was correct. Here,
you're checking that the rest of the gradient isn't NaN or infinite, but not
that it's actually correct. A more appropriate test, IMO, is to also run an
aggregator over the same instances with the constant feature filtered out, then
check that the portions of the gradient they share are the same, e.g.:

scala
val aggConstantFeature = getNewAggregator(instancesConstantFeature,
  Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true)
val filteredInstances = instancesConstantFeature.map { case Instance(l, w, f) =>
  Instance(l, w, Vectors.dense(f.toArray.tail))
}
val aggMultinomial = getNewAggregator(filteredInstances,
  Vectors.dense(coefArray.slice(3, 6) ++ interceptArray), fitIntercept = true,
  isMultinomial = true)
filteredInstances.foreach(aggMultinomial.add)
instancesConstantFeature.foreach(aggConstantFeature.add)

// constant features should not affect gradient
assert(aggConstantFeature.gradient.toArray.take(numClasses) === Array.fill(numClasses)(0.0))
assert(aggMultinomial.gradient.toArray === aggConstantFeature.gradient.toArray.slice(3, 9))


Note that this code is only an illustration; it is not meant to be copied
and pasted as-is.
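
The idea can also be tried without Spark. The following is a rough, self-contained sketch under stated assumptions: `Inst`, `columnStd`, and `gradient` are hypothetical stand-ins written for this illustration (they are not Spark's `Instance` or `LogisticAggregator`), the data and coefficient values are made up, and the coefficient layout `coef(j * numClasses + k)` is assumed to match the slicing used in the snippet above. It computes a weighted softmax gradient on standardized features, once with a constant first feature and once with that feature removed, so the shared portion of the two gradients can be compared.

```scala
object ConstantFeatureGradientSketch {
  // Hypothetical stand-in for an instance with (label, weight, features).
  final case class Inst(label: Int, weight: Double, features: Array[Double])

  /** Population standard deviation of each feature column. */
  def columnStd(data: Seq[Inst]): Array[Double] = {
    val n = data.size.toDouble
    val dim = data.head.features.length
    Array.tabulate(dim) { j =>
      val col = data.map(_.features(j))
      val mean = col.sum / n
      math.sqrt(col.map(v => (v - mean) * (v - mean)).sum / n)
    }
  }

  /**
   * Weighted softmax-loss gradient on standardized features.
   * Layout: coef(j * numClasses + k) for feature j and class k,
   * followed by numClasses intercepts. Zero-std features are skipped.
   */
  def gradient(data: Seq[Inst], coef: Array[Double], numClasses: Int): Array[Double] = {
    val std = columnStd(data)
    val dim = std.length
    val grad = Array.fill(coef.length)(0.0)
    data.foreach { inst =>
      val scaled =
        Array.tabulate(dim)(j => if (std(j) != 0.0) inst.features(j) / std(j) else 0.0)
      val margins = Array.tabulate(numClasses) { k =>
        coef(dim * numClasses + k) +
          (0 until dim).map(j => coef(j * numClasses + k) * scaled(j)).sum
      }
      // Subtract the max margin before exponentiating for numerical stability.
      val maxMargin = margins.max
      val exps = margins.map(m => math.exp(m - maxMargin))
      val total = exps.sum
      for (k <- 0 until numClasses) {
        val indicator = if (k == inst.label) 1.0 else 0.0
        val multiplier = inst.weight * (exps(k) / total - indicator)
        for (j <- 0 until dim) grad(j * numClasses + k) += multiplier * scaled(j)
        grad(dim * numClasses + k) += multiplier
      }
    }
    grad
  }

  val numClasses = 3

  // Made-up data: the first feature is constant (zero variance).
  val withConstant: Seq[Inst] = Seq(
    Inst(0, 0.1, Array(1.0, 2.0)),
    Inst(1, 0.5, Array(1.0, 1.0)),
    Inst(2, 0.3, Array(1.0, 3.0)))
  val filtered: Seq[Inst] = withConstant.map(i => i.copy(features = i.features.tail))

  val coefArray: Array[Double] = Array(1.0, 2.0, -2.0, 3.0, 0.0, -1.0)
  val interceptArray: Array[Double] = Array(4.0, 2.0, -3.0)

  val gradConstant: Array[Double] =
    gradient(withConstant, coefArray ++ interceptArray, numClasses)
  val gradFiltered: Array[Double] =
    gradient(filtered, coefArray.slice(3, 6) ++ interceptArray, numClasses)
}
```

In this sketch the first `numClasses` entries of `gradConstant` belong to the constant feature and should be zero, while `gradConstant.slice(3, 9)` should agree with `gradFiltered` up to floating-point tolerance.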





[GitHub] spark pull request #18896: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-15 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18896#discussion_r133301688
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1392,6 +1415,61 @@ class LogisticRegressionSuite
 assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps)
   }
 
+  test("test SPARK-21681") {
--- End diff --

I would include a description of the test in addition to the ticket #.





[GitHub] spark pull request #18896: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-09 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/18896

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd 
contains zero   

## What changes were proposed in this pull request?

Fix a bug where multinomial logistic regression (MLOR) does not work correctly when featureStd contains zero.
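
The description does not say how the bug manifests, so the following is only a general illustration of why a zero entry in a feature standard-deviation vector needs special handling. It is not the actual Spark code path. The names `ZeroStdScalingSketch`, `scaleNaive`, and `scaleGuarded` are hypothetical.

```scala
object ZeroStdScalingSketch {
  /** Naive standardization: divides by the standard deviation unconditionally. */
  def scaleNaive(value: Double, std: Double): Double = value / std

  /** Guarded standardization: a zero-variance feature contributes nothing. */
  def scaleGuarded(value: Double, std: Double): Double =
    if (std != 0.0) value / std else 0.0

  // Dividing a non-zero value by a zero std yields infinity.
  val naiveNonZero: Double = scaleNaive(1.0, 0.0)

  // Dividing zero by a zero std yields NaN.
  val naiveZero: Double = scaleNaive(0.0, 0.0)

  // The guarded version returns a finite value instead.
  val guarded: Double = scaleGuarded(1.0, 0.0)
}
```

Either non-finite value would propagate through margins and gradients, which is why a guard of this general kind is a common choice for constant features.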

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_mlor_stdvalue_zero_bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18896.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18896


commit c415ddeb1182f8243e5330d294665079c21a8a19
Author: WeichenXu 
Date:   2017-08-09T17:17:43Z

init pr



