[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16740#discussion_r99269006
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -743,6 +743,55 @@ class GeneralizedLinearRegressionSuite
 }
   }
 
+  test("generalized linear regression: intercept only") {
+/*
+  R code:
+  y <- c(17, 19, 23, 29)
+  w <- c(1, 2, 3, 4)
+  model1 <- glm(y ~ 1, family = poisson)
+  model2 <- glm(y ~ 1, family = poisson, weights = w)
+  as.vector(c(coef(model1), coef(model2)))
+  [1] 3.091042 3.178054
+ */
+
+val dataset = Seq(
+  Instance(17.0, 1.0, Vectors.zeros(0)),
+  Instance(19.0, 2.0, Vectors.zeros(0)),
+  Instance(23.0, 3.0, Vectors.zeros(0)),
+  Instance(29.0, 4.0, Vectors.zeros(0))
+).toDF()
+
+val expected = Seq(3.091, 3.178)
+
+import GeneralizedLinearRegression._
+
+var idx = 0
+for (useWeight <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression().setFamily("poisson")
+.setLinkPredictionCol("linkPrediction")
+  if (useWeight) trainer.setWeightCol("weight")
+  val model = trainer.fit(dataset)
+  val actual = model.intercept
+  assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: 
intercept only GLM with " +
+s"useWeight = $useWeight.")
+  assert(model.coefficients === new DenseVector(Array.empty[Double]))
+
+  val familyLink = FamilyAndLink(trainer)
+  model.transform(dataset).select("features", "prediction", 
"linkPrediction").collect()
+.foreach {
+  case Row(features: DenseVector, prediction1: Double, 
linkPrediction1: Double) =>
+val eta = BLAS.dot(features, model.coefficients) + 
model.intercept
+val prediction2 = familyLink.fitted(eta)
--- End diff --

@sethah Agree. Removed this. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@srowen would you please take a look and merge this if all is good? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@sethah Your formula for offset does not seem to be a general solution, and 
I'm not sure if there exists an analytical formula, in particular when the link 
function is not identity or log. In GLM, the normal equation for the 
coefficient is: (y - mu) * w = 0. When there is offset, mu = link.inv(intcpt + 
offset). Take the case where link.inv is inverse as an example, then we have (y 
- inverse(intcpt + offset)) * w = 0. This is nonlinear and has to be solved 
iteratively, right? The following is a simple R example to show that the 
formula you provided does not recover the coefficient. 

```
set.seed(11)
off <- rlnorm(200)
a <- 0.52
mu <- 1/(a + off)
y <- rgamma(200, 1, scale = mu)
f <- glm(y~1, offset = off, family = Gamma())
coef(f)
1/mean(y - off)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16729
  
@felixcheung  The one from statmod will be masked and must be called using 
`statmod:tweedie`. 
We can copy the whole `tweedie` function from statmod into `SparkR` and 
this will avoid the issue (we will give credit to the original author). Does 
this sound reasonable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-02-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16729
  
@felixcheung Great point! Yes, I think it's better to stick with the 
statmod syntax and allow the tweedie family to be specified 
as`tweedie(var.power, link.power)`. I tried a few ways to allow this and 
independence of statmod. The solution I came up with is creating and exporting 
a utility `tweedie` function within the SparkR package to mimic the statmod 
syntax. 

It will be better to avoid exporting it while still allowing users to use 
the syntax. We would need to use unevaluated expression, e.g., 
`substitute(family)`. This works in `spark.glm` by calling first `family = 
eval(substitute(family))` and then processing as before. However, it won't work 
with `glm`: `substitute` does not work in methods that have different arguments 
from the generic:  
https://stat.ethz.ch/pipermail/r-help/2008-January/152296.html


I also updated the doc and added some examples and comments in the 
vignettes. Let me know what you think. Thanks much! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-01-31 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@sethah Thanks for the clarification and providing an implementation. So, 
the pros is some speed improvement and the cons is the increased complexity 
(now we have three case: one for intercept only, one for Gaussian with identity 
and one for all the others). Let's see get other committers' opinions. 

Yes, I will throw an error for the special case of no intercept and no 
features. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-01-31 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@sethah Thanks for your input. I can add more tests, but they are not 
adding too much since the algorithm is already tested in other tests. 

The analytical approach does not integrate well with the summary method. 
One has to derive the general formula for the standard error of the intercept, 
and then change the code substantially to make it work with summary. This is 
not an optimal solution IMO. 

BTW, R fits the intercept only model also using IWLS with multiple 
iterations. It is just weird to have a special implementation in this case 
which does not integrate with the current setup. 

@srowen @yanboliang Please advise. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-01-31 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@sethah Yes, we can directly compute the intercept easily. But I'm 
concerned that such special handling may not integrate well with other features 
or future changes. For example, we will need to compute the standard error 
analytically as well, which is not difficult. But the point is that every time 
there is new feature, one would have to modify the intercept calculation part 
to handle it. This does not seem efficient. Thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@sethah Thanks for your review. Yes, using `foldLeft` would be the simplest 
fix. I have included both your suggested changes in the new commit. 

Yes, we could handle the special case of the intercept only model outside 
IWLS. But I think it would be better to use the current fix: a) it allows IWLS 
to handle edge case, b) the fix is general and less likely to be affected by 
future structure changes of GLM (e.g., allowing offset). Does this make sense?  
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r98580805
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
 }
 
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) 
else col($(weightCol))
-val instances: RDD[Instance] =
-  dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
-case Row(label: Double, weight: Double, features: Vector) =>
-  Instance(label, weight, features)
-  }
+val offset = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) {
+  lit(0.0)
+} else {
+  col($(offsetCol)).cast(DoubleType)
+}
--- End diff --

I'm following the style in existing code, for example: 
```
  lazy val rank: Long = if (model.getFitIntercept) {
model.coefficients.size + 1
  } else {
model.coefficients.size
  }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
@yanboliang @zhengruifeng @srowen 
Could you guys take a look and let me know if there is any changes needed? 
Thanks much!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
@imatiach-msft Thanks much for your review. Renamed `off`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r98496017
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
 }
 
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) 
else col($(weightCol))
-val instances: RDD[Instance] =
-  dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
-case Row(label: Double, weight: Double, features: Vector) =>
-  Instance(label, weight, features)
-  }
+val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) {
--- End diff --

Could you point out which lines have the spacing issue? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16740#discussion_r98493237
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -86,13 +86,11 @@ private[ml] class IterativelyReweightedLeastSquares(
 standardizeFeatures = false, standardizeLabel = 
false).fit(newInstances)
 
   // Check convergence
-  val oldCoefficients = oldModel.coefficients
-  val coefficients = model.coefficients
-  BLAS.axpy(-1.0, coefficients, oldCoefficients)
-  val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) =>
-math.max(math.abs(x), math.abs(y))
+  val oldCoefficients = oldModel.coefficients.toArray :+ 
oldModel.intercept
+  val coefficients = model.coefficients.toArray :+ model.intercept
+  val maxTol = oldCoefficients.zip(coefficients).map(x => x._1 - 
x._2).reduce {
--- End diff --

@imatiach-msft The performance change is minor since the coefficients are 
not high-dimensional and these are simple operations and don't consume much 
time than the actual model fitting.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@imatiach-msft Thanks for the comments! This test is based on existing 
tests in GLM. I can try to improve the style and streamline **all** tests in 
another PR but it will be weird to just change this test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
@srowen Thanks much for the suggestion. Included the simplification. Please 
let me know if there is anything else needed for this PR. Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16740: [SPARK-19400] Allow GLM to handle intercept only model

2017-01-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16740
  
The following is a simple example to illustrate the issue. 

```
val dataset = Seq(
  (1.0, 1.0, 2.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 1.0, 2.0),
  (1.0, 3.0, 0.5, 2.0, 1.0),
  (2.0, 4.0, 1.5, 3.0, 3.0)
).toDF("y", "w", "off", "x1", "x2")

val formula = new RFormula().setFormula("y ~ 1")
val output = formula.fit(dataset).transform(dataset)
val glr = new GeneralizedLinearRegression().setFamily("poisson")
val model = glr.fit(output)
```

The above prints out the following error message:
```
java.lang.UnsupportedOperationException: empty.reduceLeft
  at 
scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
  at 
scala.collection.mutable.ArrayOps$ofDouble.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayOps.scala:270)
  at 
scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16740: [SPARK-19400] Allow GLM to handle intercept only ...

2017-01-30 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16740

[SPARK-19400] Allow GLM to handle intercept only model

## What changes were proposed in this pull request?
Intercept-only GLM is failing for non-Gaussian family because of reducing 
an empty array in IWLS. The following code `val maxTolOfCoefficients = 
oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y))` 
fails in the intercept-only model because `oldCoefficients` is empty. This PR 
fixes this issue. 

@yanboliang @srowen @imatiach-msft @zhengruifeng 

## How was this patch tested?
New test for intercept only model.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark interceptOnly

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16740.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16740


commit 69d96319abaa7f4aa98c04ce32556a2cb2c065d2
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-30T07:59:11Z

fix error in reducing empty array




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...

2017-01-29 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16729#discussion_r98363560
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala
 ---
@@ -143,7 +150,12 @@ private[r] object GeneralizedLinearRegressionWrapper
 val rDeviance: Double = summary.deviance
 val rResidualDegreeOfFreedomNull: Long = 
summary.residualDegreeOfFreedomNull
 val rResidualDegreeOfFreedom: Long = summary.residualDegreeOfFreedom
-val rAic: Double = summary.aic
+val rAic: Double = if (family.toLowerCase == "tweedie" &&
+  !Array(0.0, 1.0, 2.0).contains(variancePower)) {
--- End diff --

Thanks for the suggestion. Changed it to comparison instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...

2017-01-29 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16729#discussion_r98363550
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib_regression.R ---
@@ -77,6 +77,18 @@ test_that("spark.glm and predict", {
   out <- capture.output(print(summary(model)))
   expect_true(any(grepl("Dispersion parameter for gamma family", out)))
 
+  # tweedie family
+  require(statmod)
+  model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+ family = tweedie(var.power = 1.2, link.power = 1.0))
+  prediction <- predict(model, training)
+  expect_equal(typeof(take(select(prediction, "prediction"), 
1)$prediction), "double")
--- End diff --

Would you remind me what `dtypes` is and why we need to use it here? 
Thanks.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-01-29 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16729
  
@felixcheung Thanks so much for your quick and detailed review. I have made 
a new commit that removed dependency on `statmod` and fixed the issues you 
pointed out. The major change is to add `variancePower` and `linkPower` as 
arguments in `spark.glm` to avoid using the `tweedie` family from `statmod`. 
Let me know if this design is reasonable. Then I can update additional docs and 
the vignette. Thanks!  



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16730: [SPARK-19395][SparkR]Convert coefficients in summary to ...

2017-01-29 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16730
  
@felixcheung I created a JIRA ticket and added in some tests. Please take a 
look. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16730: [Minor][SparkR]Convert coefficients in summary to...

2017-01-28 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16730

[Minor][SparkR]Convert coefficients in summary to matrix

## What changes were proposed in this pull request?
The `coefficients` component in model summary should be 'matrix' but the 
underlying structure is indeed list. This affects several models except for 
'AFTSurvivalRegressionModel' which has the correct implementation. The fix is 
to first `unlist` the coefficients returned from the `callJMethod` before 
converting to matrix. An example illustrates the issues:

```
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)

> str(s$coefficients)
List of 8
 $ : num 6.53
 $ : num -0.223
 $ : num 0.479
 $ : num 0.155
 $ : num 13.6
 $ : num -1.44
 $ : num 0
 $ : num 0.152
 - attr(*, "dim")= int [1:2] 2 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "(Intercept)" "Sepal_Width"
  ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
> s$coefficients[, 2]
$`(Intercept)`
[1] 0.4788963

$Sepal_Width
[1] 0.1550809
```

This  shows that the underlying structure of coefficients is still `list`. 

@felixcheung @wangmiao1981 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark sparkRCoef

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16730.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16730


commit c80d5301cddfe6b49fa534b60a859b609b547324
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-28T21:07:26Z

convert coefficients to matrix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...

2017-01-28 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16729

[SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

## What changes were proposed in this pull request?
Port Tweedie GLM  #16344  to SparkR

@felixcheung @yanboliang 

## How was this patch tested?
new test in SparkR


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark sparkRTweedie

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16729


commit 67364abee7aae34a093828e843de5ba9315b45a9
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-27T19:56:49Z

start working on SparkR tweedie API

commit 654551b207315b33fce00f98706360934f09c55a
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-28T00:10:54Z

set link only for non-tweedie; fix issue on aic

commit 852dd6eec9ee960487f878dc332ea36acfd6102e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-28T20:05:57Z

add test for tweedie




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-27 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
@zhengruifeng Thanks for the suggestions. Added casting and 
instrumentation. 
@imatiach-msft Thanks for the clarification! It is probably worth  another 
PR to clean up all tests in GLM.  
Let me know if there is any additional comments! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-26 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r98077231
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala 
---
@@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
  * @param features The vector of features for this data point.
  */
 private[ml] case class Instance(label: Double, weight: Double, features: 
Vector)
+
+/**
+ * Case class that represents an instance of data point with
+ * label, weight, offset and features.
+ *
+ * @param label Label for this data point.
+ * @param weight The weight of this instance.
+ * @param offset The offset used for this data point.
+ * @param features The vector of features for this data point.
+ */
+private[ml] case class OffsetInstance(label: Double, weight: Double, 
offset: Double,
+   features: Vector) {
+
+  /** Constructs from an [[Instance]] object and offset */
+  def this(instance: Instance, offset: Double = 0.0) = {
+this(instance.label, instance.weight, offset, instance.features)
+  }
+
+  /** Converts to an [[Instance]] object by leaving out the offset. */
+  private[ml] def toInstance: Instance = Instance(label, weight, features)
--- End diff --

Yes, this is only used once in the current code, and I can get rid of it. 
But I feel other regression-type models may use offset at some point and having 
this method will make it easier to switch between Instance and OffsetInstance.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-26 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
@imatiach-msft Many thanks again for the review. I have incorporated some 
of your suggestions:
1. Create initialization of instance directly if it is Gaussian(identity) 
to avoid expensive conversion from OffsetInstance. 
2. Fix spacing issues in test
3. Change `off` to `offset` to be more informative. 

Feedback on the other suggestions:
1. `move import to beginning, use ~==`:  I did not change this for 
consistency with other tests in GLM. It's probably better to update all tests 
in a separate PR if this is preferred. 
2. Inheriting Instance: I'm not sure about this and leaning toward not 
touching `Instance` since it is used a lot elsewhere. Let's see what others 
suggest.  
Question: if `OffsetInstance` is a subclass of `Instance`, will a method 
with a signature `RDD[Instance]` also work with `RDD[OffsetInstance]`? 
3. Would you please elaborate on the following comment:
> if family = Gaussian and link = Identity we shouldn't be passing any 
offset.

This is the linear regression case and it still allows `offset`. The offset 
will be subtracted from the response as can be seen from the initialization in 
WLS `Instance(label - offset, weight, features)`.   



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-25 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
@imatiach-msft Thanks so much for your detailed review. Incredibly helpful. 
I've addressed all your comments in the new commit. Major changes are 
highlighted below:
1. Create `OffsetInstance` in Instance.scala and remove dependency of IWLS 
on regression. 
2. Add param checking for offset. 

Let me know if there is anything else that needs improvement. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-01-25 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
@zhengruifeng @imatiach-msft 
Thanks much for pointing out the issue due to the hasOffset trait. This is 
what caused the test to fail. I have moved it to the GLRBase class. Things are 
working now 👍 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-01-24 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
This test just resists to start. Could someone help? Many thanks!
@srowen @jkbradley @MLnick @yanboliang 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

2017-01-24 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699
  
jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16699: [SPARK-18710] Add offset in GLM

2017-01-24 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16699

[SPARK-18710] Add offset in GLM

## What changes were proposed in this pull request?
Add support for offset in GLM. This is useful for at least two reasons:

1. Account for exposure: e.g., when modeling the number of accidents, we 
may need to use miles driven as an offset to access factors on frequency.
2. Test incremental effects of new variables: we can use predictions from 
the existing model as offset and run a much smaller model on only new 
variables. This avoids re-estimating the large model with all variables (old + 
new) and can be very important for efficient large-scaled analysis. 

## How was this patch tested?
New test.

@yanboliang @srowen @felixcheung @sethah 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark offset

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16699.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16699


commit 3bf2718c1a1e68273508e63499bb5d1cc8230155
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-24T23:46:16Z

add trait offset

commit 0e240eb313aa91cb645fb3ab8d70e51b6c65b3c7
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-24T23:48:03Z

add offset setter

commit 9c41453a19c0f9c31403fafaf1995c642c37c70d
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-25T05:15:50Z

implement offset in GLM

commit 7823f8af8b0926790816c9e79e9425e503e494ad
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-25T06:55:56Z

add test for glm with offset




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-01-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16675: [SPARK-19155][ML] Make family case insensitive in GLM

2017-01-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16675
  
@yanboliang Thanks. Seems to have passed tests. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Thanks so much for your detailed review. Your suggestions make 
lots of sense and I have included all of them in the new commit. Let me know if 
there is any other change needed. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16675: [SPARK-19155][ML] Make family case insensitive in GLM

2017-01-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16675
  
@yanboliang Thanks for the quick response. How about the new commit, where 
I just change the value from `getFamily` to lower case when necessary, i.e., in 
the calculation of p-value and dispersion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16675: [SPARK-19155][ML] Make family case insensitive in GLM

2017-01-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16675
  
I would prefer that `getFamily` returns lower case values directly, because 
using `getFamily.toLowerCase` can get very cumbersome and I use this a lot in 
another PR #16344.  If we want to keep `getFamily` to retrieve the raw value of 
family, then I can create a private method `getFamilyLowerCase`.  Please advise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16675: [SPARK-19155][ML] make getFamily case insensitive

2017-01-22 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16675

[SPARK-19155][ML] make getFamily case insensitive

## What changes were proposed in this pull request?
This is a supplement to PR #16516 which did not make the value from 
`getFamily` case insensitive. This affects the calculation of `dispersion` and 
`pValue` since the value of family is checked there: ` model.getFamily == 
Binomial.name || model.getFamily == Poisson.name) `. Current tests of 
poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial'. 

A simple fix is to is to convert the value of `getFamily` to lower case:
```
def getFamily: String = $(family).toLowerCase
```

## How was this patch tested?
Update existing tests for 'Poisson' and 'Binomial'. 

@yanboliang @felixcheung @imatiach-msft 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark family

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16675


commit d33e2f135ae62df20337e2752753bcda2756a73d
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-23T02:59:12Z

make getFamily case insensitive




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-01-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
jenkins test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Could anybody help me understand what's causing this test to fail? I see 
several other ML PR failing as well, with the same error message like below:

> Error instrumenting class:org.apache.spark.deploy.mesos.ui.MesosClusterUI
Error instrumenting 
class:org.apache.spark.mllib.regression.IsotonicRegressionModel$SaveLoadV1_0$
> 

Was this related to the recent commit on instrumentation #15671? How do I 
fix this? Thanks! 
@zhengruifeng @yanboliang 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-01-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen @yanboliang @felixcheung @jkbradley Could you help kick off the new 
test please? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen @yanboliang @felixcheung Could you help kick off the new test 
please? Seems to be hanging for a day now. Thanks much. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-01-18 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-01-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
The following code illustrates the idea of this PR. 

```
val datasetWithWeight = Seq(
(1.0, 1.0, 0.0, 5.0),
(0.5, 2.0, 1.0, 2.0),
(1.0, 3.0, 2.0, 1.0),
(0.0, 4.0, 3.0, 3.0)
  ).toDF("y", "w", "x1", "x2")

val formula = (new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label"))
val output = formula.fit(datasetWithWeight).transform(datasetWithWeight)

val glr = new GeneralizedLinearRegression()
val model = glr.fit(output)
model.summary.summaryTable.show
```

This prints out: 
```

+-++---+---+---+
|  Feature|Estimate|   StdError| TValue|
 PValue|

+-++---+---+---+
|Intercept|  1.4523809523809539| 0.9245946589975053| 1.5708299180050451| 
0.3609009059280113|
|   
x1|-0.33387|0.28171808490950573|-1.1832159566199243|0.44669962096188565|
|   x2|-0.11904761904761924|   0.2129588548|-0.5590169943749482| 
0.6754896416955616|

+-++---+---+---+
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-01-17 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16630

[SPARK-19270][ML] Add summary table to GLM summary

## What changes were proposed in this pull request?

Add R-like summary table to GLM summary, which includes feature name (if 
exist), parameter estimate, standard error, t-stat and p-value. This allows 
scala users to easily gather these commonly used inference results.

@srowen @yanboliang 

## How was this patch tested?
New tests. One for testing feature Name, and one for testing the summary 
Table. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark glmTable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16630






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Finally, the test is done. Is there anything else needed for 
this PR? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Still not testing... Been in the status "Asked to test" for a few days now. 
How can we resolve this? Please help kick off the test. Thanks!
@yanboliang @felixcheung @srowen @rxin @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Thanks for the review and comments. I have made a new commit 
that addressed all your comments. The main change is the new companion object 
`FamilyAndLink` and factory methods to construct desired `FamilyAndLink` 
objects from input param map.  Please see my inline responses for details.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r96061883
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,9 +316,9 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
-val linkObj = if (isDefined(link)) {
-  Link.fromName($(link))
+val familyObj = Family.fromParams(this)
+val linkObj = if (isDefined(link) || isDefined(linkPower)) {
+  Link.fromParams(this)
 } else {
   familyObj.defaultLink
 }
--- End diff --

Makes sense. I created a companion object `FamilyAndLink` with a factory 
method to construct desired `FamilyAndLink` objects from the input param map. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r96061873
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link and linkPower.
+ * 1) if family is "tweedie", retrieve object using linkPower
+ * 2) otherwise, retrieve object based on link name
  *
- * @param name link name: "identity", "logit", "log",
- * "inverse", "probit", "cloglog" or "sqrt".
+ * @param params the parameter map containing link and link power
  */
-def fromName(name: String): Link = {
-  name match {
-case Identity.name => Identity
-case Logit.name => Logit
-case Log.name => Log
-case Inverse.name => Inverse
-case Probit.name => Probit
-case CLogLog.name => CLogLog
-case Sqrt.name => Sqrt
+def fromParams(params: GeneralizedLinearRegressionBase): Link = {
+  if (params.getFamily == "tweedie") {
+params.getLinkPower match {
+  case 0.0 => Log
+  case 1.0 => Identity
+  case -1.0 => Inverse
+  case 0.5 => Sqrt
+  case others => new PowerLink(others)
+}
--- End diff --

The last version of the code only allows setting `linkPower` for tweedie, 
or `link` for non-tweedie. But now since we only give warnings, I need to add 
the logic for checking the correct specification. In the `FamilyAndLink` 
companion object, I use the following logic. When family is not tweedie and 
link is set,  or when family is tweedie and linkPower is set, I will use 
`Link.fromParams` to construct the Link object. Otherwise, use the default link 
from the corresponding family.  

So, if the user does not specify `linkPower`,  the link object will be the 
default `new Power(1 - variancePower)` set in the `Tweedie` class. The test for 
`tweedie family against glm (default power link)` covers this. The reason for 
not setting a default linkPower in the param map is to mimic the existing 
behavior: for tweedie with variancePower = 0, this will be the Gaussian with 
the identity link;  for tweedie with variancePower = 1, this will be the 
Poisson with the log link; etc.   

```
 val linkObj = if ((params.getFamily != "tweedie" && 
params.isDefined(params.link)) ||
(params.getFamily == "tweedie" && 
params.isDefined(params.linkPower))) {
Link.fromParams(params)
  } else {
familyObj.defaultLink
  }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-11 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Thanks. Look forward to your feedback. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Sorry about closing this prematurely. I'm giving it another shot and I 
think I have an elegant solution to include `linkPower`. The new commit adds 
the following:
1. It implements the broad family of power link function, specified through 
`linkPower`. There is now a `PowerLink` class for the power link function. It 
has subclasses `Identity`, `Log`, `Inverse` and `Sqrt`. With this, the GLM now 
supports all distributions characterized by power variance function and power 
link function. For now, I restrict the `linkPower` to be in `[-10, 10]` for 
numerical stability, but can change that.  
2. The key to avoid all the messy coding is to use **only** `link` for 
non-tweedie family, and **only** `lnkPower` for tweedie family, as @yanboliang 
suggested. I have added validation for this, and also the `fromParams` method 
in `Link` to get the correct Link object based on `link` and `linkPower`. 
3. I added new tests to test tweedie with default link. For example, this 
reproduces the Gaussian GLM estimate when `variancePower = 0` (default link 
would be identity). Similarly, this addresses @yanboliang example where the 
result from `val trainer = new 
GeneralizedLinearRegression().setFamily("tweedie").setVariancePower(1.5)` 
produces the same estimate as in R `glm(formula = "b ~ .", family = 
tweedie(var.power=1.5), data = df)`. 

@yanboliang @srowen Would you please take another look and let me know if 
there is additional changes needed? Thanks.
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-09 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-09 Thread actuaryzhang
GitHub user actuaryzhang reopened a pull request:

https://github.com/apache/spark/pull/16344

[SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?
I propose to add the full Tweedie family into the 
GeneralizedLinearRegression model. The Tweedie family is characterized by a 
power variance function. Currently supported distributions such as Gaussian, 
Poisson and Gamma families are a special case of the Tweedie 
https://en.wikipedia.org/wiki/Tweedie_distribution.

@yanboliang @srowen @sethah 

I propose to add support for the other distributions:
- compound Poisson: 1 < varPower < 2. This one is widely used to model 
zero-inflated continuous distributions, e.g., in insurance, finance, ecology, 
meteorology, advertising etc.
- positive stable: varPower > 2 and varPower != 3. Used to model extreme 
values.
- inverse Gaussian: varPower = 3.

The Tweedie family is supported in most statistical packages such as R 
(statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed 
for now. 
- Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values 
in (1, 2) and (2, infty). Also set default value to 1.5 and add getter method.
- Add `Tweedie` class
- Add tests for tweedie GLM

Note:
- In computing deviance, use `math.max(y, 0.1)` to avoid taking inverse of 
0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because the evaluation of the [Tweedie 
density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in 
these cases are non-trivial. I will implement the density approximation method 
in a future PR.  R returns `null` (see `tweedie()$aic`).


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16344


commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-16T00:50:51Z

Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-19T22:50:02Z

Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-19T23:14:37Z

Merge test into GLR

commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-20T22:59:05Z

Use Tweedie class instead of global object Tweedie; change variancePower to 
varPower

commit a8feea7d8095170c1b5f18b7887f16a6d763e42c
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-21T23:42:40Z

Allow Family to use GLRBase object directly

commit 233e2d338be8d36a74eaf578bfea804ae3617d4e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T01:56:34Z

Add TweedieFamily and implement specific distn within Tweedie

commit 17c55816c914bc96a8b5141356e3c117f343f303
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T04:39:54Z

Clean up doc

commit 0b41825e99020976a34d8fe9c983f26de6c8c40f
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T17:52:01Z

Move defaultLink and name to subclass of TweedieFamily

commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T18:10:51Z

Change style for AIC

commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-23T19:10:20Z

Rename Family methods and restore methods for tweedie subclasses

commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-23T19:12:25Z

Update test

commit 9a71e89f629260c775922901a04c989f36ea4946
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-27T17:16:40Z

Clean up doc

commit f461c09e65360f695ad3092b41bc26e0c61bbd95
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-27T22:18:39Z

Put delta in Tweedie companion object

commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-27T22:23:57Z

Clean up doc

commit fab265278109eede4cce7ee506e8b29d481c4549
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-05T19:32:06Z

Allow more link functions in tweedie




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your pro

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen @yanboliang 
I'm closing this PR since it does not seem to be very clean to integrate 
into the current GLM setup. I appreciate all the comments and discussions. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang closed the pull request at:

https://github.com/apache/spark/pull/16344


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Thanks for the feedback. However, I'm not sure why we need to 
be consistent with R on this one. The usage of 'tweedie' glm almost always uses 
`link.power = 0, 1, -1`. There is barely any use case for the default 
`1-variancePower` link.  I think a better design is to set the default to the 
one that is used more often, and make it clear that `we only support identity, 
log, inverse and sqrt link functions`. 

On the other hand, even if we want to be consistent with R, I actually 
wrote the R package `cplm` which fits the tweedie 
[glm](http://finzi.psych.upenn.edu/library/cplm/html/cpglm.html) as well as 
[mixed effect 
model](https://www.rdocumentation.org/packages/cplm/versions/0.7-4/topics/cpglmm).
 In that package I set the default to the `log` link and never had an issue. 

I don't think I will implement `linkPower`, but I'm fine if you want to do 
it. However, if the lack of `linkPower` means this PR won't get merged, please 
let me know and we can close this and move on. 






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94849556
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param params the parameter map containing family name and variance 
power
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromParams(params: GeneralizedLinearRegressionBase): Family = {
+  params.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  params.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new Tweedie(default)
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94849540
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -365,7 +401,6 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   /**
* A description of the error distribution to be used in the model.
*
-   * @param name the name of the family.
--- End diff --

Sorry. added this back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94849501
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -158,6 +183,16 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   setDefault(family -> Gaussian.name)
 
   /**
+* Sets the value of param [[variancePower]].
+* Used only when family is "tweedie".
+*
+* @group setParam
+*/
+  @Since("2.2.0")
+  def setVariancePower(value: Double): this.type = set(variancePower, 
value)
+  setDefault(variancePower -> 1.5)
--- End diff --

Done. change default variancePower to 0.0, which will use Gaussian (with 
default identity link)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Thanks for the detailed review. I have made all changes you 
suggested except for the part on the new power link function. Yes, the 
canonical link in the Tweedie in general is `1.0 - variancePower`. But other 
than the special cases of 'identity', 'log', 'inverse', it is rarely used. The 
implementation of the generic power link function is unnecessary in my opinion 
for the following reasons:

1. The default link in the Tweedie is overridden in Gaussian, Poisson and 
Gamma to the respective canonical link. So when one specifies tweedie with 
variancePower = 0, 1, 2, they will have the correct canonical link.  I now 
allow Tweedie to take link function 'identity', 'log', 'inverse', 'sqrt', which 
cover almost all possible links that will be used in practice.  
2. If we allow an additional 'linkPower" parameter, we will have two ways 
to specify the link: one through 'link' and one through 'linkPower'. This will 
be confusing. 
3. The implementation of the 'linkPower' is messy with almost no additional 
gain. Have to write new powerLink class, fromParams method which determines the 
link function based on 'link', 'linkPower' as well as 'family', and also 
various checks (consistency between link and linkPower, supported Family and 
link, etc). I tried it, and it's very ugly. 
 
If we really want a power link function, can I suggest that we implement it 
in a separate PR since it is really not specific to the 'tweedie' family? It 
can also be used together with other distributions. 

@srowen 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Did you get a chance to take another look at this? Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-27 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen Made a new commit according to your suggestion. Everything looking 
good now? 
@yanboliang 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-26 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen @yanboliang 
Any additional issues regarding this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@yanboliang Thanks much for the detailed comments. I have addressed all of 
them in the new commits. Please take another look. 
@srowen 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93672741
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
+"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog,
+"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt,
+"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log,
+"tweedie" -> Identity, "tweedie" -> Log
   )
 
   /** Set of family names that GeneralizedLinearRegression supports. */
-  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1.name)
+  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1)
 
   /** Set of link names that GeneralizedLinearRegression supports. */
   private[regression] lazy val supportedLinkNames = 
supportedFamilyAndLinkPairs.map(_._2.name)
 
   private[regression] val epsilon: Double = 1E-16
 
+  /** Constant used in initialization and deviance to avoid numerical 
issues. */
+  private[regression] val delta: Double = 0.1
--- End diff --

They are already in the `GeneralizedLinearRegression` object, aren't they? 
Or do you mean creating a new object say `Constant` that stores these two 
constants, and using them like `Constant.delta`? 

Since `delta` is only used in the `TweedieFamily` class, I can also move it 
there. Let me know what is best. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen Thanks for the comments. Makes lots of sense to move the switch to 
subclass. I did not know one could override a `val`. 

In the new commit, I have moved the `defaultLink` and `aic` to subclasses. 
I also added back `name` in the constructor. The other methods `initialize`, 
`variance`, `deviance` and `project` are common and thus implemented in the 
`TweedieFamily` parent class. Hope this makes sense. 






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen @yanboliang Thanks much for the feedback. I now have a better 
understanding of the code and the issue. I have made new commits reflecting 
your suggestions. The major changes are summarized below:

- Create new method `fromModel` in the `Family` object to get the desired 
family object by accessing the `GeneralizedLinearRegressionBase` object 
directly. This solves the main challenge of setting the variancePower of the 
Tweedie family. The old `fromName` method is deleted since we now need both 
`name` and `variancePower`. 
- Create `TweedieFamily` as new subclass of `family`. This class 
encompasses the special cases of Gaussian, Poisson and Gamma, which can be 
handled together using the general formula based on variancePower. 

Please let me know if there are any other issues. Thanks.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93565567
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
--- End diff --

@yanboliang Could you help me understand the issue caused by using string? 
If I use object, then I have to create a Tweedie object that is not used 
anywhere else. And also I have to write two methods in `Family`: one returns 
the global Tweedie object (where the variancePower is preset) and one returns 
the a TweedieFamily object created using the user-specified variancePower. I 
hope we are fine using string since there are only a few values. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93565335
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,27 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
--- End diff --

changed tweedie. but other docs have been using Param..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93486159
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
--- End diff --

@yanboliang  The member object (in `GeneralizedLinearRegressionBase`) won't 
be accessible in `Family`, right? The method `Family.fromName($(family))` uses 
global objects like `Poisson`, `Gamma` etc. To use  `Family.fromName`, I need 
to create a `Tweedie` global object. Then we are back to the issue that @srowen 
pointed out of setting `variancePower` of the global object. Please advise. 
Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93482978
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
+val familyObj = if ($(family) == "tweedie") {
+  new Tweedie($(varPower))
--- End diff --

`Family` is not subclass of `GeneralizedLinearRegression`. Could you 
elaborate how to make it get the `variancePower` value? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
@srowen Thanks for the comments. Really helpful. I have made a new commit 
that addresses the issues you raised:

- I think the use of a global family object does not work well for the 
tweedie case, since we have to set the variance power. I now define a tweedie 
class and create the tweedie object within the `train` method of 
`GeneralizedLinearRegression` where the variance power is set. Does this make 
sense? 
- I created a constant `delta = 0.1` in the `Family` class which is used to 
shift `mu (or y) = 0` to avoid numerical issues. 
- I now change `variancePower` to `varPower` to save typing (and consistent 
with R and H2o).
- The `project` method you asked about is copied from the other families, 
which I suppose is to bound the mean to stabilize estimation.
- I now throw `UnsupportedOperationException` for tweedie AIC.

Let me know if there is any other issues. 

Copying others for reviewing. @yanboliang @sethah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93290858
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower > 1.0 && variancePower < 2.0) {
+require(y >= 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+math.max(y, 0.1)
--- End diff --

I have not seen a formal justification for the choice of 0.1 in R. This 
seminal 
[paper](http://users.du.se/~lrn/StatMod10/HomeExercise2/Nelder_Pregibon.pdf) 
suggests 1/6 (about 0.17) to be the best constant. I would prefer to be 
consistent with R so that we can make comparison. Using a constant is a good 
idea. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93289668
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
--- End diff --

Would you please suggest a better way to set the variancePower? I want to 
be consistent with the existing code to have the `Family` objects, but I need 
to also pass on the input `variancePower` to the `Tweedie` object which is used 
to compute the variance function. Any suggestion will be highly appreciated. 
@srowen @yanboliang   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16344
  
Jenkins, add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-19 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16344

[SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?
I propose to add the full Tweedie family into the 
GeneralizedLinearRegression model. The Tweedie family is characterized by a 
power variance function. Currently supported distributions such as Gaussian, 
Poisson and Gamma families are a special case of the Tweedie 
https://en.wikipedia.org/wiki/Tweedie_distribution.

@yanboliang @srowen @sethah 

I propose to add support for the other distributions:
- compound Poisson: 1 < variancePower < 2. This one is widely used to model 
zero-inflated continuous distributions, e.g., in insurance, finance, ecology, 
meteorology, advertising etc.
- positive stable: variancePower > 2 and variancePower != 3. Used to model 
extreme values.
- inverse Gaussian: variancePower = 3.

The Tweedie family is supported in most statistical packages such as R 
(statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed 
for now. 
- Add `variancePower` to `GeneralizedLinearRegressionBase`, which takes 
values in (1, 2) and [3, infty). Also set default value to 1.5 and add getter 
method.
- `Family.fromName` has a second argument `variancePower`
- Add `Tweedie` object
- Add tests for tweedie GLM

Note:
- In computing deviance, use `math.max(y, 0.1)` to avoid taking inverse of 
0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because the evaluation of the [Tweedie 
density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in 
these cases are non-trivial. I will implement the density approximation method 
in a future PR.  R returns `null` (see `tweedie()$aic`).


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16344


commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-16T00:50:51Z

Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-19T22:50:02Z

Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-19T23:14:37Z

Merge test into GLR




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@srowen @sethah 
Thanks for all the helpful discussions! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@srowen @sethah 
One more commit that adds a test case with `weight = 4.7` which will round 
up to 5 to test the case @sethah described. All tests passed.  I'm pretty sure 
R's rounding is the same as what I'm doing here. Please merge if there is no 
other issue. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@sethah Would you please review this? Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@sethah @srowen 
I updated the documentation. I think we have everything needed for this 
fix. Please merge and close this PR if there is no other issue. Thanks much for 
all the comments. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91643659
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -468,11 +469,7 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 override def variance(mu: Double): Double = mu * (1.0 - mu)
 
 private def ylogy(y: Double, mu: Double): Double = {
-  if (y == 0) {
-0.0
-  } else {
-y * math.log(y / mu)
-  }
+  if (y == 0) 0.0 else y * math.log(y / mu)
--- End diff --

I actually prefer not putting the whole thing in one line. This is more 
readable. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91643515
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
 val datasetWithWeight = Seq(
   Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
   Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-  Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+  Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
--- End diff --

If you mean input weight = 0, then it is better to be tested separately 
along with all the other families from the current PR. I think the current test 
is good for the current PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91563322
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
 val datasetWithWeight = Seq(
   Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
   Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-  Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+  Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
--- End diff --

Isn't `round(0.3) = 0`? 
Or do you mean testing the raw (input) weight  = 0? How is that handled? I 
would assume that is just leaving out that observation, since that's how R does 
it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@sethah @srowen I have added a comment to the weigthCol doc for the 
Binomial case. 
I also updated to test the case `weight < 0.5`, i.e., `round(weight) = 0`. 
All tests passed.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91440862
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -479,7 +485,12 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 numInstances: Double,
 weightSum: Double): Double = {
   -2.0 * predictions.map { case (y: Double, mu: Double, weight: 
Double) =>
-weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+val wt = math.round(weight).toInt
+if (wt == 0) {
+  0.0
+} else {
+  dist.Binomial(wt, mu).logProbabilityOf(math.round(y * 
weight).toInt)
--- End diff --

@sethah Code updated as suggested


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@srowen @sethah 
I have cleaned up the change as suggested. Please review and let me know if 
there is any question.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
@srowen @sethah 
Thanks for the comments. Yes, the major use case is to be able to handle 
multiple trials (integer weight, real-valued response). Indeed, a better way to 
do this is through `offset`, which I have proposed to do in this JIRA 
[SPARK-18710](https://issues.apache.org/jira/browse/SPARK-18710). Please let me 
know if this is worth pursuing. 

I have submitted another two commits.

1. One commit makes minimal modification to the exiting Binomial GLM test 
so that one response record is now non-integer. This test still failed because 
the deviance residual calculation seems to work only for `y in (0, 1)`
2. The second commit fixes the issue in calculating the deviance. But I 
think the code can be improved, especially regarding the function `y_logy`. 
What's the best way to create a utility function like this? Please advise. 

With the two commits, all tests now passed, including the ones on AIC and 
the deviance.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91246870
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 numInstances: Double,
 weightSum: Double): Double = {
   -2.0 * predictions.map { case (y: Double, mu: Double, weight: 
Double) =>
-weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+val wt = math.round(weight).toInt
+if (wt == 0) {
+  0.0
+} else {
+  dist.Binomial(wt, mu).logProbabilityOf(math.round(y * 
weight).toInt)
--- End diff --


@srowen The current uses 
[breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/distributions/Binomial.scala)
 for computing the Binomial density, which does not allow 0 trials. `require(n 
> 0, "n must be positive!")`. That is why I have to handle the case of `wt = 0` 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...

2016-12-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16131
  
@srowen Is this ready to be merged? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...

2016-12-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16131
  
@srowen Done. Thanks for the suggestion. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...

2016-12-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16131
  
@sethah Thanks for the review. I have updated according to your suggestion. 

@yanboliang @srowen  Please take another look. Thanks. 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...

2016-12-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16131
  
@srowen @yanboliang 
I have updated the code and further cleaned up the test. Please review and 
let me know if there is any question. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to ...

2016-12-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16131#discussion_r90965352
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -505,7 +505,7 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 override def initialize(y: Double, weight: Double): Double = {
   require(y >= 0.0, "The response variable of Poisson family " +
 s"should be non-negative, but got $y")
-  y
+  y + 0.1
--- End diff --

@srowen I verified that the following approaches give the same results for 
a few data sets 
a)`y + 0.1`
b) `if (y == 0) y + 0.1 else y`
c)` math.max(0.1, y)`
Which one do you want to use, c)? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to ...

2016-12-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16131#discussion_r90955908
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -505,7 +505,7 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 override def initialize(y: Double, weight: Double): Double = {
   require(y >= 0.0, "The response variable of Poisson family " +
 s"should be non-negative, but got $y")
-  y
+  y + 0.1
--- End diff --

@srowen Theoretically, one only needs to add 0.1 to the y = 0 case, which 
is a guess of the mean for those cases. But I think it may be better to add 
this small number to all cases. Imagine that one models the rates of 
occurrence, i.e., frequency divided by exposure. For certain large exposure, 
the rate can get tiny and close to zero. Adding 0.1 to that may help avoid 
numerical issues too in that case. Does that make sense? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM

2016-12-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16149
  
Jenkins, add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



<    1   2   3   4   5   6   >