[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16740#discussion_r99269006 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -743,6 +743,55 @@ class GeneralizedLinearRegressionSuite } } + test("generalized linear regression: intercept only") { +/* + R code: + y <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + model1 <- glm(y ~ 1, family = poisson) + model2 <- glm(y ~ 1, family = poisson, weights = w) + as.vector(c(coef(model1), coef(model2))) + [1] 3.091042 3.178054 + */ + +val dataset = Seq( + Instance(17.0, 1.0, Vectors.zeros(0)), + Instance(19.0, 2.0, Vectors.zeros(0)), + Instance(23.0, 3.0, Vectors.zeros(0)), + Instance(29.0, 4.0, Vectors.zeros(0)) +).toDF() + +val expected = Seq(3.091, 3.178) + +import GeneralizedLinearRegression._ + +var idx = 0 +for (useWeight <- Seq(false, true)) { + val trainer = new GeneralizedLinearRegression().setFamily("poisson") +.setLinkPredictionCol("linkPrediction") + if (useWeight) trainer.setWeightCol("weight") + val model = trainer.fit(dataset) + val actual = model.intercept + assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " + +s"useWeight = $useWeight.") + assert(model.coefficients === new DenseVector(Array.empty[Double])) + + val familyLink = FamilyAndLink(trainer) + model.transform(dataset).select("features", "prediction", "linkPrediction").collect() +.foreach { + case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) => +val eta = BLAS.dot(features, model.coefficients) + model.intercept +val prediction2 = familyLink.fitted(eta) --- End diff -- @sethah Agree. Removed this. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @srowen would you please take a look and merge this if all is good? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Your formula for offset does not seem to be a general solution, and I'm not sure if there exists an analytical formula, in particular when the link function is not identity or log. In GLM, the normal equation for the coefficient is: (y - mu) * w = 0. When there is offset, mu = link.inv(intcpt + offset). Take the case where link.inv is inverse as an example, then we have (y - inverse(intcpt + offset)) * w = 0. This is nonlinear and has to be solved iteratively, right? The following is a simple R example to show that the formula you provided does not recover the coefficient. ``` set.seed(11) off <- rlnorm(200) a <- 0.52 mu <- 1/(a + off) y <- rgamma(200, 1, scale = mu) f <- glm(y~1, offset = off, family = Gamma()) coef(f) 1/mean(y - off) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung The one from statmod will be masked and must be called using `statmod:tweedie`. We can copy the whole `tweedie` function from statmod into `SparkR` and this will avoid the issue (we will give credit to the original author). Does this sound reasonable? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Great point! Yes, I think it's better to stick with the statmod syntax and allow the tweedie family to be specified as`tweedie(var.power, link.power)`. I tried a few ways to allow this and independence of statmod. The solution I came up with is creating and exporting a utility `tweedie` function within the SparkR package to mimic the statmod syntax. It will be better to avoid exporting it while still allowing users to use the syntax. We would need to use unevaluated expression, e.g., `substitute(family)`. This works in `spark.glm` by calling first `family = eval(substitute(family))` and then processing as before. However, it won't work with `glm`: `substitute` does not work in methods that have different arguments from the generic: https://stat.ethz.ch/pipermail/r-help/2008-January/152296.html I also updated the doc and added some examples and comments in the vignettes. Let me know what you think. Thanks much! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Thanks for the clarification and providing an implementation. So, the pros is some speed improvement and the cons is the increased complexity (now we have three case: one for intercept only, one for Gaussian with identity and one for all the others). Let's see get other committers' opinions. Yes, I will throw an error for the special case of no intercept and no features. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Thanks for your input. I can add more tests, but they are not adding too much since the algorithm is already tested in other tests. The analytical approach does not integrate well with the summary method. One has to derive the general formula for the standard error of the intercept, and then change the code substantially to make it work with summary. This is not an optimal solution IMO. BTW, R fits the intercept only model also using IWLS with multiple iterations. It is just weird to have a special implementation in this case which does not integrate with the current setup. @srowen @yanboliang Please advise. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Yes, we can directly compute the intercept easily. But I'm concerned that such special handling may not integrate well with other features or future changes. For example, we will need to compute the standard error analytically as well, which is not difficult. But the point is that every time there is new feature, one would have to modify the intercept calculation part to handle it. This does not seem efficient. Thoughts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Thanks for your review. Yes, using `foldLeft` would be the simplest fix. I have included both your suggested changes in the new commit. Yes, we could handle the special case of the intercept only model outside IWLS. But I think it would be better to use the current fix: a) it allows IWLS to handle edge case, b) the fix is general and less likely to be affected by future structure changes of GLM (e.g., allowing offset). Does this make sense? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r98580805 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val } val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) -val instances: RDD[Instance] = - dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map { -case Row(label: Double, weight: Double, features: Vector) => - Instance(label, weight, features) - } +val offset = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) { + lit(0.0) +} else { + col($(offsetCol)).cast(DoubleType) +} --- End diff -- I'm following the style in existing code, for example: ``` lazy val rank: Long = if (model.getFitIntercept) { model.coefficients.size + 1 } else { model.coefficients.size } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @yanboliang @zhengruifeng @srowen Could you guys take a look and let me know if there is any changes needed? Thanks much! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @imatiach-msft Thanks much for your review. Renamed `off`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r98496017 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val } val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) -val instances: RDD[Instance] = - dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map { -case Row(label: Double, weight: Double, features: Vector) => - Instance(label, weight, features) - } +val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) { --- End diff -- Could you point out which lines have the spacing issue? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16740#discussion_r98493237 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala --- @@ -86,13 +86,11 @@ private[ml] class IterativelyReweightedLeastSquares( standardizeFeatures = false, standardizeLabel = false).fit(newInstances) // Check convergence - val oldCoefficients = oldModel.coefficients - val coefficients = model.coefficients - BLAS.axpy(-1.0, coefficients, oldCoefficients) - val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => -math.max(math.abs(x), math.abs(y)) + val oldCoefficients = oldModel.coefficients.toArray :+ oldModel.intercept + val coefficients = model.coefficients.toArray :+ model.intercept + val maxTol = oldCoefficients.zip(coefficients).map(x => x._1 - x._2).reduce { --- End diff -- @imatiach-msft The performance change is minor since the coefficients are not high-dimensional and these are simple operations and don't consume much time than the actual model fitting. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @imatiach-msft Thanks for the comments! This test is based on existing tests in GLM. I can try to improve the style and streamline **all** tests in another PR but it will be weird to just change this test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @srowen Thanks much for the suggestion. Included the simplification. Please let me know if there is anything else needed for this PR. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16740: [SPARK-19400] Allow GLM to handle intercept only model
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 The following is a simple example to illustrate the issue. ``` val dataset = Seq( (1.0, 1.0, 2.0, 0.0, 5.0), (0.5, 2.0, 1.0, 1.0, 2.0), (1.0, 3.0, 0.5, 2.0, 1.0), (2.0, 4.0, 1.5, 3.0, 3.0) ).toDF("y", "w", "off", "x1", "x2") val formula = new RFormula().setFormula("y ~ 1") val output = formula.fit(dataset).transform(dataset) val glr = new GeneralizedLinearRegression().setFamily("poisson") val model = glr.fit(output) ``` The above prints out the following error message: ``` java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) at scala.collection.mutable.ArrayOps$ofDouble.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayOps.scala:270) at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16740: [SPARK-19400] Allow GLM to handle intercept only ...
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16740 [SPARK-19400] Allow GLM to handle intercept only model ## What changes were proposed in this pull request? Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y))` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue. @yanboliang @srowen @imatiach-msft @zhengruifeng ## How was this patch tested? New test for intercept only model. You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark interceptOnly Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16740.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16740 commit 69d96319abaa7f4aa98c04ce32556a2cb2c065d2 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-30T07:59:11Z fix error in reducing empty array --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16729#discussion_r98363560 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala --- @@ -143,7 +150,12 @@ private[r] object GeneralizedLinearRegressionWrapper val rDeviance: Double = summary.deviance val rResidualDegreeOfFreedomNull: Long = summary.residualDegreeOfFreedomNull val rResidualDegreeOfFreedom: Long = summary.residualDegreeOfFreedom -val rAic: Double = summary.aic +val rAic: Double = if (family.toLowerCase == "tweedie" && + !Array(0.0, 1.0, 2.0).contains(variancePower)) { --- End diff -- Thanks for the suggestion. Changed it to comparison instead. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16729#discussion_r98363550 --- Diff: R/pkg/inst/tests/testthat/test_mllib_regression.R --- @@ -77,6 +77,18 @@ test_that("spark.glm and predict", { out <- capture.output(print(summary(model))) expect_true(any(grepl("Dispersion parameter for gamma family", out))) + # tweedie family + require(statmod) + model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species, + family = tweedie(var.power = 1.2, link.power = 1.0)) + prediction <- predict(model, training) + expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "double") --- End diff -- Would you remind me what `dtypes` is and why we need to use it here? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Thanks so much for your quick and detailed review. I have made a new commit that removed dependency on `statmod` and fixed the issues you pointed out. The major change is to add `variancePower` and `linkPower` as arguments in `spark.glm` to avoid using the `tweedie` family from `statmod`. Let me know if this design is reasonable. Then I can update additional docs and the vignette. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16730: [SPARK-19395][SparkR]Convert coefficients in summary to ...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16730 @felixcheung I created a JIRA ticket and added in some tests. Please take a look. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16730: [Minor][SparkR]Convert coefficients in summary to...
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16730 [Minor][SparkR]Convert coefficients in summary to matrix ## What changes were proposed in this pull request? The `coefficients` component in model summary should be 'matrix' but the underlying structure is indeed list. This affects several models except for 'AFTSurvivalRegressionModel' which has the correct implementation. The fix is to first `unlist` the coefficients returned from the `callJMethod` before converting to matrix. An example illustrates the issues: ``` data(iris) df <- createDataFrame(iris) model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian") s <- summary(model) > str(s$coefficients) List of 8 $ : num 6.53 $ : num -0.223 $ : num 0.479 $ : num 0.155 $ : num 13.6 $ : num -1.44 $ : num 0 $ : num 0.152 - attr(*, "dim")= int [1:2] 2 4 - attr(*, "dimnames")=List of 2 ..$ : chr [1:2] "(Intercept)" "Sepal_Width" ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)" > s$coefficients[, 2] $`(Intercept)` [1] 0.4788963 $Sepal_Width [1] 0.1550809 ``` This shows that the underlying structure of coefficients is still `list`. @felixcheung @wangmiao1981 You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark sparkRCoef Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16730.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16730 commit c80d5301cddfe6b49fa534b60a859b609b547324 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-28T21:07:26Z convert coefficients to matrix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16729 [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR ## What changes were proposed in this pull request? Port Tweedie GLM #16344 to SparkR @felixcheung @yanboliang ## How was this patch tested? new test in SparkR You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark sparkRTweedie Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16729.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16729 commit 67364abee7aae34a093828e843de5ba9315b45a9 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-27T19:56:49Z start working on SparkR tweedie API commit 654551b207315b33fce00f98706360934f09c55a Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-28T00:10:54Z set link only for non-tweedie; fix issue on aic commit 852dd6eec9ee960487f878dc332ea36acfd6102e Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-28T20:05:57Z add test for tweedie --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @zhengruifeng Thanks for the suggestions. Added casting and instrumentation. @imatiach-msft Thanks for the clarification! It is probably worth another PR to clean up all tests in GLM. Let me know if there is any additional comments! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r98077231 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala --- @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector * @param features The vector of features for this data point. */ private[ml] case class Instance(label: Double, weight: Double, features: Vector) + +/** + * Case class that represents an instance of data point with + * label, weight, offset and features. + * + * @param label Label for this data point. + * @param weight The weight of this instance. + * @param offset The offset used for this data point. + * @param features The vector of features for this data point. + */ +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double, + features: Vector) { + + /** Constructs from an [[Instance]] object and offset */ + def this(instance: Instance, offset: Double = 0.0) = { +this(instance.label, instance.weight, offset, instance.features) + } + + /** Converts to an [[Instance]] object by leaving out the offset. */ + private[ml] def toInstance: Instance = Instance(label, weight, features) --- End diff -- Yes, this is only used once in the current code, and I can get rid of it. But I feel other regression-type models may use offset at some point and having this method will make it easier to switch between Instance and OffsetInstance. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @imatiach-msft Many thanks again for the review. I have incorporated some of your suggestions: 1. Create initialization of instance directly if it is Gaussian(identity) to avoid expensive conversion from OffsetInstance. 2. Fix spacing issues in test 3. Change `off` to `offset` to be more informative. Feedback on the other suggestions: 1. `move import to beginning, use ~==`: I did not change this for consistency with other tests in GLM. It's probably better to update all tests in a separate PR if this is preferred. 2. Inheriting Instance: I'm not sure about this and leaning toward not touching `Instance` since it is used a lot elsewhere. Let's see what others suggest. Question: if `OffsetInstance` is a subclass of `Instance`, will a method with a signature `RDD[Instance]` also work with `RDD[OffsetInstance]`? 3. Would you please elaborate on the following comment: > if family = Gaussian and link = Identity we shouldn't be passing any offset. This is the linear regression case and it still allows `offset`. The offset will be subtracted from the response as can be seen from the initialization in WLS `Instance(label - offset, weight, features)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @imatiach-msft Thanks so much for your detailed review. Incredibly helpful. I've addressed all your comments in the new commit. Major changes are highlighted below: 1. Create `OffsetInstance` in Instance.scala and remove dependency of IWLS on regression. 2. Add param checking for offset. Let me know if there is anything else that needs improvement. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @zhengruifeng @imatiach-msft Thanks much for pointing out the issue due to the hasOffset trait. This is what caused the test to fail. I have moved it to the GLRBase class. Things are working now ð --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 This test just resists to start. Could someone help? Many thanks! @srowen @jkbradley @MLnick @yanboliang --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16699: [SPARK-18710] Add offset in GLM
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16699 [SPARK-18710] Add offset in GLM ## What changes were proposed in this pull request? Add support for offset in GLM. This is useful for at least two reasons: 1. Account for exposure: e.g., when modeling the number of accidents, we may need to use miles driven as an offset to access factors on frequency. 2. Test incremental effects of new variables: we can use predictions from the existing model as offset and run a much smaller model on only new variables. This avoids re-estimating the large model with all variables (old + new) and can be very important for efficient large-scaled analysis. ## How was this patch tested? New test. @yanboliang @srowen @felixcheung @sethah You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark offset Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16699.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16699 commit 3bf2718c1a1e68273508e63499bb5d1cc8230155 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-24T23:46:16Z add trait offset commit 0e240eb313aa91cb645fb3ab8d70e51b6c65b3c7 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-24T23:48:03Z add offset setter commit 9c41453a19c0f9c31403fafaf1995c642c37c70d Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-25T05:15:50Z implement offset in GLM commit 7823f8af8b0926790816c9e79e9425e503e494ad Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-25T06:55:56Z add test for glm with offset --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16675: [SPARK-19155][ML] Make family case insensitive in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16675 @yanboliang Thanks. Seems to have passed tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks so much for your detailed review. Your suggestions make lots of sense and I have included all of them in the new commit. Let me know if there is any other change needed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16675: [SPARK-19155][ML] Make family case insensitive in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16675 @yanboliang Thanks for the quick response. How about the new commit, where I just change the value from `getFamily` to lower case when necessary, i.e., in the calculation of p-value and dispersion? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16675: [SPARK-19155][ML] Make family case insensitive in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16675 I would prefer that `getFamily` returns lower case values directly, because using `getFamily.toLowerCase` can get very cumbersome and I use this a lot in another PR #16344. If we want to keep `getFamily` to retrieve the raw value of family, then I can create a private method `getFamilyLowerCase`. Please advise. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16675: [SPARK-19155][ML] make getFamily case insensitive
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16675 [SPARK-19155][ML] make getFamily case insensitive ## What changes were proposed in this pull request? This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. This affects the calculation of `dispersion` and `pValue` since the value of family is checked there: ` model.getFamily == Binomial.name || model.getFamily == Poisson.name) `. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial'. A simple fix is to is to convert the value of `getFamily` to lower case: ``` def getFamily: String = $(family).toLowerCase ``` ## How was this patch tested? Update existing tests for 'Poisson' and 'Binomial'. @yanboliang @felixcheung @imatiach-msft You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark family Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16675.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16675 commit d33e2f135ae62df20337e2752753bcda2756a73d Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-23T02:59:12Z make getFamily case insensitive --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 jenkins test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Could anybody help me understand what's causing this test to fail? I see several other ML PR failing as well, with the same error message like below: > Error instrumenting class:org.apache.spark.deploy.mesos.ui.MesosClusterUI Error instrumenting class:org.apache.spark.mllib.regression.IsotonicRegressionModel$SaveLoadV1_0$ > Was this related to the recent commit on instrumentation #15671? How do I fix this? Thanks! @zhengruifeng @yanboliang --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang @felixcheung @jkbradley Could you help kick off the new test please? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang @felixcheung Could you help kick off the new test please? Seems to be hanging for a day now. Thanks much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 The following code illustrates the idea of this PR. ``` val datasetWithWeight = Seq( (1.0, 1.0, 0.0, 5.0), (0.5, 2.0, 1.0, 2.0), (1.0, 3.0, 2.0, 1.0), (0.0, 4.0, 3.0, 3.0) ).toDF("y", "w", "x1", "x2") val formula = (new RFormula() .setFormula("y ~ x1 + x2") .setFeaturesCol("features") .setLabelCol("label")) val output = formula.fit(datasetWithWeight).transform(datasetWithWeight) val glr = new GeneralizedLinearRegression() val model = glr.fit(output) model.summary.summaryTable.show ``` This prints out: ``` +-++---+---+---+ | Feature|Estimate| StdError| TValue| PValue| +-++---+---+---+ |Intercept| 1.4523809523809539| 0.9245946589975053| 1.5708299180050451| 0.3609009059280113| | x1|-0.33387|0.28171808490950573|-1.1832159566199243|0.44669962096188565| | x2|-0.11904761904761924| 0.2129588548|-0.5590169943749482| 0.6754896416955616| +-++---+---+---+ ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16630 [SPARK-19270][ML] Add summary table to GLM summary ## What changes were proposed in this pull request? Add R-like summary table to GLM summary, which includes feature name (if exist), parameter estimate, standard error, t-stat and p-value. This allows scala users to easily gather these commonly used inference results. @srowen @yanboliang ## How was this patch tested? New tests. One for testing feature Name, and one for testing the summary Table. You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark glmTable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16630.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16630 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Finally, the test is done. Is there anything else needed for this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Still not testing... Been in the status "Asked to test" for a few days now. How can we resolve this? Please help kick off the test. Thanks! @yanboliang @felixcheung @srowen @rxin @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks for the review and comments. I have made a new commit that addressed all your comments. The main change is the new companion object `FamilyAndLink` and factory methods to construct desired `FamilyAndLink` objects from input param map. Please see my inline responses for details. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r96061883 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,9 +316,9 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { -val familyObj = Family.fromName($(family)) -val linkObj = if (isDefined(link)) { - Link.fromName($(link)) +val familyObj = Family.fromParams(this) +val linkObj = if (isDefined(link) || isDefined(linkPower)) { + Link.fromParams(this) } else { familyObj.defaultLink } --- End diff -- Makes sense. I created a companion object `FamilyAndLink` with a factory method to construct desired `FamilyAndLink` objects from the input param map. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r96061873 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine private[regression] object Link { /** - * Gets the [[Link]] object from its name. + * Gets the [[Link]] object based on link and linkPower. + * 1) if family is "tweedie", retrieve object using linkPower + * 2) otherwise, retrieve object based on link name * - * @param name link name: "identity", "logit", "log", - * "inverse", "probit", "cloglog" or "sqrt". + * @param params the parameter map containing link and link power */ -def fromName(name: String): Link = { - name match { -case Identity.name => Identity -case Logit.name => Logit -case Log.name => Log -case Inverse.name => Inverse -case Probit.name => Probit -case CLogLog.name => CLogLog -case Sqrt.name => Sqrt +def fromParams(params: GeneralizedLinearRegressionBase): Link = { + if (params.getFamily == "tweedie") { +params.getLinkPower match { + case 0.0 => Log + case 1.0 => Identity + case -1.0 => Inverse + case 0.5 => Sqrt + case others => new PowerLink(others) +} --- End diff -- The last version of the code only allows setting `linkPower` for tweedie, or `link` for non-tweedie. But now since we only give warnings, I need to add the logic for checking the correct specification. In the `FamilyAndLink` companion object, I use the following logic. When family is not tweedie and link is set, or when family is tweedie and linkPower is set, I will use `Link.fromParams` to construct the Link object. Otherwise, use the default link from the corresponding family. So, if the user does not specify `linkPower`, the link object will be the default `new Power(1 - variancePower)` set in the `Tweedie` class. The test for `tweedie family against glm (default power link)` covers this. The reason for not setting a default linkPower in the param map is to mimic the existing behavior: for tweedie with variancePower = 0, this will be the Gaussian with the identity link; for tweedie with variancePower = 1, this will be the Poisson with the log link; etc. ``` val linkObj = if ((params.getFamily != "tweedie" && params.isDefined(params.link)) || (params.getFamily == "tweedie" && params.isDefined(params.linkPower))) { Link.fromParams(params) } else { familyObj.defaultLink } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks. Look forward to your feedback. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Sorry about closing this prematurely. I'm giving it another shot and I think I have an elegant solution to include `linkPower`. The new commit adds the following: 1. It implements the broad family of power link function, specified through `linkPower`. There is now a `PowerLink` class for the power link function. It has subclasses `Identity`, `Log`, `Inverse` and `Sqrt`. With this, the GLM now supports all distributions characterized by power variance function and power link function. For now, I restrict the `linkPower` to be in `[-10, 10]` for numerical stability, but can change that. 2. The key to avoid all the messy coding is to use **only** `link` for non-tweedie family, and **only** `lnkPower` for tweedie family, as @yanboliang suggested. I have added validation for this, and also the `fromParams` method in `Link` to get the correct Link object based on `link` and `linkPower`. 3. I added new tests to test tweedie with default link. For example, this reproduces the Gaussian GLM estimate when `variancePower = 0` (default link would be identity). Similarly, this addresses @yanboliang example where the result from `val trainer = new GeneralizedLinearRegression().setFamily("tweedie").setVariancePower(1.5)` produces the same estimate as in R `glm(formula = "b ~ .", family = tweedie(var.power=1.5), data = df)`. @yanboliang @srowen Would you please take another look and let me know if there is additional changes needed? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16344 [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution. @yanboliang @srowen @sethah I propose to add support for the other distributions: - compound Poisson: 1 < varPower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc. - positive stable: varPower > 2 and varPower != 3. Used to model extreme values. - inverse Gaussian: varPower = 3. The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc. Changes made: - Allow `tweedie` in family. Only `identity` and `log` links are allowed for now. - Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and (2, infty). Also set default value to 1.5 and add getter method. - Add `Tweedie` class - Add tests for tweedie GLM Note: - In computing deviance, use `math.max(y, 0.1)` to avoid taking inverse of 0. This is the same as in R: `tweedie()$dev.res` - `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases are non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`). You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark tweedie Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16344 commit 952887e485fb0d5fa669b3b4c9289b8069ee7769 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-16T00:50:51Z Add Tweedie family to GLM commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-19T22:50:02Z Fix calculation in dev resid; Add test for different var power commit 7fe39106332663d3671b94a8ffac48ca61c48470 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-19T23:14:37Z Merge test into GLR commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-20T22:59:05Z Use Tweedie class instead of global object Tweedie; change variancePower to varPower commit a8feea7d8095170c1b5f18b7887f16a6d763e42c Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-21T23:42:40Z Allow Family to use GLRBase object directly commit 233e2d338be8d36a74eaf578bfea804ae3617d4e Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-22T01:56:34Z Add TweedieFamily and implement specific distn within Tweedie commit 17c55816c914bc96a8b5141356e3c117f343f303 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-22T04:39:54Z Clean up doc commit 0b41825e99020976a34d8fe9c983f26de6c8c40f Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-22T17:52:01Z Move defaultLink and name to subclass of TweedieFamily commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-22T18:10:51Z Change style for AIC commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-23T19:10:20Z Rename Family methods and restore methods for tweedie subclasses commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-23T19:12:25Z Update test commit 9a71e89f629260c775922901a04c989f36ea4946 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-27T17:16:40Z Clean up doc commit f461c09e65360f695ad3092b41bc26e0c61bbd95 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-27T22:18:39Z Put delta in Tweedie companion object commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-27T22:23:57Z Clean up doc commit fab265278109eede4cce7ee506e8b29d481c4549 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2017-01-05T19:32:06Z Allow more link functions in tweedie --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your pro
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang I'm closing this PR since it does not seem to be very clean to integrate into the current GLM setup. I appreciate all the comments and discussions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/16344 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks for the feedback. However, I'm not sure why we need to be consistent with R on this one. The usage of 'tweedie' glm almost always uses `link.power = 0, 1, -1`. There is barely any use case for the default `1-variancePower` link. I think a better design is to set the default to the one that is used more often, and make it clear that `we only support identity, log, inverse and sqrt link functions`. On the other hand, even if we want to be consistent with R, I actually wrote the R package `cplm` which fits the tweedie [glm](http://finzi.psych.upenn.edu/library/cplm/html/cpglm.html) as well as [mixed effect model](https://www.rdocumentation.org/packages/cplm/versions/0.7-4/topics/cpglmm). In that package I set the default to the `log` link and never had an issue. I don't think I will implement `linkPower`, but I'm fine if you want to do it. However, if the lack of `linkPower` means this PR won't get merged, please let me know and we can close this and move on. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849556 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param params the parameter map containing family name and variance power */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromParams(params: GeneralizedLinearRegressionBase): Family = { + params.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + params.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new Tweedie(default) --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849540 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -365,7 +401,6 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** * A description of the error distribution to be used in the model. * - * @param name the name of the family. --- End diff -- Sorry. added this back. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849501 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -158,6 +183,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val setDefault(family -> Gaussian.name) /** +* Sets the value of param [[variancePower]]. +* Used only when family is "tweedie". +* +* @group setParam +*/ + @Since("2.2.0") + def setVariancePower(value: Double): this.type = set(variancePower, value) + setDefault(variancePower -> 1.5) --- End diff -- Done. change default variancePower to 0.0, which will use Gaussian (with default identity link) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks for the detailed review. I have made all changes you suggested except for the part on the new power link function. Yes, the canonical link in the Tweedie in general is `1.0 - variancePower`. But other than the special cases of 'identity', 'log', 'inverse', it is rarely used. The implementation of the generic power link function is unnecessary in my opinion for the following reasons: 1. The default link in the Tweedie is overridden in Gaussian, Poisson and Gamma to the respective canonical link. So when one specifies tweedie with variancePower = 0, 1, 2, they will have the correct canonical link. I now allow Tweedie to take link function 'identity', 'log', 'inverse', 'sqrt', which cover almost all possible links that will be used in practice. 2. If we allow an additional 'linkPower" parameter, we will have two ways to specify the link: one through 'link' and one through 'linkPower'. This will be confusing. 3. The implementation of the 'linkPower' is messy with almost no additional gain. Have to write new powerLink class, fromParams method which determines the link function based on 'link', 'linkPower' as well as 'family', and also various checks (consistency between link and linkPower, supported Family and link, etc). I tried it, and it's very ugly. If we really want a power link function, can I suggest that we implement it in a separate PR since it is really not specific to the 'tweedie' family? It can also be used together with other distributions. @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Did you get a chance to take another look at this? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen Made a new commit according to your suggestion. Everything looking good now? @yanboliang --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang Any additional issues regarding this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks much for the detailed comments. I have addressed all of them in the new commits. Please take another look. @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93672741 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, +"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog, +"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt, +"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log, +"tweedie" -> Identity, "tweedie" -> Log ) /** Set of family names that GeneralizedLinearRegression supports. */ - private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1.name) + private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1) /** Set of link names that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) private[regression] val epsilon: Double = 1E-16 + /** Constant used in initialization and deviance to avoid numerical issues. */ + private[regression] val delta: Double = 0.1 --- End diff -- They are already in the `GeneralizedLinearRegression` object, aren't they? Or do you mean creating a new object say `Constant` that stores these two constants, and using them like `Constant.delta`? Since `delta` is only used in the `TweedieFamily` class, I can also move it there. Let me know what is best. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen Thanks for the comments. Makes lots of sense to move the switch to subclass. I did not know one could override a `val`. In the new commit, I have moved the `defaultLink` and `aic` to subclasses. I also added back `name` in the constructor. The other methods `initialize`, `variance`, `deviance` and `project` are common and thus implemented in the `TweedieFamily` parent class. Hope this makes sense. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang Thanks much for the feedback. I now have a better understanding of the code and the issue. I have made new commits reflecting your suggestions. The major changes are summarized below: - Create new method `fromModel` in the `Family` object to get the desired family object by accessing the `GeneralizedLinearRegressionBase` object directly. This solves the main challenge of setting the variancePower of the Tweedie family. The old `fromName` method is deleted since we now need both `name` and `variancePower`. - Create `TweedieFamily` as new subclass of `family`. This class encompasses the special cases of Gaussian, Poisson and Gamma, which can be handled together using the general formula based on variancePower. Please let me know if there are any other issues. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93565567 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, --- End diff -- @yanboliang Could you help me understand the issue caused by using string? If I use object, then I have to create a Tweedie object that is not used anywhere else. And also I have to write two methods in `Family`: one returns the global Tweedie object (where the variancePower is preset) and one returns the a TweedieFamily object created using the user-specified variancePower. I hope we are fine using string since there are only a few values. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93565335 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides --- End diff -- changed tweedie. but other docs have been using Param.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93486159 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, --- End diff -- @yanboliang The member object (in `GeneralizedLinearRegressionBase`) won't be accessible in `Family`, right? The method `Family.fromName($(family))` uses global objects like `Poisson`, `Gamma` etc. To use `Family.fromName`, I need to create a `Tweedie` global object. Then we are back to the issue that @srowen pointed out of setting `variancePower` of the global object. Please advise. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93482978 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { -val familyObj = Family.fromName($(family)) +val familyObj = if ($(family) == "tweedie") { + new Tweedie($(varPower)) --- End diff -- `Family` is not subclass of `GeneralizedLinearRegression`. Could you elaborate how to make it get the `variancePower` value? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen Thanks for the comments. Really helpful. I have made a new commit that addresses the issues you raised: - I think the use of a global family object does not work well for the tweedie case, since we have to set the variance power. I now define a tweedie class and create the tweedie object within the `train` method of `GeneralizedLinearRegression` where the variance power is set. Does this make sense? - I created a constant `delta = 0.1` in the `Family` class which is used to shift `mu (or y) = 0` to avoid numerical issues. - I now change `variancePower` to `varPower` to save typing (and consistent with R and H2o). - The `project` method you asked about is copied from the other families, which I suppose is to bound the mean to stabilize estimation. - I now throw `UnsupportedOperationException` for tweedie AIC. Let me know if there is any other issues. Copying others for reviewing. @yanboliang @sethah --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290858 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) --- End diff -- I have not seen a formal justification for the choice of 0.1 in R. This seminal [paper](http://users.du.se/~lrn/StatMod10/HomeExercise2/Nelder_Pregibon.pdf) suggests 1/6 (about 0.17) to be the best constant. I would prefer to be consistent with R so that we can make comparison. Using a constant is a good idea. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93289668 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 --- End diff -- Would you please suggest a better way to set the variancePower? I want to be consistent with the existing code to have the `Family` objects, but I need to also pass on the input `variancePower` to the `Tweedie` object which is used to compute the variance function. Any suggestion will be highly appreciated. @srowen @yanboliang --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, add to whitelist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16344 [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution. @yanboliang @srowen @sethah I propose to add support for the other distributions: - compound Poisson: 1 < variancePower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc. - positive stable: variancePower > 2 and variancePower != 3. Used to model extreme values. - inverse Gaussian: variancePower = 3. The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc. Changes made: - Allow `tweedie` in family. Only `identity` and `log` links are allowed for now. - Add `variancePower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and [3, infty). Also set default value to 1.5 and add getter method. - `Family.fromName` has a second argument `variancePower` - Add `Tweedie` object - Add tests for tweedie GLM Note: - In computing deviance, use `math.max(y, 0.1)` to avoid taking inverse of 0. This is the same as in R: `tweedie()$dev.res` - `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases are non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`). You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark tweedie Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16344 commit 952887e485fb0d5fa669b3b4c9289b8069ee7769 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-16T00:50:51Z Add Tweedie family to GLM commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-19T22:50:02Z Fix calculation in dev resid; Add test for different var power commit 7fe39106332663d3671b94a8ffac48ca61c48470 Author: actuaryzhang <actuaryzhan...@gmail.com> Date: 2016-12-19T23:14:37Z Merge test into GLR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @srowen @sethah Thanks for all the helpful discussions! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @srowen @sethah One more commit that adds a test case with `weight = 4.7` which will round up to 5 to test the case @sethah described. All tests passed. I'm pretty sure R's rounding is the same as what I'm doing here. Please merge if there is no other issue. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @sethah Would you please review this? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @sethah @srowen I updated the documentation. I think we have everything needed for this fix. Please merge and close this PR if there is no other issue. Thanks much for all the comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91643659 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -468,11 +469,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine override def variance(mu: Double): Double = mu * (1.0 - mu) private def ylogy(y: Double, mu: Double): Double = { - if (y == 0) { -0.0 - } else { -y * math.log(y / mu) - } + if (y == 0) 0.0 else y * math.log(y / mu) --- End diff -- I actually prefer not putting the whole thing in one line. This is more readable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91643515 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite val datasetWithWeight = Seq( Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse), Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)), - Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)), + Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)), --- End diff -- If you mean input weight = 0, then it is better to be tested separately along with all the other families from the current PR. I think the current test is good for the current PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91563322 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite val datasetWithWeight = Seq( Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse), Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)), - Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)), + Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)), --- End diff -- Isn't `round(0.3) = 0`? Or do you mean testing the raw (input) weight = 0? How is that handled? I would assume that is just leaving out that observation, since that's how R does it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @sethah @srowen I have added a comment to the weigthCol doc for the Binomial case. I also updated to test the case `weight < 0.5`, i.e., `round(weight) = 0`. All tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91440862 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -479,7 +485,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine numInstances: Double, weightSum: Double): Double = { -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => -weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt) +val wt = math.round(weight).toInt +if (wt == 0) { + 0.0 +} else { + dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) --- End diff -- @sethah Code updated as suggested --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @srowen @sethah I have cleaned up the change as suggested. Please review and let me know if there is any question. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 @srowen @sethah Thanks for the comments. Yes, the major use case is to be able to handle multiple trials (integer weight, real-valued response). Indeed, a better way to do this is through `offset`, which I have proposed to do in this JIRA [SPARK-18710](https://issues.apache.org/jira/browse/SPARK-18710). Please let me know if this is worth pursuing. I have submitted another two commits. 1. One commit makes minimal modification to the exiting Binomial GLM test so that one response record is now non-integer. This test still failed because the deviance residual calculation seems to work only for `y in (0, 1)` 2. The second commit fixes the issue in calculating the deviance. But I think the code can be improved, especially regarding the function `y_logy`. What's the best way to create a utility function like this? Please advise. With the two commits, all tests now passed, including the ones on AIC and the deviance. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91246870 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine numInstances: Double, weightSum: Double): Double = { -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) => -weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt) +val wt = math.round(weight).toInt +if (wt == 0) { + 0.0 +} else { + dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt) --- End diff -- @srowen The current uses [breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/distributions/Binomial.scala) for computing the Binomial density, which does not allow 0 trials. `require(n > 0, "n must be positive!")`. That is why I have to handle the case of `wt = 0` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16131 @srowen Is this ready to be merged? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16131 @srowen Done. Thanks for the suggestion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16131 @sethah Thanks for the review. I have updated according to your suggestion. @yanboliang @srowen Please take another look. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to wrong i...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16131 @srowen @yanboliang I have updated the code and further cleaned up the test. Please review and let me know if there is any question. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to ...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16131#discussion_r90965352 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -505,7 +505,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine override def initialize(y: Double, weight: Double): Double = { require(y >= 0.0, "The response variable of Poisson family " + s"should be non-negative, but got $y") - y + y + 0.1 --- End diff -- @srowen I verified that the following approaches give the same results for a few data sets a)`y + 0.1` b) `if (y == 0) y + 0.1 else y` c)` math.max(0.1, y)` Which one do you want to use, c)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16131: [SPARK-18701][ML] Fix Poisson GLM failure due to ...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16131#discussion_r90955908 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -505,7 +505,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine override def initialize(y: Double, weight: Double): Double = { require(y >= 0.0, "The response variable of Poisson family " + s"should be non-negative, but got $y") - y + y + 0.1 --- End diff -- @srowen Theoretically, one only needs to add 0.1 to the y = 0 case, which is a guess of the mean for those cases. But I think it may be better to add this small number to all cases. Imagine that one models the rates of occurrence, i.e., frequency divided by exposure. For certain large exposure, the rate can get tiny and close to zero. Adding 0.1 to that may help avoid numerical issues too in that case. Does that make sense? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16149 Jenkins, add to whitelist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org