GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16344
[SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?

I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as the Gaussian, Poisson and Gamma families are special cases of the Tweedie family: https://en.wikipedia.org/wiki/Tweedie_distribution.

@yanboliang @srowen @sethah

I propose to add support for the other distributions:
- compound Poisson: 1 < varPower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc.
- positive stable: varPower > 2 and varPower != 3. Used to model extreme values.
- inverse Gaussian: varPower = 3.

The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed for now.
- Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and (2, infty). Also set the default value to 1.5 and add a getter method.
- Add `Tweedie` class
- Add tests for tweedie GLM

Note:
- In computing the deviance, use `math.max(y, 0.1)` to avoid taking the inverse of 0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases is non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`).
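For readers who want to see the quantities the description refers to, here is a minimal sketch in plain Python (not the Scala code in this PR) of the power variance function and a unit deviance with the `math.max(y, 0.1)` floor mentioned in the note. The names `check_var_power`, `variance`, `unit_deviance` and `DELTA` are hypothetical and chosen only for illustration. The deviance formula is the standard Tweedie unit deviance for varPower not equal to 1 or 2, and applying the floor only inside the first term is an assumption about how the note is meant to be read.

```python
# Illustrative sketch only; names and structure are assumptions, not Spark's API.
DELTA = 0.1  # floor for y, following the math.max(y, 0.1) note in the PR text


def check_var_power(var_power: float) -> None:
    """Accept only the ranges the PR lists for varPower: (1, 2) and (2, inf)."""
    if not (1.0 < var_power < 2.0 or var_power > 2.0):
        raise ValueError(f"var_power must be in (1, 2) or (2, inf), got {var_power}")


def variance(mu: float, var_power: float) -> float:
    """Power variance function V(mu) = mu ** var_power."""
    check_var_power(var_power)
    return mu ** var_power


def unit_deviance(y: float, mu: float, var_power: float) -> float:
    """Tweedie unit deviance 2 * (y * theta - kappa) for var_power not in {1, 2}."""
    check_var_power(var_power)
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    if y < 0.0 or (var_power > 2.0 and y == 0.0):
        raise ValueError("y must be >= 0, and > 0 when var_power > 2")
    # Assumption: the floor is applied only where y is raised to a negative power.
    y1 = max(y, DELTA)
    theta = (y1 ** (1.0 - var_power) - mu ** (1.0 - var_power)) / (1.0 - var_power)
    kappa = (y ** (2.0 - var_power) - mu ** (2.0 - var_power)) / (2.0 - var_power)
    return 2.0 * (y * theta - kappa)
```

With `var_power = 3.0` this reduces to the inverse Gaussian deviance `(y - mu)**2 / (y * mu**2)`, so `unit_deviance(2.0, 1.0, 3.0)` evaluates to `0.5`. Note that flooring with `max` also changes the first term for observations strictly between 0 and 0.1, which is a consequence of the rule as stated in the note.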
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16344

----
commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-16T00:50:51Z

    Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-19T22:50:02Z

    Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-19T23:14:37Z

    Merge test into GLR

commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-20T22:59:05Z

    Use Tweedie class instead of global object Tweedie; change variancePower to varPower

commit a8feea7d8095170c1b5f18b7887f16a6d763e42c
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-21T23:42:40Z

    Allow Family to use GLRBase object directly

commit 233e2d338be8d36a74eaf578bfea804ae3617d4e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T01:56:34Z

    Add TweedieFamily and implement specific distn within Tweedie

commit 17c55816c914bc96a8b5141356e3c117f343f303
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T04:39:54Z

    Clean up doc

commit 0b41825e99020976a34d8fe9c983f26de6c8c40f
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T17:52:01Z

    Move defaultLink and name to subclass of TweedieFamily

commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-22T18:10:51Z

    Change style for AIC

commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-23T19:10:20Z

    Rename Family methods and restore methods for tweedie subclasses

commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-23T19:12:25Z

    Update test

commit 9a71e89f629260c775922901a04c989f36ea4946
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-27T17:16:40Z

    Clean up doc

commit f461c09e65360f695ad3092b41bc26e0c61bbd95
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-27T22:18:39Z

    Put delta in Tweedie companion object

commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-27T22:23:57Z

    Clean up doc

commit fab265278109eede4cce7ee506e8b29d481c4549
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2017-01-05T19:32:06Z

    Allow more link functions in tweedie

----