[jira] [Updated] (SPARK-18701) Poisson GLM fails due to wrong initialization
[ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18701: -- Assignee: Wayne Zhang > Poisson GLM fails due to wrong initialization > - > > Key: SPARK-18701 > URL: https://issues.apache.org/jira/browse/SPARK-18701 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang >Priority: Critical > Fix For: 2.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Poisson GLM fails for many standard data sets. The issue is incorrect > initialization leading to almost zero probability and weights. The following > simple example reproduces the error. > {code:borderStyle=solid} > val datasetPoissonLogWithZero = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(1.0, Vectors.dense(12, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(1.0, Vectors.dense(16, 1.0)), > LabeledPoint(0.0, Vectors.dense(10, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(0.0, Vectors.dense(13, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(1.0, Vectors.dense(12, 2.0)) > ).toDF() > > val glr = new GeneralizedLinearRegression() > .setFamily("poisson") > .setLink("log") > .setMaxIter(20) > .setRegParam(0) > val model = glr.fit(datasetPoissonLogWithZero) > {code} > The issue is in the initialization: the mean is initialized as the response, > which could be zero. Applying the log link results in very negative numbers > (protected against -Inf), which again leads to close to zero probability and > weights in the weighted least squares. The fix is easy: just add a small > constant, highlighted in red below. > > override def initialize(y: Double, weight: Double): Double = { > require(y >= 0.0, "The response variable of Poisson family " + > s"should be non-negative, but got $y") > y {color:red}+ 0.1 {color} > } > I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18701) Poisson GLM fails due to wrong initialization
[ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18701: Shepherd: Sean Owen (was: sean corkum) Issue Type: Bug (was: New Feature) > Poisson GLM fails due to wrong initialization > - > > Key: SPARK-18701 > URL: https://issues.apache.org/jira/browse/SPARK-18701 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Priority: Critical > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Poisson GLM fails for many standard data sets. The issue is incorrect > initialization leading to almost zero probability and weights. The following > simple example reproduces the error. > {code:borderStyle=solid} > val datasetPoissonLogWithZero = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(1.0, Vectors.dense(12, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(1.0, Vectors.dense(16, 1.0)), > LabeledPoint(0.0, Vectors.dense(10, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(0.0, Vectors.dense(13, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(1.0, Vectors.dense(12, 2.0)) > ).toDF() > > val glr = new GeneralizedLinearRegression() > .setFamily("poisson") > .setLink("log") > .setMaxIter(20) > .setRegParam(0) > val model = glr.fit(datasetPoissonLogWithZero) > {code} > The issue is in the initialization: the mean is initialized as the response, > which could be zero. Applying the log link results in very negative numbers > (protected against -Inf), which again leads to close to zero probability and > weights in the weighted least squares. The fix is easy: just add a small > constant, highlighted in red below. > > override def initialize(y: Double, weight: Double): Double = { > require(y >= 0.0, "The response variable of Poisson family " + > s"should be non-negative, but got $y") > y {color:red}+ 0.1 {color} > } > I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org