[ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-18715:
------------------------------------

    Assignee: Apache Spark

> Fix wrong AIC calculation in Binomial GLM
> -----------------------------------------
>
>                 Key: SPARK-18715
>                 URL: https://issues.apache.org/jira/browse/SPARK-18715
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Wayne Zhang
>            Assignee: Apache Spark
>            Priority: Critical
>              Labels: patch
>             Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation in Binomial GLM seems to be wrong when there are weights.
> The result is different from that in R.
> The current implementation is:
> {code}
> -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>   weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
> }.sum()
> {code}
> Suggest changing this to
> {code}
> -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>   val wt = math.round(weight).toInt
>   if (wt == 0) {
>     0.0
>   } else {
>     dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
> }.sum()
> {code}
> ----
> ----
> The following is an example to illustrate the problem.
> {code}
> // Assumes spark-shell, where spark.implicits._ is already in scope for toDF().
> import org.apache.spark.ml.feature.LabeledPoint
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.regression.GeneralizedLinearRegression
> import org.apache.spark.sql.functions.col
>
> val dataset = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(16, 1.0))
> ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
>   .setFamily("binomial")
>   .setWeightCol("weight")
>   .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation shows the AIC is 14.189026847171382. To verify whether this
> is correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik
> = 5.660918.
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
>
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weights = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Now, I check whether the proposed change is correct. The following calculates
> -2 * LogLik manually and gets 5.6609177228379055, the same as in R.
> {code}
> import breeze.stats.{distributions => dist}
> import org.apache.spark.sql.Row
>
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {
>   case Row(y: Double, mu: Double, weight: Double) =>
>     val wt = math.round(weight).toInt
>     if (wt == 0) {
>       0.0
>     } else {
>       dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>     }
> }.sum()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
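
For readers who want to compare the two per-row terms from the issue without a Spark or Breeze installation, the following is a minimal sketch in plain Scala. The object name `WeightedBinomialAic` and the helpers `logChoose`, `binomialLogPmf`, `currentTerm` and `proposedTerm` are hypothetical names introduced here for illustration and are not part of Spark. The binomial log-probability is computed by hand as a stand-in for Breeze's `logProbabilityOf`, and the value mu = 0.5 used in `main` is an assumed fitted mean, not one taken from the model in the issue.

```scala
object WeightedBinomialAic {
  /** log of the binomial coefficient C(n, k), as a sum of logs (fine for small n). */
  def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log((n - k + i).toDouble / i)).sum

  /** log P(X = k) for X ~ Binomial(n, p); hand-rolled stand-in for Breeze. */
  def binomialLogPmf(n: Int, p: Double, k: Int): Double = {
    require(0 <= k && k <= n, s"k=$k must lie in [0, $n]")
    // Skip terms with a zero exponent so that p = 0 or p = 1 does not yield NaN.
    val successTerm = if (k == 0) 0.0 else k * math.log(p)
    val failureTerm = if (k == n) 0.0 else (n - k) * math.log(1.0 - p)
    logChoose(n, k) + successTerm + failureTerm
  }

  /** Per-row term following the "current implementation" quoted in the issue. */
  def currentTerm(y: Double, mu: Double, weight: Double): Double =
    weight * binomialLogPmf(1, mu, math.round(y).toInt)

  /** Per-row term following the change suggested in the issue. */
  def proposedTerm(y: Double, mu: Double, weight: Double): Double = {
    val wt = math.round(weight).toInt
    if (wt == 0) 0.0
    else binomialLogPmf(wt, mu, math.round(y * weight).toInt)
  }

  def main(args: Array[String]): Unit = {
    // Row (y = 0.5, weight = 1.5) from the issue's data; mu = 0.5 is assumed.
    println(currentTerm(0.5, 0.5, 1.5))
    println(proposedTerm(0.5, 0.5, 1.5))
  }
}
```

With the assumed mu = 0.5, the row (y = 0.5, weight = 1.5) gives 1.5 * log(0.5), about -1.040, under the current formula, and log(0.5), about -0.693, under the proposed one, since the latter treats the row as 1 success in 2 trials. For rows with weight 1 and a label of 0 or 1 the two terms coincide, so the discrepancy only shows up with non-unit weights or fractional labels. As a side check on the R output quoted in the issue, AIC minus -2 * LogLik is 11.66092 - 5.660918 = 6.000002, roughly 2 * 3, which matches the three estimated coefficients (intercept, x1, x2).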