GitHub user viirya commented on the issue:

    https://github.com/apache/spark/pull/20146

Hmm, I have reconsidered https://github.com/apache/spark/pull/20146#pullrequestreview-87070102. Even if we use a dataset without duplicate values, we still can't get the same results when the string index order used by R glm differs from the one used by RFormula, because it looks like R glm doesn't follow frequency or alphabetical order. For example, I tried the dataset Puromycin:

```R
> training <- suppressWarnings(createDataFrame(Puromycin))
> stats <- summary(spark.glm(training, conc ~ rate + state))
> rStats <- summary(glm(conc ~ rate + state, data = Puromycin))
> rStats$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)    -0.595461828 0.157462177 -3.781618 1.171709e-03
rate            0.006642461 0.001022196  6.498228 2.464757e-06
stateuntreated  0.136323828 0.095090605  1.433620 1.671302e-01
> stats$coefficients
                  Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)   -0.459138000 0.130420375 -3.520447 2.150817e-03
rate           0.006642461 0.001022196  6.498228 2.464757e-06
state_treated -0.136323828 0.095090605 -1.433620 1.671302e-01
```

You can see that, because the string index of the state column still differs between R glm and RFormula, we can't get the same results. A workaround is to use a dataset that doesn't need string indexing. What do you think? @felixcheung
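
One way to read the two coefficient tables, offered as a sketch rather than a statement about either implementation: the fits appear to differ only in which level of state serves as the reference. Writing $u$ for the untreated dummy reported by R glm and $t = 1 - u$ for the treated dummy reported by RFormula (the symbols $u$, $t$, $\beta_0$, $\beta_r$ and $\beta_u$ are notation introduced here and do not appear in the outputs), the R fit can be rewritten as:

```latex
\begin{aligned}
\hat{y} &= \beta_0 + \beta_r\,\mathrm{rate} + \beta_u\,u \\
        &= (\beta_0 + \beta_u) + \beta_r\,\mathrm{rate} - \beta_u\,t,
           \qquad u = 1 - t \\
\beta_0 + \beta_u &= -0.595461828 + 0.136323828 = -0.459138000 \\
-\beta_u &= -0.136323828
\end{aligned}
```

These values match the (Intercept) and state_treated estimates reported by spark.glm, and the rate row is identical in both tables, so only the intercept and state rows depend on the choice of reference level.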