[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanbo Liang updated SPARK-14657:
--------------------------------
    Description:
SparkR::glm outputs different features compared with R glm.

SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12  ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11  ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137    *
Speciesvirginica   0.6690    0.3078      2.174    0.0313    *
{quote}
The encoding of string-typed features is different. R does not drop any category, but SparkR drops the last one.
I searched online and tested some other cases, and found that when we fit an R glm model (or any other model powered by R formula) without an intercept on a dataset that includes string/categorical features, no level of the first categorical feature is used as the reference level, so no category is dropped for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.

  was:
SparkR::glm outputs different features compared with R glm.

SparkR::glm
{quote}
training <- suppressWarnings(createDataFrame(sqlContext, iris))
model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
summary(model)
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
Sepal_Length        0.67468   0.0093013   72.536   0
Species_versicolor  -1.2349   0.07269     -16.989  0
Species_virginica   -1.4708   0.077397    -19.003  0
{quote}
stats::glm
{quote}
summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
Sepal.Length       0.3499    0.0463      7.557    4.19e-12  ***
Speciessetosa      1.6765    0.2354      7.123    4.46e-11  ***
Speciesversicolor  0.6931    0.2779      2.494    0.0137    *
Speciesvirginica   0.6690    0.3078      2.174    0.0313    *
{quote}
The encoding of string-typed features is different. R does not drop any category, but SparkR drops the last one.
I searched online and tested some other cases, and found that when we fit an R glm model (or any other model powered by R formula) without an intercept on a dataset that includes string/categorical features, no level of the first categorical feature is used as the reference level, so no category is dropped for that feature.
I think we should keep consistent semantics for Spark RFormula.


> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
>                 Key: SPARK-14657
>                 URL: https://issues.apache.org/jira/browse/SPARK-14657
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
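
The encoding difference described in this issue can be checked without fitting a model by looking at the design-matrix columns that base R builds for the same formula. The following is a minimal sketch using the built-in iris data; the variable names `with_intercept` and `without_intercept` are illustrative only and do not come from the issue.

```r
# Column names of the design matrix R builds for each formula.
# model.matrix() is what stats::glm uses internally to encode features.

# With an intercept, the first Species level (setosa) is the reference
# level and gets no column of its own.
with_intercept <- colnames(model.matrix(~ Sepal.Length + Species, data = iris))

# Without an intercept ("- 1"), R keeps every level of the first
# categorical feature, so setosa gets its own column.
without_intercept <- colnames(model.matrix(~ Sepal.Length + Species - 1, data = iris))

print(with_intercept)
# "(Intercept)" "Sepal.Length" "Speciesversicolor" "Speciesvirginica"
print(without_intercept)
# "Sepal.Length" "Speciessetosa" "Speciesversicolor" "Speciesvirginica"
```

The second result matches the four coefficients in the stats::glm summary quoted in the description, whereas the SparkR summary lists only three.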