[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-14657:
----------------------------------
    Assignee: Yanbo Liang

> RFormula output wrong features when formula w/o intercept
> ---------------------------------------------------------
>
>                 Key: SPARK-14657
>                 URL: https://issues.apache.org/jira/browse/SPARK-14657
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>
> SparkR::glm outputs different features than R glm when the model is fit
> without an intercept on data that has string/category features. In the
> following example, SparkR outputs three features, whereas native R outputs
> four.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length         0.67468   0.0093013   72.536         0
> Species_versicolor   -1.2349     0.07269  -16.989         0
> Species_virginica    -1.4708    0.077397  -19.003         0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                   Estimate Std. Error t value Pr(>|t|)
> Sepal.Length        0.3499     0.0463   7.557 4.19e-12 ***
> Speciessetosa       1.6765     0.2354   7.123 4.46e-11 ***
> Speciesversicolor   0.6931     0.2779   2.494   0.0137 *
> Speciesvirginica    0.6690     0.3078   2.174   0.0313 *
> {quote}
> The encoders for string/category features differ: R does not drop any
> category, but SparkR drops the last one.
> I searched online and tested some other cases. When an R glm model (or any
> other model driven by an R formula) is fit without an intercept on a dataset
> that includes string/category features, R does not drop any category of the
> first category feature; the category that would otherwise serve as the
> reference category is kept.
> I think we should keep the semantics of Spark RFormula consistent with those
> of R formula.
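The R side of the encoding difference described above can be checked without fitting a model, by looking at the design-matrix columns R builds for the formula. This is only a minimal sketch against native R and the built-in `iris` data; the variable names `no_intercept` and `with_intercept` are illustrative, and it says nothing about how RFormula is implemented.

```r
# Columns of the design matrix R builds for the formula in the report
# (no intercept): every Species level gets its own column.
no_intercept <- colnames(
  model.matrix(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))

# Same formula with the intercept kept: the first Species level (setosa)
# becomes the reference category and has no column of its own.
with_intercept <- colnames(
  model.matrix(Sepal.Width ~ Sepal.Length + Species, data = iris))

print(no_intercept)
print(with_intercept)
```

Without the intercept the columns match the four coefficients in the stats::glm summary quoted above; with the intercept, `Speciessetosa` is replaced by `(Intercept)`.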
> cc [~mengxr]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)