[ https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242739#comment-15242739 ]
Yanbo Liang edited comment on SPARK-14659 at 4/15/16 10:08 AM:
---------------------------------------------------------------

Take the following R code as an example:
{quote}
df=data.frame(id = c(1, 2, 3, 4), a = c("foo", "bar", "bar", "baz"), b = c(4, 4, 5, 5))
summary(glm(id ~ a + b, data = df))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)       -2         NA      NA       NA
abaz               1         NA      NA       NA
afoo              -1         NA      NA       NA
b                  1         NA      NA       NA
{quote}
R drops "bar" when encoding the string/category feature, because it is the first category alphabetically. Spark RFormula, however, drops "baz" because it has the lowest frequency. See RFormulaSuite (https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala#L92).

> OneHotEncoder support drop first category alphabetically in the encoded
> vector
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-14659
>                 URL: https://issues.apache.org/jira/browse/SPARK-14659
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Yanbo Liang
>
> R formula drops the first category alphabetically when encoding a string/category
> feature. Spark RFormula uses OneHotEncoder to encode string/category features
> into vectors, but it only supports "dropLast" by string/category frequencies.
> This will cause SparkR to produce different models compared with native R.
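
The difference between the two conventions can be sketched in a few lines of plain Python, using the values of column `a` from the R example. This is only an illustration and calls neither R nor Spark. The function names are hypothetical, and breaking frequency ties by first appearance is an assumption of this sketch, not documented Spark behavior.

```python
from collections import Counter


def split_alphabetical(values):
    """R-style treatment coding: levels are sorted and the first one is dropped."""
    levels = sorted(set(values))
    return levels[0], levels[1:]


def split_by_frequency(values):
    """Frequency-based indexing followed by "drop last".

    Levels are ordered by descending frequency and the last one is dropped.
    Assumption of this sketch: ties keep first-appearance order.
    """
    counts = Counter(values)  # keys keep first-appearance order
    # sorted() is stable, so equally frequent levels stay in appearance order
    levels = sorted(counts, key=lambda level: -counts[level])
    return levels[-1], levels[:-1]


a = ["foo", "bar", "bar", "baz"]  # column `a` from the R example

dropped_r, kept_r = split_alphabetical(a)
dropped_freq, kept_freq = split_by_frequency(a)

print("alphabetical:", "drop", dropped_r, "keep", kept_r)
print("by frequency:", "drop", dropped_freq, "keep", kept_freq)
```

Under these assumptions the alphabetical rule drops "bar" and keeps "baz" and "foo", which matches the `abaz` and `afoo` coefficients in the R output, while the frequency rule drops "baz" and keeps "bar" and "foo". Since the dropped category differs, the fitted coefficients are not directly comparable between the two encodings.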