[ https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shivaram Venkataraman updated SPARK-10523:
------------------------------------------
    Component/s: SparkR

> SparkR formula syntax to turn strings/factors into numerics
> ------------------------------------------------------------
>
>                 Key: SPARK-10523
>                 URL: https://issues.apache.org/jira/browse/SPARK-10523
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, SparkR
>            Reporter: Vincent Warmerdam
>
> In plain (non-SparkR) R, the formula syntax turns strings or factors into dummy variables automatically when a classifier is called, so the following pattern is legal and widely used:
> {code}
> library(magrittr)
> df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method knows that `class` is a string/factor and handles it appropriately by casting it to a 0/1 array before fitting anything. SparkR does not do this:
> {code}
> > ddf <- sqlContext %>%
>     createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>     at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>     at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>     at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>     at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>     at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>     at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.refl
> {code}
> This can be worked around with a bit of manual labor, since SparkR does accept booleans as if they were integers:
> {code}
> > ddf <- ddf %>%
>     withColumn("to_pred", .$class == "a")
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this quickly becomes tedious, especially for models that classify over multiple classes. That is perhaps less relevant for logistic regression (which is more of a one-off, binary classification approach), but it certainly matters when you want to use a formula for a random forest and a column denotes, say, the type of flower in the iris dataset.
> Is there a good reason why this should not be a feature of formulas in Spark? I am aware of issue 8774, which looks like it is addressing a similar theme but a different issue.
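To make the multi-class tedium mentioned above concrete, here is a minimal, untested sketch. It assumes a working `sqlContext` and magrittr piping as in the examples above, and that `createDataFrame` renames iris columns such as `Sepal.Length` to `Sepal_Length`; the `is_*` column names and `m_*` model names are purely illustrative, not part of any existing API.

{code}
library(magrittr)

# Assumed: a working sqlContext, as in the examples above.
iris_ddf <- sqlContext %>% createDataFrame(iris)

# One hand-made boolean (0/1) column per class -- exactly the bookkeeping that
# formula support for string/factor labels would make unnecessary.
iris_ddf <- iris_ddf %>%
  withColumn("is_setosa",     .$Species == "setosa") %>%
  withColumn("is_versicolor", .$Species == "versicolor") %>%
  withColumn("is_virginica",  .$Species == "virginica")

# And one separate one-vs-rest model per class.
m_setosa     <- glm(is_setosa     ~ Sepal_Length + Sepal_Width, family = "binomial", data = iris_ddf)
m_versicolor <- glm(is_versicolor ~ Sepal_Length + Sepal_Width, family = "binomial", data = iris_ddf)
m_virginica  <- glm(is_virginica  ~ Sepal_Length + Sepal_Width, family = "binomial", data = iris_ddf)
{code}

If the formula handled string/factor labels itself, all of this bookkeeping would collapse into a single call along the lines of glm(Species ~ Sepal_Length + Sepal_Width, ...).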