Vincent Warmerdam created SPARK-10523:
-----------------------------------------
Summary: SparkR formula syntax to turn strings/factors into numerics
Key: SPARK-10523
URL: https://issues.apache.org/jira/browse/SPARK-10523
Project: Spark
Issue Type: Bug
Reporter: Vincent Warmerdam

In normal (non-SparkR) R, the formula syntax allows strings or factors to be turned into dummy variables automatically when calling a classifier. As a result, the following R code is legal and commonly used:

{code}
library(magrittr)

df <- data.frame(
  class = c("a", "a", "b", "b"),
  i     = c(1, 2, 5, 6))

glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this:

{code}
> ddf <- sqlContext %>% createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
	at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
	at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
	at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
	at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
	at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.refl
{code}

This can be worked around with a bit of manual labor, since SparkR does accept boolean columns as if they were integers here:
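For context, the dummy-variable expansion that base R's formula machinery performs can be seen directly with {{model.matrix}} (plain R, outside Spark; shown only to illustrate what a SparkR user currently has to replicate by hand):

{code}
df <- data.frame(
  class = c("a", "a", "b", "b"),
  i     = c(1, 2, 5, 6))

# A string/factor column in a formula is expanded into
# 0/1 indicator columns automatically ("a" is the reference level):
model.matrix(~ class, data = df)
#   (Intercept) classb
# 1           1      0
# 2           1      0
# 3           1      1
# 4           1      1
{code}

The {{withColumn}} workaround below reproduces exactly this encoding for the binary case.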
{code}
> ddf <- ddf %>% withColumn("to_pred", .$class == "a")
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this quickly becomes tedious, especially for models that involve multiple classes, each of which needs its own indicator column. Is there a good reason why this should not be a feature of formulas in Spark?

I am aware of SPARK-8774, which looks like it is addressing a similar theme but a different issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)