Vincent Warmerdam created SPARK-10523:
-----------------------------------------

             Summary: SparkR formula syntax to turn strings/factors into numerics
                 Key: SPARK-10523
                 URL: https://issues.apache.org/jira/browse/SPARK-10523
             Project: Spark
          Issue Type: Bug
            Reporter: Vincent Warmerdam


In normal (non-SparkR) R, the formula syntax lets strings or factors be 
turned into dummy variables automatically when calling a classifier. As a 
result, the following R code is legal and commonly used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}
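
For illustration, the dummy coding that a base R formula performs can be inspected directly with model.matrix. This is only a minimal sketch in plain R (no Spark involved); the data frame `df2` and the column `grp` are made up for this example.

```r
# Plain R: show how a formula expands a factor into dummy columns.
df2 <- data.frame(
  grp = factor(c("a", "a", "b", "c")),  # hypothetical categorical column
  i   = c(1, 2, 5, 6)
)

# model.matrix applies the default treatment contrasts:
# the first level ("a") is the baseline, every other level gets a 0/1 column.
mm <- model.matrix(~ grp + i, data = df2)
print(colnames(mm))
```

On the label side, glm with family = "binomial" treats the first level of a factor response as failure and all other levels as success, which is why a factor label needs no manual conversion there.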

SparkR does not accept the same glm call when the label column is a string. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
        at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
        at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
        at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
        at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
        at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.refl
{code}

This can be worked around with a bit of manual labor, because SparkR does 
accept a boolean label and treats it as if it were an integer here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this becomes quite tedious, especially for models that have to classify 
more than two classes. 
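
To make the tedium concrete: with several classes, the manual approach needs one indicator column per class. Below is a rough sketch in plain R, run on a local data.frame rather than a SparkR DataFrame; the names `df3` and `is_<class>` are made up for this example. In SparkR the equivalent would presumably mean repeating the withColumn call from the workaround once per class.

```r
# Plain R sketch: build one 0/1 label column per class (one-vs-rest).
df3 <- data.frame(
  class = c("a", "a", "b", "c"),
  i     = c(1, 2, 5, 6),
  stringsAsFactors = FALSE
)

# One indicator column per distinct class value.
for (cl in unique(df3$class)) {
  df3[[paste0("is_", cl)]] <- as.integer(df3$class == cl)
}
print(names(df3))
```

Each `is_<class>` column could then serve as the label of a separate binomial model, which is exactly the bookkeeping the formula syntax handles in normal R.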

Is there a good reason why this should not be a feature of formulas in Spark? I 
am aware of SPARK-8774, which appears to address a similar theme but a 
different issue. 



