[ https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-10523:
------------------------------------------
    Component/s: SparkR

> SparkR formula syntax to turn strings/factors into numerics
> -----------------------------------------------------------
>
>                 Key: SPARK-10523
>                 URL: https://issues.apache.org/jira/browse/SPARK-10523
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, SparkR
>            Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
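> library(magrittr)  # provides %>% for the SparkR calls below
> # stringsAsFactors = TRUE keeps `class` a factor on R >= 4.0.0 as well
> df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6),
>                  stringsAsFactors = TRUE)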
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method recognizes that `class` is a string/factor and handles it 
> appropriately, casting it to a 0/1 array before fitting the model. 
> SparkR doesn't do this: 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>       at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>       at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>       at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>       at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>       at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>       at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>       at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>       at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>       at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.refl
> {code}
> This can be worked around with a bit of manual labor, because SparkR accepts 
> a boolean label as if it were an integer: 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this quickly becomes tedious, especially for models that have to 
> distinguish more than two classes. This is perhaps less relevant for 
> logistic regression (because it is more of a one-off, two-class approach), 
> but it certainly is relevant if you want to use a formula for a random 
> forest and a column denotes, say, a type of flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 

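For reference, the label encoding the description asks for can be reproduced by hand in plain R. The snippet below is a minimal sketch under stated assumptions, not Spark's or glm's actual implementation: `encode_label` is a hypothetical helper name, and it assumes the label is a character vector or a factor. It assigns 0 to the first factor level and consecutive integers to the remaining levels, which for two classes yields a 0/1 response.

```r
# Hypothetical helper (not part of R or SparkR): turn a string/factor label
# into zero-based numeric codes, one code per factor level.
encode_label <- function(x) {
  # factor() sorts the distinct values to build its levels
  f <- if (is.factor(x)) x else factor(x)
  as.numeric(f) - 1
}

# Two-class case from the description: "a" -> 0, "b" -> 1
df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6),
                 stringsAsFactors = FALSE)
df$label <- encode_label(df$class)

# Multi-class case, e.g. the flower species in the built-in iris dataset
species_codes <- encode_label(iris$Species)
```

With two classes this matches the boolean `to_pred` workaround up to which class is coded as 1; with more classes each level gets its own code, which is the part that becomes tedious to write by hand for every column.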


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
