[ https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Warmerdam updated SPARK-10523:
--------------------------------------
    Description: 
In plain (non-SparkR) R, the formula syntax turns strings or factors into dummy 
variables automatically when a classifier is called. Because of this, the 
following R code is legal and commonly used. 

{code}
library(magrittr)  # provides the %>% pipe used in the SparkR snippets below

# A two-level string/factor label is accepted directly as a glm() response.
df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}
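
For reference, the implicit dummy encoding that base R applies to a two-level 
response is roughly equivalent to building the 0/1 label by hand. A sketch for 
comparison only; it relies on base R treating the first factor level as 0 and 
the second as 1:

{code}
# Roughly what the formula does for a two-level response: map the first
# level ("a") to 0 and the second level ("b") to 1 before fitting.
df$class_num <- as.integer(df$class == "b")
glm(class_num ~ i, family = "binomial", data = df)
{code}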

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>%
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
        at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
        at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
        at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
        at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
        at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.refl
{code}

This can be worked around with a bit of manual labor: SparkR does accept a 
boolean column as if it were an integer label here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}
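
For a single binary target the manual encoding can be wrapped in a small 
helper. This is a hypothetical sketch, not part of SparkR; the name 
{{with_binary_label}} and its arguments are made up here:

{code}
# Hypothetical helper (not part of SparkR): adds a boolean column that is
# TRUE where `col` equals `positive`, so the result can go straight to glm().
with_binary_label <- function(ddf, col, positive, label_col = "to_pred") {
  withColumn(ddf, label_col, ddf[[col]] == positive)
}

ddf2 <- with_binary_label(ddf, "class", "a")
glm(to_pred ~ i, family = "binomial", data = ddf2)
{code}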

But this quickly becomes tedious, especially for models over targets with 
multiple classes. It is perhaps less of an issue for logistic regression 
(which is more of a one-off regression, unless you want to fit one model per 
class), but it certainly matters if you want to use a formula for a random 
forest. 
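
To make the tedium concrete, here is a one-vs-rest sketch for a multi-class 
target: one indicator column and one binomial model per class level. It is 
only meant to illustrate the boilerplate the formula syntax could remove, not 
a proposed API:

{code}
# One indicator column and one binomial model per class level (one-vs-rest).
classes <- collect(distinct(select(ddf, "class")))$class
models <- lapply(classes, function(cls) {
  label <- paste0("is_", cls)
  ddf_cls <- withColumn(ddf, label, ddf$class == cls)
  glm(formula(paste0(label, " ~ i")), family = "binomial", data = ddf_cls)
})
{code}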

Is there a good reason why this should not be a feature of formulas in Spark? 
I am aware of SPARK-8774, which looks like it addresses a similar theme but a 
different issue. 


> SparkR formula syntax to turn strings/factors into numerics
> -----------------------------------------------------------
>
>                 Key: SPARK-10523
>                 URL: https://issues.apache.org/jira/browse/SPARK-10523
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Vincent Warmerdam
>


