Maciej Szymkiewicz created SPARK-11086:
------------------------------------------

             Summary: createDataFrame should dropFactor column-wise not 
cell-wise 
                 Key: SPARK-11086
                 URL: https://issues.apache.org/jira/browse/SPARK-11086
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
            Reporter: Maciej Szymkiewicz


At the moment SparkR {{createDataFrame}} [uses a nested 
loop|https://github.com/apache/spark/blob/896edb51ab7a88bbb31259e565311a9be6f2ca6d/R/pkg/R/SQLContext.R#L99]
 to convert {{factors}} to {{character}} when called on a local {{data.frame}}.

{code}
data <- lapply(1:n, function(i) {
    lapply(1:m, function(j) { dropFactor(data[i,j]) })
})
{code}

It works but is very slow, especially with {{data.table}}: roughly two orders 
of magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows 
x 2 columns.

A simple improvement is to apply {{dropFactor}} column-wise and then reshape 
the output list into rows:

{code}
# Drop factors once per column, strip the column names, then zip columns into rows.
args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
data <- do.call(mapply, append(args, setNames(lapply(data, dropFactor), NULL)))
{code}
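
As a rough check that the two approaches agree, here is a minimal sketch on a tiny local {{data.frame}}. The {{dropFactor}} definition and the sample data are assumptions for illustration (a stand-in that turns factors into character and leaves everything else unchanged), not a copy of the SparkR internals.

```r
# Assumed stand-in for SparkR's internal dropFactor: factors become character,
# any other input is returned unchanged.
dropFactor <- function(x) {
  if (is.factor(x)) as.character(x) else x
}

# Hypothetical sample data with one factor column.
df <- data.frame(
  id = 1:3,
  label = factor(c("a", "b", "a")),
  stringsAsFactors = FALSE
)
n <- nrow(df)
m <- ncol(df)

# Cell-wise: n * m calls to dropFactor, each preceded by a df[i, j] lookup.
cellwise <- lapply(1:n, function(i) {
  lapply(1:m, function(j) dropFactor(df[i, j]))
})

# Column-wise: m calls to dropFactor, then mapply zips the columns into rows.
args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
colwise <- do.call(mapply, append(args, setNames(lapply(df, dropFactor), NULL)))

# Compare the two results; both are lists of rows, each row an unnamed list.
same <- identical(cellwise, colwise)
```

The column-wise version calls {{dropFactor}} once per column instead of once per cell, so the number of R-level function calls and {{data.frame}} subsetting operations no longer grows with the number of rows.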

It should at least partially address 
[SPARK-8277|https://issues.apache.org/jira/browse/SPARK-8277].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)