[ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176650#comment-16176650 ]
Weichen Xu commented on SPARK-22105:
------------------------------------

cc [~mlnick] [~cloud_fan]

> Dataframe has poor performance when computing on many columns with codegen
> --------------------------------------------------------------------------
>
>                 Key: SPARK-22105
>                 URL: https://issues.apache.org/jira/browse/SPARK-22105
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 2.3.0
>            Reporter: Weichen Xu
>            Priority: Minor
>
> Suppose we have a dataframe with many columns (e.g. 100 columns), each column of DoubleType,
> and we need to compute the avg of each column. We will find that using dataframe avg
> is much slower than using RDD.aggregate.
> I observed this issue in this PR (one-pass imputer):
> https://github.com/apache/spark/pull/18902
> I also wrote minimal test code to reproduce this issue, using a sum computation:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` over 100 `DoubleType` columns, dataframe avg is
> about 3x slower than `RDD.aggregate`, but if we compute only one column,
> dataframe avg is much faster than `RDD.aggregate`.
> The cause of this issue should be a defect in dataframe codegen. Codegen
> inlines everything and generates one large code block. When the column count
> is large (e.g. 100 columns), the generated method becomes too large, which causes the
> JVM to refuse to JIT-compile it and fall back to bytecode interpretation.
> This PR should address the issue:
> https://github.com/apache/spark/pull/19082
> But after the above PR is merged, we need more performance tests against some
> code in ML to check whether this issue is actually fixed.
> This JIRA is used to track this performance issue.
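For readers unfamiliar with the one-pass pattern being compared against, here is a minimal sketch of `RDD.aggregate`-style per-column summation. This is plain Python with partitions simulated as lists (no Spark; all names here are illustrative, not Spark APIs), showing the seqOp/combOp structure that avoids generating one giant fused expression per column:

```python
from functools import reduce
import random

NUM_COLS = 100
NUM_ROWS = 1000

# Synthetic data standing in for a 100-column DoubleType dataframe.
random.seed(0)
rows = [[random.random() for _ in range(NUM_COLS)] for _ in range(NUM_ROWS)]

def seq_op(acc, row):
    # Merge one row into the running per-column sums (RDD.aggregate's seqOp).
    for i, v in enumerate(row):
        acc[i] += v
    return acc

def comb_op(a, b):
    # Merge partial-sum vectors from two partitions (RDD.aggregate's combOp).
    return [x + y for x, y in zip(a, b)]

def aggregate(partitions, zero, seq_op, comb_op):
    # Mimics RDD.aggregate: fold within each partition, then combine partials.
    partials = [reduce(seq_op, part, list(zero)) for part in partitions]
    return reduce(comb_op, partials, list(zero))

# Split the rows into 4 simulated partitions.
parts = [rows[i::4] for i in range(4)]
sums = aggregate(parts, [0.0] * NUM_COLS, seq_op, comb_op)
avgs = [s / NUM_ROWS for s in sums]
```

The key point is that the per-row work is an ordinary loop over an accumulator array, so its cost grows linearly with the column count; it never produces a single method whose size scales with the number of columns, which is where the codegen path runs into the JIT size limit.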
--
This message was sent by Atlassian JIRA (v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org