[ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176650#comment-16176650 ]
Weichen Xu commented on SPARK-22105:
------------------------------------

cc [~mlnick] [~cloud_fan]

> Dataframe has poor performance when computing on many columns with codegen
> --------------------------------------------------------------------------
>
>                 Key: SPARK-22105
>                 URL: https://issues.apache.org/jira/browse/SPARK-22105
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 2.3.0
>            Reporter: Weichen Xu
>            Priority: Minor
>
> Suppose we have a dataframe with many columns (e.g. 100 columns), each column of DoubleType,
> and we need to compute the avg of each column. We will find that using dataframe avg
> is much slower than using RDD.aggregate.
> I observed this issue in this PR (one-pass imputer):
> https://github.com/apache/spark/pull/18902
> I also wrote minimal test code to reproduce this issue, using a sum computation:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` over 100 `DoubleType` columns, dataframe avg is
> about 3x slower than `RDD.aggregate`, but if we compute only one column,
> dataframe avg is much faster than `RDD.aggregate`.
> The cause of this issue should be a defect in dataframe codegen. Codegen
> inlines everything and generates one large code block. When the column count
> is large (e.g. 100 columns), the generated method becomes too large, which causes the
> JVM to refuse to JIT-compile it and fall back to bytecode interpretation.
> This PR should address the issue:
> https://github.com/apache/spark/pull/19082
> But after the above PR is merged, we need more performance tests against some
> code in ML to check whether this issue is actually fixed.
> This JIRA is used to track this performance issue.
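For readers unfamiliar with the one-pass pattern being compared against, here is a minimal sketch of `RDD.aggregate`-style per-column summation. This is plain Python with partitions simulated as lists (no Spark; all names here are illustrative, not Spark APIs), showing the seqOp/combOp structure that avoids generating one giant fused expression per column:

```python
from functools import reduce
import random

NUM_COLS = 100
NUM_ROWS = 1000

# Synthetic data standing in for a 100-column DoubleType dataframe.
random.seed(0)
rows = [[random.random() for _ in range(NUM_COLS)] for _ in range(NUM_ROWS)]

def seq_op(acc, row):
    # Merge one row into the running per-column sums (RDD.aggregate's seqOp).
    for i, v in enumerate(row):
        acc[i] += v
    return acc

def comb_op(a, b):
    # Merge partial-sum vectors from two partitions (RDD.aggregate's combOp).
    return [x + y for x, y in zip(a, b)]

def aggregate(partitions, zero, seq_op, comb_op):
    # Mimics RDD.aggregate: fold within each partition, then combine partials.
    partials = [reduce(seq_op, part, list(zero)) for part in partitions]
    return reduce(comb_op, partials, list(zero))

# Split the rows into 4 simulated partitions.
parts = [rows[i::4] for i in range(4)]
sums = aggregate(parts, [0.0] * NUM_COLS, seq_op, comb_op)
avgs = [s / NUM_ROWS for s in sums]
```

The key point is that the per-row work is an ordinary loop over an accumulator array, so its cost grows linearly with the column count; it never produces a single method whose size scales with the number of columns, which is where the codegen path runs into the JIT size limit.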
--
This message was sent by Atlassian JIRA (v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org