[jira] [Commented] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen
    [ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176650#comment-16176650 ]

Weichen Xu commented on SPARK-22105:
------------------------------------

cc [~mlnick] [~cloud_fan]

> Dataframe has poor performance when computing on many columns with codegen
> --------------------------------------------------------------------------
>
>                 Key: SPARK-22105
>                 URL: https://issues.apache.org/jira/browse/SPARK-22105
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 2.3.0
>            Reporter: Weichen Xu
>            Priority: Minor
>
> Suppose we have a DataFrame with many columns (e.g. 100 columns), each of
> type DoubleType, and we need to compute avg on each column. Using DataFrame
> avg turns out to be much slower than using RDD.aggregate.
> I observed this issue in this PR (one-pass imputer):
> https://github.com/apache/spark/pull/18902
> I also wrote minimal test code to reproduce the issue, using sum for the
> reproduction:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` on 100 `DoubleType` columns, the DataFrame version is
> about 3x slower than `RDD.aggregate`, but if we compute only one column,
> the DataFrame version is much faster than `RDD.aggregate`.
> The cause is likely a defect in DataFrame codegen. Codegen inlines
> everything and generates one large code block. When the number of columns
> is large (e.g. 100 columns), the generated code becomes too large, so the
> JVM fails to JIT-compile it and falls back to bytecode interpretation.
> This PR should address the issue:
> https://github.com/apache/spark/pull/19082
> But we need more performance tests against some ML code after the above PR
> is merged, to check whether the issue is actually fixed.
> This JIRA is used to track this performance issue.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen
    [ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176655#comment-16176655 ]

Kazuaki Ishizaki commented on SPARK-22105:
------------------------------------------

Could the PR for SPARK-21871 (https://issues.apache.org/jira/browse/SPARK-21871) alleviate this issue?
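For readers following the issue description, the `RDD.aggregate` side of the comparison can be modelled roughly as below. This is a Spark-free sketch in plain Python, not code from the linked test branch: the helper `aggregate`, the names `seq_op` and `comb_op`, and the sample data are all illustrative assumptions. It only shows the shape of the computation: one accumulator array of per-column sums, a function that folds each row into it, and a function that merges the per-partition results.

```python
from functools import reduce

NUM_COLS = 100  # the column count used as an example in the issue


def seq_op(acc, row):
    """Fold one row into a partition-local accumulator of per-column sums."""
    for i, value in enumerate(row):
        acc[i] += value
    return acc


def comb_op(left, right):
    """Merge the accumulators produced by two partitions."""
    for i, value in enumerate(right):
        left[i] += value
    return left


def aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical stand-in for the seqOp/combOp contract of RDD.aggregate."""
    # Each partition starts from its own copy of the zero value.
    partials = [reduce(seq_op, part, list(zero)) for part in partitions]
    return reduce(comb_op, partials, list(zero))


# Made-up data: 2 partitions of 3 rows each, every row holding NUM_COLS doubles.
partitions = [
    [[float(r)] * NUM_COLS for r in range(3)],
    [[float(r)] * NUM_COLS for r in range(3, 6)],
]
column_sums = aggregate(partitions, [0.0] * NUM_COLS, seq_op, comb_op)
```

Note that the per-column work here is an ordinary loop over an array, so the amount of code does not grow with the number of columns.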
[jira] [Commented] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen
    [ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359420#comment-16359420 ]

Marco Gaido commented on SPARK-22105:
-------------------------------------

[~WeichenXu123] how many rows were in the dataset you tested? If there is little data, the time spent generating and compiling the code may be a significant overhead.