[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051073#comment-17051073 ]
Maxim Gekk commented on SPARK-30563:
------------------------------------

> we spend a lot of time in this loop even

The loop just forces materialization of the joined rows. With df.groupBy().count(), it seems you skip some steps of the join. I think in most cases users need the results of the join, not just a count on top of it.

> Regressions in Join benchmarks
> ------------------------------
>
>                 Key: SPARK-30563
>                 URL: https://issues.apache.org/jira/browse/SPARK-30563
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Regenerated benchmark results in
> https://github.com/apache/spark/pull/27078 show many regressions in
> JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see
> old results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
> new results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10
> One of the differences between the queries is the use of the `NoOp`
> datasource in the new queries.
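
To illustrate the point in the comment about materializing joined rows versus only counting them, here is a toy sketch in plain Python. It is not Spark code and makes no claim about how Spark executes or optimizes joins. The function names (build_hash_table, join_materialize, join_count_only) and the sample data are hypothetical and exist only for this illustration. The assumption is simply that a consumer that needs the joined rows has to build each output row, while a count-only consumer can skip that step.

```python
from collections import defaultdict


def build_hash_table(rows, key):
    """Group the rows of the build side by their join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def join_materialize(left, right, key):
    """Toy inner hash join that builds every joined row."""
    table = build_hash_table(right, key)
    joined = []
    for l_row in left:
        for r_row in table.get(l_row[key], ()):
            # The per-row work a count-only consumer never pays for:
            # a new combined row is allocated for each match.
            joined.append({**l_row, **r_row})
    return joined


def join_count_only(left, right, key):
    """Toy inner hash join that only counts matches, building no rows."""
    table = build_hash_table(right, key)
    total = 0
    for l_row in left:
        total += len(table.get(l_row[key], ()))
    return total


# Hypothetical sample data: every left row matches exactly one right row.
left = [{"k": i % 3, "a": i} for i in range(6)]
right = [{"k": i, "b": i * 10} for i in range(3)]

rows = join_materialize(left, right, "k")
count = join_count_only(left, right, "k")
```

Both functions agree on the number of matches, but only the first one pays for constructing the output rows. A benchmark that measures just the count can therefore look faster than one that consumes the full join result.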