c21 commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-673681660
> I am also curious if you can share the Perf benchmark you are using as a Gist (ideally linked in the PR description) in addition to please also reporting the aggregate CPU time ?

@agrawaldevesh - It's not a standalone benchmark, so it's hard to share as a Gist. I ran the query in the environment of [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala). You can run the same benchmark as follows:

1. Change [`JoinBenchmark.shuffleHashJoin`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153-L167) to use the following query, as specified in the PR description:

```
def shuffleHashJoin(): Unit = {
  val N: Long = 4 << 22
  withSQLConf(
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
    codegenBenchmark("shuffle hash join", N) {
      // String join keys; every k2 value (a multiple of 10) also appears in k1.
      val df1 = spark.range(N).selectExpr(s"cast(id as string) as k1")
      val df2 = spark.range(N / 10).selectExpr(s"cast(id * 10 as string) as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      df.noop()
    }
  }
}
```

2. `JoinBenchmark` is hardcoded to compare runs with whole-stage code-gen disabled and enabled. We need it to compare runs with shuffled hash join disabled and enabled instead, so change [`SqlBasedBenchmark.codegenBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SqlBasedBenchmark.scala#L46-L62) to:

```
final def codegenBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = {
  val benchmark = new Benchmark(name, cardinality, output = output)

  // "off": prefer sort merge join, so shuffled hash join is not chosen.
  benchmark.addCase(s"$name off", numIters = 2) { _ =>
    withSQLConf(SQLConf.PREFER_SORTMERGEJOIN.key -> "true") {
      f
    }
  }

  // "on": allow the planner to pick shuffled hash join.
  benchmark.addCase(s"$name on", numIters = 5) { _ =>
    withSQLConf(SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
      f
    }
  }

  benchmark.run()
}
```
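
For a quick sanity check on the shape of the benchmark query, here is a minimal sketch in plain Scala (no Spark) that models the two key sets and counts the rows a full outer join on them would produce. `FullOuterJoinCardinality`, `fullOuterCounts`, and the scaled-down `n = 1000` are illustrative names and values, not part of Spark or this PR, and the sketch relies on the keys being unique on both sides, as they are in the query.

```scala
// Scaled-down model of the benchmark's join keys; plain Scala collections only.
object FullOuterJoinCardinality {

  /** Returns (matched, leftOnly, rightOnly) row counts for a full outer join
    * of k1 = id for id in [0, n) and k2 = id * 10 for id in [0, n / 10),
    * with both sides cast to string as in the benchmark query.
    */
  def fullOuterCounts(n: Long): (Long, Long, Long) = {
    val k1 = (0L until n).map(_.toString).toSet
    val k2 = (0L until n / 10).map(id => (id * 10).toString).toSet
    // Keys are unique on each side, so every shared key yields exactly one joined row.
    val matched = k1.intersect(k2).size.toLong
    (matched, k1.size - matched, k2.size - matched)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000L // hypothetical stand-in for N = 4 << 22
    val (matched, leftOnly, rightOnly) = fullOuterCounts(n)
    val total = matched + leftOnly + rightOnly
    println(s"matched=$matched leftOnly=$leftOnly rightOnly=$rightOnly total=$total")
  }
}
```

With `n = 1000` this prints `matched=100 leftOnly=900 rightOnly=0 total=1000`. By the same reasoning, with `N = 4 << 22 = 16777216` the query should emit 16777216 rows: 1677721 matched rows plus 15099495 rows from `df1` with no match in `df2`, and no unmatched `df2` rows.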