[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

GitBox Thu, 13 Aug 2020 13:00:54 -0700


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-673681660



   > I am also curious if you can share the Perf benchmark you are using as a Gist (ideally linked in the PR description), in addition to also reporting the aggregate CPU time?
   
   @agrawaldevesh - It's not a standalone benchmark, so it's hard to share as a Gist. I ran the query in the environment of 
[`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala).
 You can run the same benchmark as follows:
   
   1. Replace 
[`JoinBenchmark.shuffleHashJoin`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153-L167)
 with the following query, as specified in the PR description:
   
   ```
   def shuffleHashJoin(): Unit = {
     val N: Long = 4 << 22
     withSQLConf(
       SQLConf.SHUFFLE_PARTITIONS.key -> "2",
       SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
       codegenBenchmark("shuffle hash join", N) {
         // Left side: N rows keyed by the string form of id.
         val df1 = spark.range(N).selectExpr("cast(id as string) as k1")
         // Right side: N / 10 rows whose keys are multiples of 10.
         val df2 = spark.range(N / 10).selectExpr("cast(id * 10 as string) as k2")
         val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
         df.noop()
       }
     }
   }
   ```
   
   2. `JoinBenchmark` is hardcoded to compare whole-stage codegen disabled versus enabled. To compare shuffled hash join disabled versus enabled instead, change 
[`SqlBasedBenchmark.codegenBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SqlBasedBenchmark.scala#L46-L62)
 to:
   
   ```
     final def codegenBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = {
       val benchmark = new Benchmark(name, cardinality, output = output)
   
       // "off" case: prefer sort merge join over shuffled hash join.
       benchmark.addCase(s"$name off", numIters = 2) { _ =>
         withSQLConf(SQLConf.PREFER_SORTMERGEJOIN.key -> "true") {
           f
         }
       }
   
       // "on" case: do not prefer sort merge join, so shuffled hash join can be picked.
       benchmark.addCase(s"$name on", numIters = 5) { _ =>
         withSQLConf(SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
           f
         }
       }
   
       benchmark.run()
     }
   ```
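
   For reading the results, it may help to know the row counts the query in step 1 works with. The following is a rough sketch of that arithmetic in plain Scala, with no Spark dependency. The object and field names are hypothetical and not part of Spark or this PR. It assumes that casting distinct `id` values to string keeps them distinct, so string equality on `k1`/`k2` behaves like equality on the underlying numbers.

   ```scala
   // Hypothetical helper (not part of Spark or this PR): plain arithmetic only.
   object FullOuterJoinCardinality {
     val n: Long = 4L << 22                    // same value as N in the benchmark
     val leftRows: Long = n                    // spark.range(N)
     val rightRows: Long = n / 10              // spark.range(N / 10)

     // Right-side keys are id * 10 for id in [0, N / 10).
     val largestRightKey: Long = (rightRows - 1) * 10

     // Left-side keys cover [0, N). If the largest right key is below N,
     // every right row matches exactly one left row.
     val matchedRows: Long =
       if (largestRightKey < n) rightRows else (n + 9) / 10

     val leftOnlyRows: Long = leftRows - matchedRows    // left rows padded with NULL k2
     val rightOnlyRows: Long = rightRows - matchedRows  // right rows padded with NULL k1
     val outputRows: Long = matchedRows + leftOnlyRows + rightOnlyRows

     def main(args: Array[String]): Unit = {
       println(s"left=$leftRows right=$rightRows matched=$matchedRows")
       println(s"leftOnly=$leftOnlyRows rightOnly=$rightOnlyRows output=$outputRows")
     }
   }
   ```

   Under these assumptions the left side has 16777216 rows, the right side has 1677721 rows, every right row finds a match, and the full outer join emits 16777216 rows, of which 15099495 come from unmatched left rows.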
   
   
   
   


