[ https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176121#comment-17176121 ]
Yuming Wang commented on SPARK-32268:
-------------------------------------

{code:scala}
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.util.sketch.BloomFilter

val N = 400000000L
val path = "/tmp/spark/bloomfilter"
spark.range(N).write.mode("overwrite").parquet(path)

val benchmark = new Benchmark(s"Benchmark build bloom filter", valuesPerIteration = N, minNumIters = 2)
Seq(10, 1000, 100000, 10000000, 30000000, 50000000, 70000000).foreach { items =>
  val df = spark.read.parquet(path).filter(s"id < ${items}")
  benchmark.addCase(s"Build bloom filter using UDAF with ${items} items") { _ =>
    df.selectExpr("build_bloom_filter(id)").collect()
  }
  benchmark.addCase(s"Collect ${items} items and build bloom filter") { _ =>
    val br = BloomFilter.create(items)
    df.collect().foreach(r => br.putLong(r.getLong(0)))
  }
}
benchmark.run()
{code}

--driver-memory 16G --conf spark.driver.maxResultSize=8G

{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Benchmark build bloom filter:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Build bloom filter using UDAF with 10 items                   561             575          16       713.1           1.4       1.0X
Collect 10 items and build bloom filter                       426             435          10       939.2           1.1       1.3X
Build bloom filter using UDAF with 1000 items                 527             554          19       758.8           1.3       1.1X
Collect 1000 items and build bloom filter                     416             425           6       961.4           1.0       1.3X
Build bloom filter using UDAF with 100000 items               564             578          12       708.8           1.4       1.0X
Collect 100000 items and build bloom filter                   452             463          16       884.5           1.1       1.2X
Build bloom filter using UDAF with 10000000 items            2247            2275          40       178.0           5.6       0.2X
Collect 10000000 items and build bloom filter                4545            4673         181        88.0          11.4       0.1X
Build bloom filter using UDAF with 30000000 items            4789            4907         167        83.5          12.0       0.1X
Collect 30000000 items and build bloom filter               34712           37529         NaN        11.5          86.8       0.0X
Build bloom filter using UDAF with 50000000 items            4677            5440        1079        85.5          11.7       0.1X
Collect 50000000 items and build bloom filter               79803           87770         NaN         5.0         199.5       0.0X
Build bloom filter using UDAF with 70000000 items            5523            5624         143        72.4          13.8       0.1X
Collect 70000000 items and build bloom filter              114387          130767        1265         3.5         286.0       0.0X
{noformat}

> Bloom Filter Join
> -----------------
>
>                 Key: SPARK-32268
>                 URL: https://issues.apache.org/jira/browse/SPARK-32268
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>         Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
> We can improve the performance of some joins by pre-filtering one side of a
> join using a Bloom filter and an IN predicate generated from the values on the
> other side of the join.
> For example: [tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
> [Before this optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
> [After this optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
>
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
> Our setup for running the TPC-DS benchmark was as follows: TPC-DS at 5 TB scale,
> partitioned Parquet tables.
>
> |Query|Default (seconds)|Bloom filter join enabled (seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
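The issue description proposes pre-filtering one side of a join with a Bloom filter built from the join keys of the other side. As a minimal, dependency-free sketch of that idea: {{SimpleBloomFilter}} below is a hypothetical stand-in for {{org.apache.spark.util.sketch.BloomFilter}} used in the benchmark, and the bit count and hash count are arbitrary illustration values.

{code:scala}
// Simplified Bloom filter: k bit positions per key, derived from two base
// hashes (Kirsch-Mitzenmacher double hashing). It may report false
// positives but never false negatives, which is what makes it safe as a
// join pre-filter: no genuinely matching row is ever dropped.
class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  private def positions(item: Long): Seq[Int] = {
    val h1 = java.lang.Long.hashCode(item)
    val h2 = java.lang.Long.hashCode(java.lang.Long.reverse(item))
    (0 until numHashes).map(i => math.abs((h1 + i * h2) % numBits))
  }

  def putLong(item: Long): Unit = positions(item).foreach(bits.set)
  def mightContainLong(item: Long): Boolean = positions(item).forall(bits.get)
}

// Build side: insert the (small) set of join keys.
val buildKeys = (0L until 1000L).map(_ * 2)
val bf = new SimpleBloomFilter(numBits = 16384, numHashes = 5)
buildKeys.foreach(bf.putLong)

// Probe side: discard rows that cannot possibly match before the real join.
val probeKeys = (0L until 4000L).toSeq
val survivors = probeKeys.filter(bf.mightContainLong)

// Every genuine match survives the pre-filter (no false negatives).
assert(buildKeys.forall(survivors.contains))
println(s"probe rows: ${probeKeys.size}, after Bloom pre-filter: ${survivors.size}")
{code}

Note that the DataFrame API already exposes a driver-side variant of this via {{df.stat.bloomFilter(colName, expectedNumItems, fpp)}}, which collects the column to build a {{BloomFilter}} on the driver; the {{build_bloom_filter}} UDAF benchmarked above instead builds the filter distributively, which is what the timings compare.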