[ 
https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166127#comment-17166127
 ] 

Yuming Wang commented on SPARK-32268:
-------------------------------------

A simple benchmark that compares filtering with a Bloom filter, an IN predicate, and a binary (range) comparison:
{code:scala}
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.util.sketch.BloomFilter

val N = 100000000
val items = 1 to 200000
val path = "/tmp/spark/bloomfilter"
spark.range(N).write.mode("overwrite").parquet(path)

val benchmark = new Benchmark(
  s"Benchmark bloom filter with ${items.size} items",
  valuesPerIteration = N,
  minNumIters = 5)

// Probe a Bloom filter built from `items` for every row.
benchmark.addCase("bloom filter") { _ =>
  val br = BloomFilter.create(items.size)
  items.foreach(br.putLong(_))
  spark.read.parquet(path)
    .filter(r => br.mightContainLong(r.getLong(0)))
    .write.format("noop").mode("overwrite").save()
}

// Use an IN predicate that lists every item.
benchmark.addCase("in") { _ =>
  spark.read.parquet(path)
    .filter(s"id in (${items.mkString(", ")})")
    .write.format("noop").mode("overwrite").save()
}

// Range check against the smallest and largest item only.
benchmark.addCase("binary comparison") { _ =>
  spark.read.parquet(path)
    .filter(r => r.getLong(0) <= items.last && r.getLong(0) >= 1)
    .write.format("noop").mode("overwrite").save()
}

// Range check first, then the Bloom filter probe for rows inside the range.
benchmark.addCase("binary comparison and bloom filter") { _ =>
  val br = BloomFilter.create(items.size)
  items.foreach(br.putLong(_))
  spark.read.parquet(path)
    .filter(r => r.getLong(0) <= items.last && r.getLong(0) >= 1 &&
      br.mightContainLong(r.getLong(0)))
    .write.format("noop").mode("overwrite").save()
}
benchmark.run()
{code}


{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.5
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark bloom filter with 200000 items:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
bloom filter                                       4057           4139          75         24.6          40.6       1.0X
in                                                47276          51586         953          2.1         472.8       0.1X
binary comparison                                  1911           2086         231         52.3          19.1       2.1X
binary comparison and bloom filter                 1959           2096         160         51.0          19.6       2.1X

{noformat}
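
As a cross-check on how to read the table, the Rate, Per Row, and Relative columns are consistent with being derived from Best Time and N = 100000000 rows per iteration. A minimal sketch in plain Scala; the object and helper names are hypothetical and are not part of Spark's Benchmark class:

```scala
// Recomputes the derived columns of the table from Best Time(ms).
// Assumption: the columns are based on the best time, which matches the reported numbers.
object BenchmarkColumns {
  val n = 100000000L // rows scanned per iteration (N in the benchmark)

  // Best Time(ms) values copied from the results above.
  val bestMs: Seq[(String, Long)] = Seq(
    "bloom filter" -> 4057L,
    "in" -> 47276L,
    "binary comparison" -> 1911L,
    "binary comparison and bloom filter" -> 1959L)

  // Millions of rows processed per second.
  def rateMPerSec(ms: Long): Double = n / (ms / 1000.0) / 1e6

  // Nanoseconds spent per row.
  def perRowNs(ms: Long): Double = ms * 1e6 / n

  // Speed relative to the first case (values above 1 are faster).
  def relative(ms: Long, baselineMs: Long): Double = baselineMs.toDouble / ms

  def main(args: Array[String]): Unit = {
    val baseline = bestMs.head._2
    bestMs.foreach { case (name, ms) =>
      println(f"$name%-36s ${rateMPerSec(ms)}%6.1f M/s ${perRowNs(ms)}%7.1f ns ${relative(ms, baseline)}%4.1fX")
    }
  }
}
```

For example, the IN case needs 47276 ms for 100000000 rows, about 472.8 ns per row, which is roughly 11.7 times slower than the Bloom filter case.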



> Bloom Filter Join
> -----------------
>
>                 Key: SPARK-32268
>                 URL: https://issues.apache.org/jira/browse/SPARK-32268
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>         Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
>
> We can improve the performance of some joins by pre-filtering one side of a
> join using a Bloom filter and an IN predicate generated from the values on the
> other side of the join.
>  For example: [tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
>  [Before this optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
>  [After this optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
>
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
>  Our setup for running the TPC-DS benchmark was as follows: TPC-DS 5T with
> partitioned Parquet tables.
>
> |Query|Default (seconds)|Bloom Filter Join enabled (seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|
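
The pre-filtering idea in the description above can be emulated by hand in spark-shell. This is only a sketch, not the optimizer rule proposed in this issue: the table sizes, the column name, and the 3% false-positive rate are assumptions chosen for illustration, and `spark` is the session that spark-shell provides.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical data: a large side and a small side that share a join key "id".
val large = spark.range(1000000L).toDF("id")
val small = spark.range(0L, 1000000L, 1000L).toDF("id") // 1000 keys

// Build a Bloom filter over the join keys of the small side.
val bf = small.stat.bloomFilter("id", 1000L, 0.03)

// Drop rows of the large side that cannot match. A Bloom filter has no
// false negatives, so every matching row survives; some non-matching rows
// may survive as false positives and are removed by the join itself.
val mightMatch = udf((id: Long) => bf.mightContainLong(id))
val prefiltered = large.filter(mightMatch(col("id")))

val joined = prefiltered.join(small, "id")
```

The join result is the same as joining `large` directly; the filter only reduces the number of rows that reach the join.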


