[ https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176121#comment-17176121 ]
Yuming Wang commented on SPARK-32268:
-------------------------------------

{code:scala}
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.util.sketch.BloomFilter

val N = 400000000L
val path = "/tmp/spark/bloomfilter"
spark.range(N).write.mode("overwrite").parquet(path)

val benchmark = new Benchmark(s"Benchmark build bloom filter", valuesPerIteration = N, minNumIters = 2)
Seq(10, 1000, 100000, 10000000, 30000000, 50000000, 70000000).foreach { items =>
  val df = spark.read.parquet(path).filter(s"id < ${items}")
  benchmark.addCase(s"Build bloom filter using UDAF with ${items} items") { _ =>
    df.selectExpr("build_bloom_filter(id)").collect()
  }
  benchmark.addCase(s"Collect ${items} items and build bloom filter") { _ =>
    val br = BloomFilter.create(items)
    df.collect().foreach(r => br.putLong(r.getLong(0)))
  }
}
benchmark.run()
{code}

--driver-memory 16G --conf spark.driver.maxResultSize=8G

{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Benchmark build bloom filter:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Build bloom filter using UDAF with 10 items                   561             575          16       713.1           1.4       1.0X
Collect 10 items and build bloom filter                       426             435          10       939.2           1.1       1.3X
Build bloom filter using UDAF with 1000 items                 527             554          19       758.8           1.3       1.1X
Collect 1000 items and build bloom filter                     416             425           6       961.4           1.0       1.3X
Build bloom filter using UDAF with 100000 items               564             578          12       708.8           1.4       1.0X
Collect 100000 items and build bloom filter                   452             463          16       884.5           1.1       1.2X
Build bloom filter using UDAF with 10000000 items            2247            2275          40       178.0           5.6       0.2X
Collect 10000000 items and build bloom filter                4545            4673         181        88.0          11.4       0.1X
Build bloom filter using UDAF with 30000000 items            4789            4907         167        83.5          12.0       0.1X
Collect 30000000 items and build bloom filter               34712           37529         NaN        11.5          86.8       0.0X
Build bloom filter using UDAF with 50000000 items            4677            5440        1079        85.5          11.7       0.1X
Collect 50000000 items and build bloom filter               79803           87770         NaN         5.0         199.5       0.0X
Build bloom filter using UDAF with 70000000 items            5523            5624         143        72.4          13.8       0.1X
Collect 70000000 items and build bloom filter              114387          130767        1265         3.5         286.0       0.0X
{noformat}

> Bloom Filter Join
> -----------------
>
>                 Key: SPARK-32268
>                 URL: https://issues.apache.org/jira/browse/SPARK-32268
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>         Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
> We can improve the performance of some joins by pre-filtering one side of a
> join using a Bloom filter and an IN predicate generated from the values on the
> other side of the join.
> For example: [tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
> [Before this optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
> [After this optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
>
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
> Our setup for running the TPC-DS benchmark was as follows: TPC-DS at 5 TB scale,
> partitioned Parquet tables.
>
> |Query|Default (seconds)|Bloom filter join enabled (seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
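The issue description proposes pre-filtering one side of a join with a Bloom filter built from the join keys of the other side. As a minimal, dependency-free sketch of that idea: {{SimpleBloomFilter}} below is a hypothetical stand-in for {{org.apache.spark.util.sketch.BloomFilter}} used in the benchmark, and the bit count and hash count are arbitrary illustration values.

{code:scala}
// Simplified Bloom filter: k bit positions per key, derived from two base
// hashes (Kirsch-Mitzenmacher double hashing). It may report false
// positives but never false negatives, which is what makes it safe as a
// join pre-filter: no genuinely matching row is ever dropped.
class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  private def positions(item: Long): Seq[Int] = {
    val h1 = java.lang.Long.hashCode(item)
    val h2 = java.lang.Long.hashCode(java.lang.Long.reverse(item))
    (0 until numHashes).map(i => math.abs((h1 + i * h2) % numBits))
  }

  def putLong(item: Long): Unit = positions(item).foreach(bits.set)
  def mightContainLong(item: Long): Boolean = positions(item).forall(bits.get)
}

// Build side: insert the (small) set of join keys.
val buildKeys = (0L until 1000L).map(_ * 2)
val bf = new SimpleBloomFilter(numBits = 16384, numHashes = 5)
buildKeys.foreach(bf.putLong)

// Probe side: discard rows that cannot possibly match before the real join.
val probeKeys = (0L until 4000L).toSeq
val survivors = probeKeys.filter(bf.mightContainLong)

// Every genuine match survives the pre-filter (no false negatives).
assert(buildKeys.forall(survivors.contains))
println(s"probe rows: ${probeKeys.size}, after Bloom pre-filter: ${survivors.size}")
{code}

Note that the DataFrame API already exposes a driver-side variant of this via {{df.stat.bloomFilter(colName, expectedNumItems, fpp)}}, which collects the column to build a {{BloomFilter}} on the driver; the {{build_bloom_filter}} UDAF benchmarked above instead builds the filter distributively, which is what the timings compare.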