GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22313

    [SPARK-25306][SQL] Use cache to speed up `createFilter`

    ## What changes were proposed in this pull request?
    
    In the ORC data source, the `createFilter` function has exponential time 
complexity because it lacks memoization, as shown below. This PR improves it.
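    The shape of the fix can be sketched as follows. This is an illustrative 
sketch only, not the actual Spark internals: `Filter`, `IsNotNull`, `And`, and 
`convertible` are hypothetical stand-ins for Spark's `sources.Filter` hierarchy 
and the conversion check inside `createFilter`. The idea is to key a mutable 
cache by filter node so each subtree is evaluated once instead of being 
revisited on every recursive pass.
    
    ```scala
    import scala.collection.mutable
    
    // Hypothetical stand-ins for org.apache.spark.sql.sources.Filter
    sealed trait Filter
    case class IsNotNull(col: String) extends Filter
    case class And(left: Filter, right: Filter) extends Filter
    
    object FilterBuilder {
      // Without memoization, a recursive "is this filter convertible?"
      // check re-walks both children of every And node on each pass,
      // so the cost doubles as the And-chain deepens. Caching the
      // per-node result makes each subtree a one-time cost.
      def convertible(
          filter: Filter,
          cache: mutable.HashMap[Filter, Boolean]): Boolean =
        cache.getOrElseUpdate(filter, filter match {
          case IsNotNull(_) => true
          case And(l, r) => convertible(l, cache) && convertible(r, cache)
        })
    }
    ```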
    
    **REPRODUCE**
    ```
    // Create and read 1 row table with 1000 columns
    sql("set spark.sql.orc.filterPushdown=true")
    val selectExpr = (1 to 1000).map(i => s"id c$i")
    spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc")
    print(s"With 0 filters, ")
    spark.time(spark.read.orc("/tmp/orc").count)
    
    // Increase the number of filters
    (20 to 30).foreach { width =>
      val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
      print(s"With $width filters, ")
      spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
    }
    ```
    
    **RESULT**
    ```
    With 0 filters, Time taken: 653 ms                                          
    
    With 20 filters, Time taken: 962 ms
    With 21 filters, Time taken: 1282 ms
    With 22 filters, Time taken: 1982 ms
    With 23 filters, Time taken: 3855 ms
    With 24 filters, Time taken: 6719 ms
    With 25 filters, Time taken: 12669 ms
    With 26 filters, Time taken: 25032 ms
    With 27 filters, Time taken: 49585 ms
    With 28 filters, Time taken: 98980 ms    // over 1 min 38 seconds
    With 29 filters, Time taken: 198368 ms   // over 3 mins
    With 30 filters, Time taken: 393744 ms   // over 6 mins
    ```
    
    **AFTER THIS PR**
    ```
    With 0 filters, Time taken: 644 ms                                          
    
    With 20 filters, Time taken: 638 ms
    With 21 filters, Time taken: 360 ms
    With 22 filters, Time taken: 590 ms
    With 23 filters, Time taken: 318 ms
    With 24 filters, Time taken: 315 ms
    With 25 filters, Time taken: 381 ms
    With 26 filters, Time taken: 304 ms
    With 27 filters, Time taken: 294 ms
    With 28 filters, Time taken: 319 ms
    With 29 filters, Time taken: 288 ms
    With 30 filters, Time taken: 285 ms
    ```
    
    ## How was this patch tested?
    
    Pass Jenkins with newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25306

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22313
    
----
commit ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-09-01T22:22:10Z

    [SPARK-25306][SQL] Use cache to speed up `createFilter`

----


---
