GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22597

    [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates

    ## What changes were proposed in this pull request?
    
    This PR aims to fix an ORC performance regression in the Spark 2.4.0 RCs relative to Spark 2.3.2. Currently, for column names containing `.`, the pushed predicates are ignored.
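    
    A minimal sketch of the quoting idea named in the title, not the code in this PR: the object `QuotingSketch` and the helper `quoteIfNeeded` are hypothetical names, and the rule (wrap a name in backticks when it contains a `.` and is not already quoted) is an assumption about what "if needed" means here.
    
    ```scala
    // Hypothetical helper, for illustration only; not taken from the patch.
    object QuotingSketch {
      // Assumed rule: quote a name with backticks when it contains a dot
      // and has no backtick in it already; otherwise return it unchanged.
      def quoteIfNeeded(name: String): String =
        if (name.contains(".") && !name.contains("`")) s"`$name`" else name
    }
    ```
    
    Under this assumption, `QuotingSketch.quoteIfNeeded("col.with.dot")` returns the name wrapped in backticks, while a plain name such as `id` is returned unchanged.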
    
    **Test Data**
    ```scala
    scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
    scala> df.write.mode("overwrite").orc("/tmp/orc")
    ```
    
    **Spark 2.3.2**
    ```scala
    scala> spark.sql("set spark.sql.orc.impl=native")
    scala> spark.sql("set spark.sql.orc.filterPushdown=true")
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 1486 ms
    
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 163 ms
    ```
    
    **Spark 2.4.0 RC2**
    ```scala
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 4087 ms
    
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 1998 ms
    ```
    
    **This PR**
    ```scala
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 2477 ms
    
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 253 ms
    ```
    
    ## How was this patch tested?
    
    Passes Jenkins with the existing tests, plus a manual performance test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25579

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22597
    
----
commit f6c3dca65b85888392f8299cc5fc20f698c6afc5
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-10-01T04:33:04Z

    [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates

----