GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/22597

[SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates

## What changes were proposed in this pull request?

This PR aims to fix an ORC performance regression in the Spark 2.4.0 RCs relative to Spark 2.3.2. Currently, for column names containing `.`, the pushed predicates are ignored.

**Test Data**
```scala
scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
scala> df.write.mode("overwrite").orc("/tmp/orc")
```

**Spark 2.3.2**
```scala
scala> spark.sql("set spark.sql.orc.impl=native")
scala> spark.sql("set spark.sql.orc.filterPushdown=true")
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
|           1|
|           8|
+------------+

Time taken: 1486 ms

scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
|           1|
|           8|
+------------+

Time taken: 163 ms
```

**Spark 2.4.0 RC2**
```scala
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
|           1|
|           8|
+------------+

Time taken: 4087 ms

scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
|           1|
|           8|
+------------+

Time taken: 1998 ms
```

**This PR**
```scala
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
|           1|
|           8|
+------------+

Time taken: 2477 ms

scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
|           1|
|           8|
+------------+

Time taken: 253 ms
```

## How was this patch tested?

Passes Jenkins with the existing tests, plus manual performance testing.
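
The description above does not show the quoting logic itself. As a minimal sketch, assuming the fix amounts to wrapping a dotted column name in backticks before it is used to build the pushed ORC predicate, the idea might look like the following. The object `QuoteSketch` and the helper `quoteIfNeeded` are hypothetical names for illustration and are not taken from the patch.

```scala
// Illustrative sketch only: QuoteSketch and quoteIfNeeded are hypothetical
// names, not the identifiers used in the actual patch.
object QuoteSketch {
  // Wrap the name in backticks when it contains a dot and carries no
  // backtick yet; leave every other name unchanged.
  def quoteIfNeeded(name: String): String =
    if (name.contains(".") && !name.contains("`")) s"`$name`" else name

  def main(args: Array[String]): Unit = {
    println(quoteIfNeeded("col.with.dot")) // prints `col.with.dot`
    println(quoteIfNeeded("id"))           // prints id
  }
}
```

The sketch leaves names that already contain a backtick untouched, which avoids double quoting; whether the real change handles that case the same way is an assumption.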

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25579

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22597

----
commit f6c3dca65b85888392f8299cc5fc20f698c6afc5
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-10-01T04:33:04Z

    [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates

----