Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22573#discussion_r221476951
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
    @@ -437,53 +436,65 @@ object DataSourceStrategy {
        * @return a `Some[Filter]` if the input [[Expression]] is convertible, otherwise a `None`.
        */
       protected[sql] def translateFilter(predicate: Expression): Option[Filter] = {
    +    // Recursively try to find an attribute name from the top level that can be pushed down.
    +    def attrName(e: Expression): Option[String] = e match {
    +      // In Spark and many data sources such as parquet, dots are used as a column path delimiter;
    +      // thus, we don't translate such expressions.
    +      case a: Attribute if !a.name.contains(".") =>
    +        Some(a.name)
    --- End diff ---
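
    For context, the translation this guard feeds into has roughly the following shape. This is a simplified, hypothetical sketch of just the equality case, not the actual `translateFilter` body:

    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Attribute, EqualTo, Expression, Literal}
    import org.apache.spark.sql.sources

    // Sketch only: map a Catalyst equality over a plain top-level attribute to a
    // data source filter, and skip dotted names because data sources treat '.'
    // as a nested-field path delimiter. The real method handles many more
    // predicate shapes and converts Catalyst values to Scala values.
    def translateEquality(predicate: Expression): Option[sources.Filter] =
      predicate match {
        case EqualTo(a: Attribute, Literal(value, _)) if !a.name.contains(".") =>
          Some(sources.EqualTo(a.name, value))
        case _ => None
      }
    ```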
    
    > I know ORC doesn't work for now. We can have another followup PR to address this.
    
    Yes, @dbtsai. This PR introduces a regression at least on ORC. The following is the ORC result in Spark 2.3.2; with this change, ORC will slow down at least 5x, just like Parquet.
    
    ```scala
    scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
    scala> df.write.mode("overwrite").orc("/tmp/orc")
    scala> df.write.mode("overwrite").parquet("/tmp/parquet")
    scala> spark.sql("set spark.sql.orc.impl=native")
    scala> spark.sql("set spark.sql.orc.filterPushdown=true")
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` = 50000").count)
    Time taken: 803 ms
    
    scala> spark.time(spark.read.parquet("/tmp/parquet").where("`col.with.dot` = 50000").count)
    Time taken: 5573 ms
    
    scala> spark.version
    res6: String = 2.3.2
    ```
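
    As an aside on why the dot is ambiguous in the first place: without backticks, Spark parses `a.b` as field `b` of a struct column `a`, while backticks select a top-level column literally named `a.b`. A minimal, self-contained sketch (hypothetical demo, not code from this PR):

    ```scala
    import org.apache.spark.sql.SparkSession

    object DotAmbiguityDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]").getOrCreate()
        import spark.implicits._

        // A top-level column whose name contains a dot, as in the repro above.
        val flat = Seq(1, 2).toDF("a.b")
        flat.where("`a.b` = 1").show()   // backticks: the literal column name

        // A struct column: the same dotted syntax now means nested field access.
        val nested = spark.sql("SELECT named_struct('b', 1) AS a")
        nested.where("a.b = 1").show()   // no backticks: struct field path

        spark.stop()
      }
    }
    ```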

