[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

HyukjinKwon Wed, 19 Apr 2017 18:17:29 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17680#discussion_r112347873
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           // scalastyle:on nonascii
         }
       }
    +
    +  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
    +    import testImplicits._
    +
    +    Seq(true, false).foreach { vectorized =>
    +      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
    +        withTempPath { path =>
    +          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    +          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
    +        }
    +
    +        withTempPath { path =>
    +          Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    +          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1)
    +        }
    +
    +        withTempPath { path =>
    +          Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    +          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1)
    +        }
    +
    +        withTempPath { path =>
    +          Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    +          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1)
    +        }
    +
    +        withTempPath { path =>
    +          Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath)
    +          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1)
    +        }
    +
    +        withTempPath { path =>
    +          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
    +          assert(
    +            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)
    --- End diff --
    
    Actually, `IS NULL` is not the problem here.
    
    ```scala
    val path = "/tmp/abcde"
    spark.read.parquet(path).where("`col.dots` IS NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    |    null|
    +--------+
    ```
    
    As far as I know, the reason is that, since we upgraded Parquet to 1.8.2,
    it permissively produces `null` when the column does not exist. If this
    reason needs to be verified, I will look into it further. In terms of the
    output, though, the issue is not reproduced.



