[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/17680 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r116579008

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---

```scala
@@ -235,3 +238,39 @@ private[parquet] object ParquetFilters {
     }
   }
 }
+
+/**
+ * Note that this is a hacky workaround to allow dots in column names. Currently, the column APIs
+ * in Parquet's `FilterApi` only allow dot-separated names, so here we mirror those columns but
+ * allow only a single column path that permits dots in the names, as we don't currently push
+ * down filters with nested fields.
```

--- End diff --

We need to explain that the functions in this object `ParquetColumns` are based on the code in `org.apache.parquet.filter2.predicate`. Thus, when upgrading the Parquet version, we need to check whether they are still the same.
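To make the review point concrete, here is a minimal, Spark/Parquet-free sketch of the kind of helper the comment describes: factory methods shaped like Parquet's `FilterApi.intColumn`/`longColumn`, but keeping the whole name as a single path segment. All names here (`SingleFieldColumns`, `IntColumn`, etc.) are illustrative assumptions, not the actual `ParquetColumns` code.

```scala
// Hypothetical sketch: column factories that, unlike Parquet's FilterApi,
// do NOT dot-split the column name into nested path segments.
object SingleFieldColumns {
  final case class IntColumn(path: Seq[String])
  final case class LongColumn(path: Seq[String])

  // FilterApi.intColumn(name) would split `name` on dots; we deliberately
  // keep it as one segment so a column literally named "col.dots" survives.
  def intColumn(name: String): IntColumn = IntColumn(Seq(name))
  def longColumn(name: String): LongColumn = LongColumn(Seq(name))
}

val intCol = SingleFieldColumns.intColumn("col.dots")
println(intCol.path) // prints List(col.dots)
```

The point the reviewer makes still applies to a sketch like this: the real helpers must stay structurally in sync with `org.apache.parquet.filter2.predicate`, so they need re-checking on every Parquet upgrade.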
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r116572835

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
@@ -536,4 +537,53 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     // scalastyle:on nonascii
   }
 }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+    import testImplicits._
+
+    Seq(true, false).foreach { vectorized =>
+      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
+        withTempPath { path =>
+          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
+        }
```

--- End diff --

Instead of duplicating the code, please write a helper function.
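The suggested deduplication could look roughly like the sketch below. It is Spark-free on purpose: `writeAndCountFiltered` is a hypothetical stand-in for the real cycle (write a DataFrame to a temp path, read it back, apply the predicate, count the surviving rows), so the shape of the helper is the point, not the body.

```scala
// Hedged sketch of the review suggestion: one helper runs the same
// write-then-filter check for every data type, instead of six copy-pasted
// withTempPath blocks. The callback abstracts the Spark-specific work.
def checkDotsPushdown[T](data: Seq[T], writeAndCountFiltered: Seq[T] => Long): Unit =
  assert(writeAndCountFiltered(data) == 1L,
    s"expected exactly one matching row for $data")

// Usage with a stand-in callback that keeps only the defined values,
// mimicking an `IS NOT NULL`-style predicate:
checkDotsPushdown[Option[Int]](Seq(Some(1), None), s => s.count(_.isDefined).toLong)
```

In the real suite, each of the six cases would then collapse into one call passing its data and predicate string.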
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112531312

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     // scalastyle:on nonascii
   }
 }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+    import testImplicits._
+
+    Seq(true, false).foreach { vectorized =>
+      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
+        withTempPath { path =>
+          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(
+            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)
```

--- End diff --

thanks for adding the additional test below
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112529697

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---

```scala
@@ -487,6 +487,20 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
   }
 }
+
+  test("no filter puwhdown for nested field access") {
```

--- End diff --

nit: pushdown
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112530989

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
```

--- End diff --

thanks for the change -- I wasn't sure whether predicate pushdown worked on nested columns, and the newly added negative test confirms that it does not.
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112347873

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
+        // ... (cases for Int, Long, Float, Double, and Boolean elided) ...
+        withTempPath { path =>
+          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(
+            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)
```

--- End diff --

Actually, `IS NULL` is not the problem here.

```scala
val path = "/tmp/abcde"
spark.read.parquet(path).where("`col.dots` IS NULL").show()
```

```
+--------+
|col.dots|
+--------+
|    null|
+--------+
```

The reason, AFAIK, is that Parquet permissively produces `null` when the column does not exist, after we upgraded it to 1.8.2. If this reasoning needs to be verified, I will look further. But in terms of the output, the issue is not reproduced for `IS NULL`.
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112288596

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
```

--- End diff --

To my knowledge, we don't push down filters with nested columns. Let me check whether we already have the negative case explicitly, and add it if it is missing.
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112285677

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
```

--- End diff --

is there another existing test that checks that pushdown for `struct.field1` syntax works correctly? I'm not sure how to reference inner fields of a struct column, as I don't use them much personally, but I want to make sure that's not broken as a result of this change.
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112285883

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---

```scala
+        // ... (cases for Int, Long, Float, Double, and Boolean elided) ...
+        withTempPath { path =>
+          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(
+            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)
```

--- End diff --

please also do the check for `IS NULL` having 1 row too, so this is symmetric
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17680

[SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots

## What changes were proposed in this pull request?

Currently, if there are dots in the column name, predicate pushdown fails in Parquet.

**With dots**

```
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```

**Without dots**

```
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```

It seems a dot in a column name passed to `FilterApi` is treated as a field separator, producing a `ColumnPath` with multiple path segments, whereas the actual column name is `col.dots`. (See [FilterApi.java#L71](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71), which calls [ColumnPath.java#L44](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44).)

I tried to come up with ways to resolve it, and I arrived at the two below:

- One is simply not to push down filters when there are dots in column names, so that Spark reads everything and filters on its side.
- The other creates Spark's own versions of the columns in Parquet's `FilterApi` (the originals appear to be final) that always use a single column path on the Spark side (this seems hacky), since we do not currently push down nested columns. This way, we can build the field name via `ColumnPath.get` instead of `ColumnPath.fromDotString`.

This PR proposes the latter way, because I think we need to be sure that it passes the tests.
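The root cause described above can be illustrated without Spark or Parquet. The two functions below are stand-ins with (what I believe are) the same path semantics as Parquet's `ColumnPath.fromDotString` and `ColumnPath.get`; they are not the real API.

```scala
// Stand-in for ColumnPath.fromDotString: splits the name on dots,
// so a flat column named "col.dots" is misread as a nested path.
def fromDotString(name: String): List[String] = name.split("\\.").toList

// Stand-in for ColumnPath.get: keeps the segments it is given as-is.
def get(parts: String*): List[String] = parts.toList

println(fromDotString("col.dots")) // prints List(col, dots): misread as nested field access
println(get("col.dots"))           // prints List(col.dots): one flat column, as intended
```

This is why the dot-splitting path never matches a Parquet column literally named `col.dots`, and the row-group filter silently drops every row.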
**After**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```

## How was this patch tested?

Existing tests should cover this. Some tests were additionally added in `ParquetFilterSuite.scala`. Manually, I ran the related tests, and Jenkins tests will cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-20364

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17680.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17680

commit 297c70bd47cb859d343345849776bd7c21396078
Author: hyukjinkwon
Date: 2017-04-19T06:24:15Z

    Parquet predicate pushdown on columns with dots return empty results