GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17680
[SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots

## What changes were proposed in this pull request?

Currently, if a column name contains dots, predicate pushdown fails for Parquet.

**With dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```

**Without dots**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```

The cause is that `FilterApi` treats a dot in a column name as a field separator, producing a `ColumnPath` with multiple path segments, whereas the actual column name is the single flat name `col.dots`. (See [FilterApi.java#L71](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71), which calls [ColumnPath.java#L44](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44).)

I considered two ways to resolve this:

- Simply do not push down filters for columns whose names contain dots, so that Spark reads all rows and filters them on the Spark side.
- Create Spark's own version of `FilterApi` for such columns (the Parquet one appears to be final) that always builds a single-segment column path, which is possible because Spark does not currently push down filters on nested columns. This way the field name is obtained via `ColumnPath.get` rather than `ColumnPath.fromDotString` (this is somewhat hacky).

This PR proposes the latter approach, because it keeps the filters pushed down, and we can be sure it is correct by passing the tests.
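The mismatch described above can be illustrated with a short Python mimic. This is only a sketch of the parquet-mr behavior, not its actual code: `from_dot_string` and `get` below are hypothetical stand-ins for `ColumnPath.fromDotString` and `ColumnPath.get`.

```python
# Hypothetical mimic of parquet-mr's ColumnPath construction (not the real API).

def from_dot_string(name):
    # FilterApi builds the column path by splitting on dots, so the flat
    # column name "col.dots" becomes two nested segments: ("col", "dots").
    return tuple(name.split("."))

def get(*segments):
    # ColumnPath.get takes the segments as given: one flat column name.
    return tuple(segments)

# The Parquet file actually contains a single flat column named "col.dots":
file_columns = {get("col.dots")}

# The pushed-down filter looks up the dot-split path, which never matches
# the flat column, so no rows are returned.
assert from_dot_string("col.dots") not in file_columns
assert get("col.dots") in file_columns
```

With the proposed change, Spark would construct the path the way `get` does here, so the pushed-down filter matches the flat column as expected.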
**After**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```

## How was this patch tested?

Existing tests should cover this, and some tests were additionally added in `ParquetFilterSuite.scala`. I ran the related tests manually; the Jenkins tests will also cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-20364

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17680.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17680

----

commit 297c70bd47cb859d343345849776bd7c21396078
Author: hyukjinkwon <gurwls...@gmail.com>
Date: 2017-04-19T06:24:15Z

    Parquet predicate pushdown on columns with dots return empty results