[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972171#comment-15972171 ]
Hyukjin Kwon commented on SPARK-20364:
--------------------------------------

cc [~aash] and [~robert3005], who found this issue in https://github.com/apache/spark/pull/17667, and [~lian cheng], who might have a better idea; I think he can confirm whether the investigation here is correct and decide how to resolve it.

> Parquet predicate pushdown on columns with dots return empty results
> --------------------------------------------------------------------
>
>                 Key: SPARK-20364
>                 URL: https://issues.apache.org/jira/browse/SPARK-20364
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>
> Currently, if there are dots in the column name, predicate pushdown in Parquet seems to fail.
>
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> +--------+
> |col.dots|
> +--------+
> +--------+
> {code}
>
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +-------+
> |coldots|
> +-------+
> |      1|
> +-------+
> {code}
>
> It seems {{FilterApi}} splits a column name on dots, producing a {{ColumnPath}} with multiple path segments, whereas the actual column name is {{col.dots}}. (See [FilterApi.java#L71|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71], which calls [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
>
> I tried to come up with ways to resolve this and found two options:
>
> One is simply to not push down filters when there are dots in column names, so that Parquet reads everything and Spark filters the rows on its side.
>
> The other is to create Spark's own version of {{FilterApi}}'s column constructors (the Parquet classes seem final) so that they always use a single column path on the Spark side (this seems hacky), which works because we are not pushing down filters on nested columns currently. This way, we can get the field name via {{ColumnPath.get}} instead of {{ColumnPath.fromDotString}}.
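>
> To make the difference concrete, here is a small sanity check against parquet-mr 1.8.2's {{ColumnPath}} API (a sketch; the printed results are what I expect from reading the code, not verified output):
> {code}
> import org.apache.parquet.hadoop.metadata.ColumnPath
>
> // fromDotString splits the name on '.', so "col.dots" becomes the two
> // segments ["col", "dots"] - a nested path that does not exist in the
> // file, which is why the pushed-down filter matches no rows.
> println(ColumnPath.fromDotString("col.dots").toArray.toList)  // List(col, dots)
>
> // get takes the segments as given, keeping the dotted name as one segment.
> println(ColumnPath.get("col.dots").toArray.toList)            // List(col.dots)
> {code}
>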
> I just made a rough version of the latter below. Neither option sounds like the best to me, so please let me know if anyone has a better idea.
> {code}
> import org.apache.parquet.filter2.predicate.Operators.{Column, SupportsEqNotEq, SupportsLtGt}
> import org.apache.parquet.hadoop.metadata.ColumnPath
> import org.apache.parquet.io.api.Binary
>
> private[parquet] object ParquetColumns {
>   // Each constructor mirrors its FilterApi counterpart but builds the
>   // ColumnPath with ColumnPath.get, so the dotted name stays one segment.
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
>     new Column[Integer](ColumnPath.get(columnPath), classOf[Integer]) with SupportsLtGt
>   }
>
>   def longColumn(columnPath: String): Column[java.lang.Long] with SupportsLtGt = {
>     new Column[java.lang.Long](
>       ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>
>   def floatColumn(columnPath: String): Column[java.lang.Float] with SupportsLtGt = {
>     new Column[java.lang.Float](
>       ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with SupportsLtGt = {
>     new Column[java.lang.Double](
>       ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with SupportsEqNotEq = {
>     new Column[java.lang.Boolean](
>       ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with SupportsEqNotEq
>   }
>
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
>     new Column[Binary](ColumnPath.get(columnPath), classOf[Binary]) with SupportsLtGt
>   }
> }
> {code}
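>
> With that in place (assuming the {{ParquetColumns}} sketch above is in scope), an IS NOT NULL pushdown on the dotted column would be built the same way {{ParquetFilters}} builds it today with {{FilterApi.intColumn}}, just keeping the name as a single segment:
> {code}
> import org.apache.parquet.filter2.predicate.FilterApi
>
> // notEq(column, null) is how a Parquet IS NOT NULL predicate is encoded;
> // the column path here stays ["col.dots"], one segment, so it matches
> // the real column in the file.
> val pred = FilterApi.notEq(ParquetColumns.intColumn("col.dots"), null: Integer)
> println(pred)  // should print something like: noteq(col.dots, null)
> {code}
>
> If this direction makes sense, pushdown keeps working for dotted columns without any change to parquet-mr.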