[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972171#comment-15972171 ]
Hyukjin Kwon commented on SPARK-20364:
--------------------------------------

cc [~aash] and [~robert3005], who found this issue in https://github.com/apache/spark/pull/17667, and [~lian cheng], who might have a better idea; I think he can confirm whether the investigation here is correct and decide how to resolve it.

> Parquet predicate pushdown on columns with dots return empty results
> --------------------------------------------------------------------
>
>                 Key: SPARK-20364
>                 URL: https://issues.apache.org/jira/browse/SPARK-20364
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>
> Currently, if there are dots in the column name, predicate pushdown in Parquet seems to fail.
>
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> +--------+
> |col.dots|
> +--------+
> +--------+
> {code}
>
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +-------+
> |coldots|
> +-------+
> |      1|
> +-------+
> {code}
>
> It seems {{FilterApi}} splits a column name on dots, producing a {{ColumnPath}} with multiple path segments, whereas the actual column name is {{col.dots}}. (See [FilterApi.java#L71|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71], which calls [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
>
> I tried to come up with ways to resolve this and found two options:
>
> One is simply to not push down filters when there are dots in column names, so that Parquet reads everything and Spark filters the rows on its side.
>
> The other is to create Spark's own version of {{FilterApi}}'s column constructors (the Parquet classes seem final) so that they always use a single column path on the Spark side (this seems hacky), which works because we are not pushing down filters on nested columns currently. This way, we can get the field name via {{ColumnPath.get}} instead of {{ColumnPath.fromDotString}}.
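>
> To make the difference concrete, here is a small sanity check against parquet-mr 1.8.2's {{ColumnPath}} API (a sketch; the printed results are what I expect from reading the code, not verified output):
> {code}
> import org.apache.parquet.hadoop.metadata.ColumnPath
>
> // fromDotString splits the name on '.', so "col.dots" becomes the two
> // segments ["col", "dots"] - a nested path that does not exist in the
> // file, which is why the pushed-down filter matches no rows.
> println(ColumnPath.fromDotString("col.dots").toArray.toList)  // List(col, dots)
>
> // get takes the segments as given, keeping the dotted name as one segment.
> println(ColumnPath.get("col.dots").toArray.toList)            // List(col.dots)
> {code}
>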
> I just made a rough version of the latter below. Neither option sounds like the best to me, so please let me know if anyone has a better idea.
> {code}
> import org.apache.parquet.filter2.predicate.Operators.{Column, SupportsEqNotEq, SupportsLtGt}
> import org.apache.parquet.hadoop.metadata.ColumnPath
> import org.apache.parquet.io.api.Binary
>
> private[parquet] object ParquetColumns {
>   // Each constructor mirrors its FilterApi counterpart but builds the
>   // ColumnPath with ColumnPath.get, so the dotted name stays one segment.
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
>     new Column[Integer](ColumnPath.get(columnPath), classOf[Integer]) with SupportsLtGt
>   }
>
>   def longColumn(columnPath: String): Column[java.lang.Long] with SupportsLtGt = {
>     new Column[java.lang.Long](
>       ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>
>   def floatColumn(columnPath: String): Column[java.lang.Float] with SupportsLtGt = {
>     new Column[java.lang.Float](
>       ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with SupportsLtGt = {
>     new Column[java.lang.Double](
>       ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with SupportsEqNotEq = {
>     new Column[java.lang.Boolean](
>       ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with SupportsEqNotEq
>   }
>
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
>     new Column[Binary](ColumnPath.get(columnPath), classOf[Binary]) with SupportsLtGt
>   }
> }
> {code}
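>
> With that in place (assuming the {{ParquetColumns}} sketch above is in scope), an IS NOT NULL pushdown on the dotted column would be built the same way {{ParquetFilters}} builds it today with {{FilterApi.intColumn}}, just keeping the name as a single segment:
> {code}
> import org.apache.parquet.filter2.predicate.FilterApi
>
> // notEq(column, null) is how a Parquet IS NOT NULL predicate is encoded;
> // the column path here stays ["col.dots"], one segment, so it matches
> // the real column in the file.
> val pred = FilterApi.notEq(ParquetColumns.intColumn("col.dots"), null: Integer)
> println(pred)  // should print something like: noteq(col.dots, null)
> {code}
>
> If this direction makes sense, pushdown keeps working for dotted columns without any change to parquet-mr.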