[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972400#comment-15972400 ]

Robert Kruszewski edited comment on SPARK-20364 at 4/18/17 9:36 AM:
--------------------------------------------------------------------

It looks like Parquet doesn't differentiate between a column named "a.b" and a 
field "b" in a column "a". We could always treat them the same, but that might 
lead to some surprises down the road. Ideally Parquet would let us construct a 
filter given either an exact column name or dot-separated accessors. I think 
what [~aash] proposes is roughly how we should fix it, with the caveat that 
eventually this should live in Parquet, imho.
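
To make the ambiguity concrete, here is a minimal sketch against parquet-mr's 
{{ColumnPath}} API (assuming parquet-mr 1.8.x on the classpath):

{code}
import org.apache.parquet.hadoop.metadata.ColumnPath

// fromDotString splits on dots: "a.b" becomes the two-segment path
// ["a", "b"], i.e. a field "b" nested inside a group "a".
val nested = ColumnPath.fromDotString("a.b")
println(nested.toArray.length)  // 2

// get treats its argument as a single segment: one top-level column
// whose literal name is "a.b".
val flat = ColumnPath.get("a.b")
println(flat.toArray.length)    // 1

// Both render as "a.b" via toDotString, which is why the two cases are
// indistinguishable once a filter has been built.
{code}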



> Parquet predicate pushdown on columns with dots return empty results
> --------------------------------------------------------------------
>
>                 Key: SPARK-20364
>                 URL: https://issues.apache.org/jira/browse/SPARK-20364
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>
> Currently, if there are dots in a column name, predicate pushdown seems 
> to fail in Parquet.
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> +--------+
> |col.dots|
> +--------+
> +--------+
> {code}
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +-------+
> |coldots|
> +-------+
> |      1|
> +-------+
> {code}
> It seems {{FilterApi}} splits the given column name on dots (producing a 
> {{ColumnPath}} with multiple path segments) whereas the actual column name 
> is {{col.dots}}. (See 
> [FilterApi.java#L71|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71],
> which calls 
> [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
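> As a rough illustration of what happens (a sketch, not the exact Spark code 
> path; the predicate construction below is simplified):
> {code}
> import org.apache.parquet.filter2.predicate.FilterApi
>
> // FilterApi.intColumn calls ColumnPath.fromDotString internally, so the
> // resulting path is ["col", "dots"]: a field "dots" nested in a group
> // "col". No such column exists in the file, which is why the scan in
> // the example above returns empty results.
> val column = FilterApi.intColumn("col.dots")
> val predicate = FilterApi.notEq(column, null.asInstanceOf[Integer])
> {code}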
> I tried to come up with ways to resolve this, and came up with the two 
> below:
> One is simply not to push down filters when there are dots in the column 
> names, so that Spark reads everything and filters on the Spark side.
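> A minimal sketch of that option (a hypothetical guard, not actual Spark 
> code, assuming the source-filter conversion checks it before pushing down):
> {code}
> import org.apache.spark.sql.sources
>
> // Returns true when a filter is safe to push down to Parquet, i.e. no
> // referenced column name contains a dot that parquet-mr would split.
> def canPushDown(predicate: sources.Filter): Boolean =
>   !predicate.references.exists(_.contains("."))
> {code}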
> The other is to create Spark's own versions of the {{FilterApi}} column 
> constructors (the Parquet ones appear final) so that a single column path 
> is always used on the Spark side (this seems hacky), which is fine as we 
> are not pushing down nested columns currently. This way, the field name is 
> obtained via {{ColumnPath.get}} instead of {{ColumnPath.fromDotString}}.
> I just made a rough version of the latter. 
> {code}
> import org.apache.parquet.filter2.predicate.Operators.{Column, SupportsEqNotEq, SupportsLtGt}
> import org.apache.parquet.hadoop.metadata.ColumnPath
> import org.apache.parquet.io.api.Binary
>
> // Builds filter columns from the literal column name via ColumnPath.get
> // (a single path segment), instead of ColumnPath.fromDotString.
> private[parquet] object ParquetColumns {
>
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
>     new Column[Integer](ColumnPath.get(columnPath), classOf[Integer])
>       with SupportsLtGt
>   }
>
>   def longColumn(columnPath: String): Column[java.lang.Long] with SupportsLtGt = {
>     new Column[java.lang.Long](ColumnPath.get(columnPath), classOf[java.lang.Long])
>       with SupportsLtGt
>   }
>
>   def floatColumn(columnPath: String): Column[java.lang.Float] with SupportsLtGt = {
>     new Column[java.lang.Float](ColumnPath.get(columnPath), classOf[java.lang.Float])
>       with SupportsLtGt
>   }
>
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with SupportsLtGt = {
>     new Column[java.lang.Double](ColumnPath.get(columnPath), classOf[java.lang.Double])
>       with SupportsLtGt
>   }
>
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with SupportsEqNotEq = {
>     new Column[java.lang.Boolean](ColumnPath.get(columnPath), classOf[java.lang.Boolean])
>       with SupportsEqNotEq
>   }
>
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
>     new Column[Binary](ColumnPath.get(columnPath), classOf[Binary])
>       with SupportsLtGt
>   }
> }
> {code}
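> For example, the {{IsNotNull}} case could then build its predicate from the 
> single-segment column (a sketch, assuming the object above is in scope):
> {code}
> import org.apache.parquet.filter2.predicate.FilterApi
>
> // "col.dots" stays one path segment, so the predicate matches the
> // actual column in the file.
> val predicate = FilterApi.notEq(
>   ParquetColumns.intColumn("col.dots"), null.asInstanceOf[Integer])
> {code}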
> Neither sounds ideal. Please let me know if anyone has a better idea.


