[ 
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051959#comment-17051959
 ] 

Gabor Szadovszky commented on PARQUET-1809:
-------------------------------------------

It would be nice to use string arrays (or, more properly, 
[ColumnPath|https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java]
 objects) instead of the _dot strings_ throughout the code, but that seems to be a 
huge effort. And for the configuration keys mentioned, it is not 
possible.
The problem with '.' characters in column names is that collisions may 
occur with schemas like the following one, where both the nested field and 
the top-level field are addressed by the same dot string, foo.bar:
{code}
message Document {
  required group foo {
    required int64 bar
  }
  required int64 foo.bar
}
{code}
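
The collision in the schema above can be reproduced without Parquet at all. The following is only a minimal sketch in plain Java: the class DotPathCollision and its helpers toDotString and fromDotString are hypothetical names invented for this illustration and are not part of parquet-mr. It shows that the two columns are distinct as string arrays but become indistinguishable once each is flattened to a dot string.

```java
import java.util.Arrays;
import java.util.List;

class DotPathCollision {
    // Hypothetical helper (not a parquet-mr API): joins path parts with '.'.
    static String toDotString(List<String> path) {
        return String.join(".", path);
    }

    // Hypothetical helper (not a parquet-mr API): naive split on '.'.
    static List<String> fromDotString(String dotString) {
        return Arrays.asList(dotString.split("\\."));
    }

    public static void main(String[] args) {
        // Field "bar" nested inside group "foo".
        List<String> nested = Arrays.asList("foo", "bar");
        // Top-level field whose name literally contains a dot.
        List<String> topLevel = Arrays.asList("foo.bar");

        // As arrays, the two paths are different.
        System.out.println(nested.equals(topLevel)); // false

        // As dot strings, they collide.
        System.out.println(toDotString(nested).equals(toDotString(topLevel))); // true

        // Splitting the dot string cannot recover the top-level field.
        System.out.println(fromDotString(toDotString(topLevel))); // [foo, bar]
    }
}
```

An array-based path keeps the part boundaries explicit, so no escaping convention is needed. A dot string would need one to tell the two fields apart.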

>  Add new APIs for nested predicate pushdown
> -------------------------------------------
>
>                 Key: PARQUET-1809
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1809
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: DB Tsai
>            Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* 
> uses *dot* to split a column name into the parts of a nested field path. The 
> drawback is that this causes issues when a field name contains a *dot*.
> The new APIs will take an array of strings directly for the parts of a 
> nested field path, so there is no confusion from using *dot* as a separator.  
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.


