[ https://issues.apache.org/jira/browse/SPARK-34285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274458#comment-17274458 ]

Attila Zsolt Piros edited comment on SPARK-34285 at 1/31/21, 4:50 AM:
----------------------------------------------------------------------

[~Xudingyu] predicate pushdown is extremely useful when a column group can be
dropped altogether.

To support this, statistics containing the min and max value are stored in the
Parquet file for each group.
In case of "StringStartsWith" you can see that dropping a column group is an easy
decision (let's say the min is "BBB" and the max is "EEE" in the current column
group):
- when the prefix sorts after the max (e.g. "F.*"), or
- when the prefix sorts before the min (e.g. "A.*"),
you can safely drop the whole column group (see the sketch below).
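For illustration, here is a minimal sketch of that pruning decision (the object and method names are made up, this is not Spark's actual ParquetFilters code), assuming the group's min/max statistics are available as plain strings:

{code:scala}
// Sketch only: StartsWith pruning against per-group min/max statistics.
// Hypothetical helper, not the real Spark / Parquet API.
object StartsWithPruning {

  /** Returns true when the whole group can be safely skipped for the given prefix. */
  def canDropGroup(prefix: String, min: String, max: String): Boolean = {
    val len = prefix.length
    // Every value v in the group satisfies min <= v <= max, and truncating to the
    // first `len` characters preserves that ordering, so:
    //  - if even max's first chars sort below the prefix, no value can start with it
    //  - if min's first chars already sort above the prefix, no value can start with it
    val maxTooSmall = max.take(len).compareTo(prefix) < 0
    val minTooLarge = min.take(len).compareTo(prefix) > 0
    maxTooSmall || minTooLarge
  }
}

// With min = "BBB" and max = "EEE" in the current group:
//   StartsWithPruning.canDropGroup("F", "BBB", "EEE")  // true  - prefix sorts after max
//   StartsWithPruning.canDropGroup("A", "BBB", "EEE")  // true  - prefix sorts before min
//   StartsWithPruning.canDropGroup("C", "BBB", "EEE")  // false - a matching value may exist
{code}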

Regarding "StringEndsWith" and "StringContains" you cannot make any such
decision based on the min and max value (where the min and max come from the
lexicographical ordering of the strings): a value lying anywhere between the
min and the max may or may not end with, or contain, the searched substring.
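As a concrete (made-up) example: both values below lie between the min "BBB" and the max "EEE", yet only one of them ends with the searched suffix, so the statistics alone can never justify skipping the group:

{code:scala}
// Counterexample sketch: min/max statistics say nothing about EndsWith/Contains.
val min = "BBB"
val max = "EEE"

// Both candidates are inside [min, max], but they disagree on the predicate.
Seq("CCC", "CCCfoo").foreach { v =>
  val inRange  = v.compareTo(min) >= 0 && v.compareTo(max) <= 0
  val endsWith = v.endsWith("foo")
  println(s"$v: inRange=$inRange, endsWith(foo)=$endsWith")
}
// Prints:
//   CCC: inRange=true, endsWith(foo)=false
//   CCCfoo: inRange=true, endsWith(foo)=true
{code}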




> Implement Parquet StringEndsWith, StringContains Filter
> ------------------------------------------------------
>
>                 Key: SPARK-34285
>                 URL: https://issues.apache.org/jira/browse/SPARK-34285
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xudingyu
>            Priority: Major
>
> When creating ParquetFilters, currently only the following is implemented:
> {code:java}
> case sources.StringStartsWith(name, prefix)
> {code}
> But StringEndsWith and StringContains also exist in 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala
> We can implement these two filters, and rename 
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED 
> {code}
>  to
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_ENABLED 
> {code}


