[ https://issues.apache.org/jira/browse/SPARK-34285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274458#comment-17274458 ]
Attila Zsolt Piros edited comment on SPARK-34285 at 1/31/21, 4:50 AM:
----------------------------------------------------------------------

[~Xudingyu] predicate pushdown is most useful when a whole row group can be skipped. To support this, Parquet stores statistics, including the min and max value, for each column in every row group.

For "StringStartsWith", deciding whether to skip a row group is easy. Say the min is "BBB" and the max is "EEE" in the current row group:
- when the pattern sorts after the max (e.g. "F.*"), or
- when the pattern sorts before the min (e.g. "A.*"),

you can safely skip the whole row group.

For "StringEndsWith" and "StringContains" you cannot make any decision based on the min and max values (where the min and max come from the lexicographical ordering of the strings).


was (Author: attilapiros):
[~Xudingyu] predicate pushdown is most useful when a whole row group can be skipped. To support this, Parquet stores statistics, including the min and max value, for each column in every row group.

For "StringStartsWith", deciding whether to skip a row group is easy. Say the min is "BBB" and the max is "EEE" in the current row group:
- when the pattern sorts after the max (e.g. "F.*"), or
- when the pattern sorts before the min (e.g. "A.*"),

you can safely skip the whole row group.

For "StringEndsWith" and "StringContains" you cannot make any decision based on the min and max values.
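The min/max reasoning in the comment can be sketched in a few lines of Java. This is only an illustration, not Spark's or Parquet's actual code: the class `PrefixPruning` and its methods are hypothetical names, and it compares plain `String` values with `compareTo`, whereas Parquet compares binary values. The idea is to truncate the min and max to the prefix length and skip the group when the prefix falls outside that range. The `main` method also shows why the same trick does not work for a suffix: two groups with identical min and max can differ on whether any value ends with a given string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only.
final class PrefixPruning {
    // Returns at most the first n characters of s.
    static String head(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    // True when no value in [min, max] can start with prefix,
    // so the whole group can be skipped without reading it.
    static boolean canDropStartsWith(String min, String max, String prefix) {
        int n = prefix.length();
        return head(max, n).compareTo(prefix) < 0   // prefix sorts after the max
            || head(min, n).compareTo(prefix) > 0;  // prefix sorts before the min
    }

    public static void main(String[] args) {
        System.out.println(canDropStartsWith("BBB", "EEE", "F")); // true
        System.out.println(canDropStartsWith("BBB", "EEE", "A")); // true
        System.out.println(canDropStartsWith("BBB", "EEE", "C")); // false

        // Two groups with the same min ("BBB") and max ("EEE") ...
        List<String> a = Arrays.asList("BBB", "CXZ", "EEE");
        List<String> b = Arrays.asList("BBB", "CXY", "EEE");
        System.out.println(Collections.min(a).equals(Collections.min(b))); // true
        System.out.println(Collections.max(a).equals(Collections.max(b))); // true
        // ... but only one of them contains a value ending with "Z",
        // so min and max alone cannot decide an ends-with filter.
        System.out.println(a.stream().anyMatch(s -> s.endsWith("Z"))); // true
        System.out.println(b.stream().anyMatch(s -> s.endsWith("Z"))); // false
    }
}
```

A `false` result from `canDropStartsWith` does not mean the group contains a match, only that the statistics cannot rule one out, so the group still has to be read and filtered.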

> Implement Parquet StringEndsWith、StringContains Filter
> ------------------------------------------------------
>
> Key: SPARK-34285
> URL: https://issues.apache.org/jira/browse/SPARK-34285
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Xudingyu
> Priority: Major
>
> When creating parquetFilters, only the following is currently implemented:
> {code:java}
> case sources.StringStartsWith(name, prefix)
> {code}
> But StringEndsWith and StringContains also exist in
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala
> We can implement these two filters, and rename
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED
> {code}
> to
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_ENABLED
> {code}