[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608 ]
Michael Styles edited comment on SPARK-21218 at 6/27/17 10:30 AM:
------------------------------------------------------------------

By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows that is sorted on column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see attachments). Also, the IN predicate test took about 1.1 minutes, while the OR predicate test took about 16 seconds.


was (Author: ptkool):

By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows that is sorted on column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan.


> Convert IN predicate to equivalent Parquet filter
> -------------------------------------------------
>
>                 Key: SPARK-21218
>                 URL: https://issues.apache.org/jira/browse/SPARK-21218
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 2.1.1
>            Reporter: Michael Styles
>         Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
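
To make the proposed rewrite concrete, here is a minimal sketch in plain Python. It is not Spark's or Parquet's actual code: the tuple-based predicate representation and the helper names `in_to_or` and `may_match` are hypothetical and exist only for illustration. It assumes each block carries min/max statistics for a column, and shows how an OR of equality conditions can be checked against those statistics to decide which blocks must be read.

```python
from functools import reduce

# Hypothetical predicate representation (not a Spark or Parquet type):
#   ("eq", column, value)  or  ("or", left, right)


def in_to_or(column, values):
    """Rewrite `column IN (values)` as equality predicates joined by OR."""
    equalities = [("eq", column, v) for v in values]
    if not equalities:
        raise ValueError("IN list must not be empty")
    return reduce(lambda left, right: ("or", left, right), equalities)


def may_match(predicate, stats):
    """Return False only when min/max statistics prove no row can match."""
    kind = predicate[0]
    if kind == "eq":
        _, column, value = predicate
        lo, hi = stats[column]
        return lo <= value <= hi
    if kind == "or":
        return may_match(predicate[1], stats) or may_match(predicate[2], stats)
    raise ValueError(f"unsupported predicate: {kind}")


# Three blocks of a file sorted on C1, described only by (min, max) statistics.
blocks = [{"C1": (0, 99)}, {"C1": (100, 199)}, {"C1": (200, 299)}]

predicate = in_to_or("C1", [42, 139])
blocks_to_read = [i for i, stats in enumerate(blocks) if may_match(predicate, stats)]
print(predicate)
print(blocks_to_read)
```

In this toy layout the third block can be skipped, because neither 42 nor 139 falls inside its min/max range. With sorted data, most blocks would be eliminated the same way, which is consistent with the smaller scan output reported for the OR predicate above.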