[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles commented on SPARK-21218:
----------------------------------------

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows that is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 
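
To make the proposed rewrite concrete, here is a minimal sketch in plain Python of turning an IN list into the equivalent chain of equality conditions. The helper name in_to_or is hypothetical, it only handles numeric literals, and it is not the optimizer change itself, just an illustration of the transformation.

```python
def in_to_or(column, values):
    """Rewrite `column IN (v1, v2, ...)` as `(column = v1) OR (column = v2) ...`.

    Hypothetical helper for illustration only; assumes numeric literals.
    """
    if not values:
        raise ValueError("IN list must not be empty")
    # One equality condition per value, joined with OR.
    return " OR ".join("({} = {})".format(column, v) for v in values)


predicate = in_to_or("C1", [42, 139])
print(predicate)  # (C1 = 42) OR (C1 = 139)
```

Since DataFrame.filter also accepts a SQL expression string, the result could be passed as df.filter(predicate) to compare against the isin() form.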

> Convert IN predicate to equivalent Parquet filter
> -------------------------------------------------
>
>                 Key: SPARK-21218
>                 URL: https://issues.apache.org/jira/browse/SPARK-21218
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 2.1.1
>            Reporter: Michael Styles
>         Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)


