[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 12:17 PM:
------------------------------------------------------------------

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.
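
To illustrate what "skipping blocks" means here, below is a toy model, not Parquet's actual reader. It assumes each row group carries min/max statistics for C1; the statistics and the `groups_to_read` helper are made up for illustration. With the file sorted on C1 the ranges do not overlap, so a pushed-down equality filter can rule out most row groups.

```python
# Toy model: hypothetical (min, max) statistics for C1, one pair per row group.
# Sorting the file on C1 is what keeps these ranges disjoint.
row_groups = [(0, 99), (100, 199), (200, 299), (300, 399)]


def groups_to_read(row_groups, values=None):
    """Return the indexes of the row groups that have to be read.

    values=None models a filter that was not pushed down: nothing can be
    skipped, so every row group is read and filtered afterwards.
    """
    if values is None:
        return list(range(len(row_groups)))
    # A row group can be skipped when none of the values falls in its range.
    return [
        i
        for i, (lo, hi) in enumerate(row_groups)
        if any(lo <= v <= hi for v in values)
    ]


pushed_down = groups_to_read(row_groups, [42, 139])
not_pushed_down = groups_to_read(row_groups)
```

In this model the pushed-down filter touches two of the four row groups, while the filter that stays in Spark touches all of them.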

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
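
Until the optimizer does this, the same rewrite can be applied by hand. A minimal sketch, assuming a hypothetical helper named `in_to_or` (it is not part of Spark) that folds a list of values into a chain of equality conditions joined with `|`:

```python
from functools import reduce
import operator


def in_to_or(column, values):
    """Build (column == v1) | (column == v2) | ... from a list of values.

    `column` can be any object whose `==` result supports `|`,
    for example a PySpark Column such as df['C1'].
    """
    values = list(values)
    if not values:
        # An empty IN list has no equivalent OR chain.
        raise ValueError("values must not be empty")
    return reduce(operator.or_, (column == v for v in values))
```

With a DataFrame this would be used as `df.filter(in_to_or(df['C1'], [42, 139])).collect()`, which builds the same expression as the OR predicate above.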

I'm seeing about a 50-75% improvement with the OR predicate.


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 

> Convert IN predicate to equivalent Parquet filter
> -------------------------------------------------
>
>                 Key: SPARK-21218
>                 URL: https://issues.apache.org/jira/browse/SPARK-21218
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 2.1.1
>            Reporter: Michael Styles
>         Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)


