[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

Michael Styles (JIRA) Tue, 27 Jun 2017 03:31:48 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ]


Michael Styles edited comment on SPARK-21218 at 6/27/17 10:30 AM:
------------------------------------------------------------------

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter[df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter[df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -------------------------------------------------
>
>                 Key: SPARK-21218
>                 URL: https://issues.apache.org/jira/browse/SPARK-21218
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 2.1.1
>            Reporter: Michael Styles
>         Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

Reply via email to