[jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned

Eyal Farago (JIRA) Sat, 03 Nov 2018 00:50:23 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673957#comment-16673957
 ]


Eyal Farago commented on SPARK-25548:
-------------------------------------

[~eaton], I think there are two possible approaches to handle this:

first would be extracting the partitions predicate and _And_ing it with the 
original predicate:
{code:java}
select * from src_par where 
(P_d in (2,3)) and 
((p_d=2 and key=2) or (p_d=3 and key=3))
{code}
second approach would be transforming this into a union:
{code:java}
select * from src_par where (p_d=2 and key=2) 
UNION ALL
select * from src_par where (p_d=3 and key=3)
{code}
I think second approach is easier to implement but it'd require additional 
rules to make sure partitioned are not scanned multiple times, ie. consider 
what'd happen if your predicate looked like this:
{code:java}
 (p_d=2 and key=2) or (p_d=3 and key=3) or (p_d=2 and key=33)
{code}
a naive approach would scan partition #2 twice while it's pretty obvious this 
can be avoided by _OR_ing the first and third conditions.

The first approach seems a bit more complicated by I think it somewhat 
resembles what you've started implementing in your pr, [~cloud_fan] your 
thoughts?

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) to make the partition can 
> be pruned
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25548
>                 URL: https://issues.apache.org/jira/browse/SPARK-25548
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: eaton
>            Assignee: Apache Spark
>            Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use partition filter and non partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The sql below will scan all the partition files, in which, the partition 
> **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned

Reply via email to