[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TianyiMa updated SPARK-45387:
-----------------------------
    Description: Suppose we have a partitioned table `table_pt` with a partition 
column `dt` of StringType, whose metadata is managed by the Hive Metastore. If we 
filter partitions by dt = '123', the filter can be pushed down to the data source 
directly. But if the filter literal is a number, e.g. dt = 123, Spark cannot tell 
which partitions to push down. During physical plan optimization, Spark therefore 
pulls all of the table's partition metadata to the client side in order to decide 
which partition filters to push down to the data source. This performs poorly when 
the table has thousands of partitions and increases the risk of a Hive Metastore 
OOM.  (was: Suppose we have a partitioned table `table_pt` with a partition 
column `dt` of StringType, whose metadata is managed by the Hive Metastore. If we 
filter partitions by dt = '123', the filter can be pushed down to the data source, 
but if the filter literal is a number, e.g. dt = 123, it cannot be pushed down, 
causing Spark to pull all of the table's partition metadata to the client. This 
performs poorly when the table has thousands of partitions and increases the risk 
of a Hive Metastore OOM.)
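
A minimal reproduction sketch of the two filter shapes described above, assuming a 
Hive-enabled SparkSession (table and column names follow the description; the 
exact explain output depends on the Spark version):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hive support is required so partition metadata lives in the Hive Metastore.
val spark = SparkSession.builder()
  .appName("SPARK-45387 repro")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  "CREATE TABLE IF NOT EXISTS table_pt (id BIGINT) " +
    "PARTITIONED BY (dt STRING) STORED AS PARQUET")

// String literal: the predicate references the partition column directly,
// so it can be sent to the Hive Metastore as a partition filter.
spark.sql("SELECT * FROM table_pt WHERE dt = '123'").explain(true)

// Numeric literal: implicit type coercion wraps the partition column in a cast
// (roughly cast(dt as a numeric type) = 123). Partition pruning does not
// recognize the cast expression as a metastore-pushable filter, so all
// partition metadata is fetched to the client side and pruned there.
spark.sql("SELECT * FROM table_pt WHERE dt = 123").explain(true)
{code}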

> Partition key filter cannot be pushed down when using cast
> ----------------------------------------------------------
>
>                 Key: SPARK-45387
>                 URL: https://issues.apache.org/jira/browse/SPARK-45387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>            Reporter: TianyiMa
>            Priority: Critical
>         Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt` 
> of StringType, whose metadata is managed by the Hive Metastore. If we filter 
> partitions by dt = '123', the filter can be pushed down to the data source 
> directly. But if the filter literal is a number, e.g. dt = 123, Spark cannot 
> tell which partitions to push down. During physical plan optimization, Spark 
> therefore pulls all of the table's partition metadata to the client side in 
> order to decide which partition filters to push down to the data source. This 
> performs poorly when the table has thousands of partitions and increases the 
> risk of a Hive Metastore OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
