[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
TianyiMa updated SPARK-45387: ----------------------------- Description: Suppose we have a partitioned table `table_pt` with partition colum `dt` which is StringType and the table metadata is managed by Hive Metastore, if we filter partition by dt = '123', this filter can be pushed down to data source directly, but if the filter condition is number, e.g. dt = 123, Spark will not known which partition should be pushed down. Thus in the process of physical plan optimization, Spark will pull all of that table's partition meta data to client side, to decide which partition filter should be push down to the data source. This is poor of performance if the table has thousands of partitions and increasing the risk of hive metastore oom. (was: Suppose we have a partitioned table `table_pt` with partition colum `dt` which is StringType and the table metadata is managed by Hive Metastore, if we filter partition by dt = '123', this filter can be pushed down to data source, but if the filter condition is number, e.g. dt = 123, that cannot be pushed down to data source, causing spark to pull all of that table's partition meta data to client, which is poor of performance if the table has thousands of partitions and increasing the risk of hive metastore oom.) > Partition key filter cannot be pushed down when using cast > ---------------------------------------------------------- > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0 > Reporter: TianyiMa > Priority: Critical > Attachments: PruneFileSourcePartitions.diff > > > Suppose we have a partitioned table `table_pt` with partition colum `dt` > which is StringType and the table metadata is managed by Hive Metastore, if > we filter partition by dt = '123', this filter can be pushed down to data > source directly, but if the filter condition is number, e.g. dt = 123, Spark > will not known which partition should be pushed down. Thus in the process of > physical plan optimization, Spark will pull all of that table's partition > meta data to client side, to decide which partition filter should be push > down to the data source. This is poor of performance if the table has > thousands of partitions and increasing the risk of hive metastore oom. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org