[jira] [Updated] (SPARK-45387) Optimize hive patition filter when the comparision dataType not match
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-45387: - Affects Version/s: 3.5.1 > Optimize hive patition filter when the comparision dataType not match > - > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0, 3.5.0, 3.5.1 >Reporter: TianyiMa >Priority: Critical > Labels: pull-request-available > Attachments: PruneFileSourcePartitions.diff > > > Suppose we have a partitioned table `table_pt` with partition colum `dt` > which is StringType and the table metadata is managed by Hive Metastore, if > we filter partition by dt = '123', this filter can be pushed down to data > source directly, but if the filter condition is number, e.g. dt = 123, Spark > will not known which partition should be pushed down. Thus in the process of > physical plan optimization, Spark will pull all of that table's partition > meta data to client side, to decide which partition filter should be push > down to the data source. This is poor of performance if the table has > thousands of partitions and increasing the risk of hive metastore oom. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45387) Optimize hive patition filter when the comparision dataType not match
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-45387: - Affects Version/s: 3.5.0 > Optimize hive patition filter when the comparision dataType not match > - > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0, 3.5.0 >Reporter: TianyiMa >Priority: Critical > Labels: pull-request-available > Attachments: PruneFileSourcePartitions.diff > > > Suppose we have a partitioned table `table_pt` with partition colum `dt` > which is StringType and the table metadata is managed by Hive Metastore, if > we filter partition by dt = '123', this filter can be pushed down to data > source directly, but if the filter condition is number, e.g. dt = 123, Spark > will not known which partition should be pushed down. Thus in the process of > physical plan optimization, Spark will pull all of that table's partition > meta data to client side, to decide which partition filter should be push > down to the data source. This is poor of performance if the table has > thousands of partitions and increasing the risk of hive metastore oom. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45387) Optimize hive patition filter when the comparision dataType not match
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45387: --- Labels: pull-request-available (was: ) > Optimize hive patition filter when the comparision dataType not match > - > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0 >Reporter: TianyiMa >Priority: Critical > Labels: pull-request-available > Attachments: PruneFileSourcePartitions.diff > > > Suppose we have a partitioned table `table_pt` with partition colum `dt` > which is StringType and the table metadata is managed by Hive Metastore, if > we filter partition by dt = '123', this filter can be pushed down to data > source directly, but if the filter condition is number, e.g. dt = 123, Spark > will not known which partition should be pushed down. Thus in the process of > physical plan optimization, Spark will pull all of that table's partition > meta data to client side, to decide which partition filter should be push > down to the data source. This is poor of performance if the table has > thousands of partitions and increasing the risk of hive metastore oom. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45387) Optimize hive patition filter when the comparision dataType not match
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-45387: - Summary: Optimize hive patition filter when the comparision dataType not match (was: Partition key filter cannot be pushed down when using cast) > Optimize hive patition filter when the comparision dataType not match > - > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0 >Reporter: TianyiMa >Priority: Critical > Attachments: PruneFileSourcePartitions.diff > > > Suppose we have a partitioned table `table_pt` with partition colum `dt` > which is StringType and the table metadata is managed by Hive Metastore, if > we filter partition by dt = '123', this filter can be pushed down to data > source directly, but if the filter condition is number, e.g. dt = 123, Spark > will not known which partition should be pushed down. Thus in the process of > physical plan optimization, Spark will pull all of that table's partition > meta data to client side, to decide which partition filter should be push > down to the data source. This is poor of performance if the table has > thousands of partitions and increasing the risk of hive metastore oom. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org