[ 
https://issues.apache.org/jira/browse/SPARK-46460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhou Tong updated SPARK-46460:
------------------------------
    Description: 
SQL:select * from test_db.test_table where day between date_sub('2023-12-01',1) 
and  '2023-12-03'

The Physical Plan of sql above will implement _cast_ function on partition col 
'day',  like this, {_}cast(day as date) > 2023-11-30{_}. In this situation, 
spark just pass the filter condition _day < "2023-12-03"_ to HiveMetastore, not 
including filter condition {_}cast(day as date) > 2023-11-30{_}, which may lead 
performance of HMS degarde if the HiveTable has huge number of partitions.

 

In this regard, a new rule may solve this problem. This rule can convert binary 
comparison _cast(day as date) > 2023-11-30_ to {_}day > cast(2023-11-30 as 
string){_}. The right node is foldable, so the result is {_}day > 
"2023-11-30"{_}, and the filter condition passed to HMS will be _day > 
"2023-11-30" and_ _day < "2023-12-03"._

 

I am not sure that new rule is necessary, but I will still provide relevant 
patch.

  was:
SQL:select * from test_db.test_table where day between date_sub('2023-12-01',1) 
and  '2023-12-03'

The Physical Plan of sql above will implement 'cast' function on partition col 
'day',  like this, "cast(day as date) > 2023-11-30". In this situation, spark 
will pass the filter condition "day < 2023-12-03" to HiveMetastore, which may 
lead performance of HMS degarde if the HiveTable has huge number of partitions.

 


> The filter of partition includes cast function may lead the partition pruning 
> to disable
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-46460
>                 URL: https://issues.apache.org/jira/browse/SPARK-46460
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 3.2.0
>            Reporter: Zhou Tong
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SQL:select * from test_db.test_table where day between 
> date_sub('2023-12-01',1) and  '2023-12-03'
> The Physical Plan of sql above will implement _cast_ function on partition 
> col 'day',  like this, {_}cast(day as date) > 2023-11-30{_}. In this 
> situation, spark just pass the filter condition _day < "2023-12-03"_ to 
> HiveMetastore, not including filter condition {_}cast(day as date) > 
> 2023-11-30{_}, which may lead performance of HMS degarde if the HiveTable has 
> huge number of partitions.
>  
> In this regard, a new rule may solve this problem. This rule can convert 
> binary comparison _cast(day as date) > 2023-11-30_ to {_}day > 
> cast(2023-11-30 as string){_}. The right node is foldable, so the result is 
> {_}day > "2023-11-30"{_}, and the filter condition passed to HMS will be _day 
> > "2023-11-30" and_ _day < "2023-12-03"._
>  
> I am not sure that new rule is necessary, but I will still provide relevant 
> patch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to