icyjhl created SPARK-40610:
------------------------------

             Summary: Spark fall back to use getPartitions instead of 
getPartitionsByFilter when date_add functions used in where clause 
                 Key: SPARK-40610
                 URL: https://issues.apache.org/jira/browse/SPARK-40610
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
         Environment: edw.tmp_test_metastore_usage_source is a big table with 
1000 partitions and hundreds of columns
            Reporter: icyjhl


When I run a insert overwrite statement, I got error saying:

 
{code:java}
MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. 
listPartitions {code}
 

It's weird as I only selected for about 3 partitions, so I rerun the sql and 
checked the metastore, then I found it's fetching all columns in all partitions:

 
{code:java}
select "CD_ID", "COMMENT", "COLUMN_NAME", "TYPE_NAME" from "COLUMNS_V2" where 
"CD_ID" 
in 
(675384,675393,675385,675394,675396,675397,675395,675398,675399,675401,675402,675400,675406……){code}
 

 

After testing, I found the problem is with the date_add function in where 
clause, if remove it ,sql works fine, else metastore would fetch all columns in 
all partitions.

 

 
{code:java}
insert overwrite table test.tmp_test_metastore_usage
SELECT userid
    ,SUBSTR(sendtime,1,10) AS creation_date
    ,cast(json_bh_esdate_deltadays_max as DECIMAL(38,2)) AS 
bh_esdate_deltadays_max
    ,json_bh_qiye_industryphyname AS bh_qiye_industryphyname
    ,cast(json_bh_esdate_deltadays_min as DECIMAL(38,2)) AS 
bh_esdate_deltadays_min
    ,cast(json_bh_subconam_min as DECIMAL(38,2)) AS bh_subconam_min
    ,cast(json_bh_qiye_regcap_min as DECIMAL(38,2)) AS bh_qiye_regcap_min
    ,json_bh_industryphyname AS bh_industryphyname
    ,cast(json_bh_subconam_mean as DECIMAL(38,2)) AS bh_subconam_mean
    ,cast(json_bh_industryphyname_nunique as DECIMAL(38,2)) AS 
bh_industryphyname_nunique
    ,cast(current_timestamp() as string) as dw_cre_date
    ,cast(current_timestamp() as string) as dw_upd_date
FROM (
    SELECT userid
        ,sendtime
        ,json_bh_esdate_deltadays_max
        ,json_bh_qiye_industryphyname
        ,json_bh_esdate_deltadays_min
        ,json_bh_subconam_min
        ,json_bh_qiye_regcap_min
        ,json_bh_industryphyname
        ,json_bh_subconam_mean
        ,json_bh_industryphyname_nunique
        ,row_number() OVER (
            PARTITION BY userid,dt ORDER BY sendtime DESC
            ) rn
    FROM edw.tmp_test_metastore_usage_source
    WHERE dt >= date_add('2022-09-22',-3 )
        AND json_bizid IN ('6101')
        AND json_dingid IN ('611')
    ) t
WHERE rn = 1 {code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to