[ 
https://issues.apache.org/jira/browse/SPARK-38454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38454:
---------------------------------
    Component/s: SQL
                     (was: PySpark)

> Partition Data Type Prevents Filtering Sporadically
> ---------------------------------------------------
>
>                 Key: SPARK-38454
>                 URL: https://issues.apache.org/jira/browse/SPARK-38454
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Christopher
>            Priority: Major
>
> A pipeline (an Airflow DAG) that had been running successfully in 
> +production+ for 72+ hours has started failing with the same error on two 
> queries that differ only in the table they read. We believe the 
> root of the error is 
> {quote}Caused by: MetaException(message:Filtering is supported only on 
> partition keys of type string){quote}
>  
> We've seen this error resolve itself on task retry attempts, but the latest 
> occurrence was not resolved by retries, and all subsequent Airflow DAG runs 
> failed. The queries that trigger this error are 
> {quote}select * from db.cleansed_layer_table  where 
> (`dataset`='20220305185000_4d' AND `date_partition`=CAST('2022-03-05' as 
> DATE));
> select * from db.raw_layer_table  where (`date_partition`=CAST('2022-03-05' 
> as DATE) AND `dataset`='20220305185000_4d')
> {quote}
>  
> The date_partition field was a DATE type when this error started occurring. 
> The task writes and queries the raw layer before the cleansed layer is 
> written & queried.
>  
> The first task failure was caused by the cleansed layer query, and the 
> subsequent ones all failed on the raw layer query. The inconsistent behavior 
> of the pipeline is our main concern; there were 35 successful Airflow DAG 
> runs of this pipeline.
>  
> The error message suggests
> {quote}{{You can set the Spark configuration setting 
> spark.sql.hive.manageFilesourcePartitions to false to work around this 
> problem}}
> {quote}
> but that setting caused too large a performance hit to keep. 
>  
> We've changed the field to a STRING in our +development+ environment, and 
> have had 78 consecutive successful task runs. We've paused that test in 
> favor of filtering only on dataset, which we have just started running.
>  
> Is our assessment that we will experience higher reliability by changing the 
> data type of date_partition to STRING reasonable?
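
The two mitigations described in the report can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the reporter's code: the helper names `workaround_submit_args` and `string_partition_query` are hypothetical, the STRING-typed date_partition is assumed to hold ISO `YYYY-MM-DD` values, and the table and dataset values are assumed to come from trusted pipeline code rather than user input. Nothing here establishes that either approach removes the sporadic failure.

```python
from datetime import date
from typing import List

# Configuration key quoted in the error message; "false" is the suggested workaround.
WORKAROUND_KEY = "spark.sql.hive.manageFilesourcePartitions"


def workaround_submit_args(enabled: bool = False) -> List[str]:
    """Return spark-submit arguments that set the workaround configuration.

    Hypothetical helper: it only formats a standard `--conf key=value` pair.
    """
    return ["--conf", f"{WORKAROUND_KEY}={str(enabled).lower()}"]


def string_partition_query(table: str, dataset: str, day: date) -> str:
    """Build the report's query against a STRING-typed date_partition column.

    The date is compared as an ISO 'YYYY-MM-DD' string literal, so the
    predicate contains no CAST to DATE. Inputs are assumed to be trusted.
    """
    return (
        f"select * from {table} where "
        f"(`dataset`='{dataset}' AND `date_partition`='{day.isoformat()}')"
    )


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + workaround_submit_args()))
    print(string_partition_query("db.raw_layer_table", "20220305185000_4d", date(2022, 3, 5)))
```

The resulting query string could be passed to `spark.sql(...)` in place of the CAST-based predicate; whether the metastore then handles the filter reliably would still need to be confirmed by the kind of repeated task runs described above.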



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
