[ https://issues.apache.org/jira/browse/SPARK-38454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-38454:
---------------------------------
    Component/s: SQL
                 (was: PySpark)

> Partition Data Type Prevents Filtering Sporadically
> ---------------------------------------------------
>
>                 Key: SPARK-38454
>                 URL: https://issues.apache.org/jira/browse/SPARK-38454
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Christopher
>            Priority: Major
>
> A pipeline (an Airflow DAG) that had been running successfully in +production+ for 72+ hours started failing with the same error on two different queries, the only difference being the table. We believe the root cause is:
> {quote}Caused by: MetaException(message:Filtering is supported only on partition keys of type string){quote}
>
> We have seen this error resolve itself on task retry attempts, but the latest occurrence was not resolved on retries, and all subsequent Airflow DAG runs failed. The queries that trigger this error are:
> {quote}select * from db.cleansed_layer_table where (`dataset`='20220305185000_4d' AND `date_partition`=CAST('2022-03-05' as DATE));
> select * from db.raw_layer_table where (`date_partition`=CAST('2022-03-05' as DATE) AND `dataset`='20220305185000_4d')
> {quote}
>
> The date_partition field was a DATE type when this error started occurring. The task writes and queries the raw layer before the cleansed layer is written and queried.
>
> The first task failure was caused by the cleansed-layer query, and the subsequent failures all occurred on the raw-layer query. The inconsistent behavior of the pipeline is our highest concern; this pipeline had 35 successful DAG runs in Airflow before the failures began.
>
> The error message suggests:
> {quote}You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem{quote}
> but that workaround resulted in too large a performance hit to keep.
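> For reference, the workaround named in the error message is an existing Spark configuration key; it can be set in spark-defaults.conf (or via --conf at submit time). A sketch of the config fragment (the reporter notes the resulting performance hit was too large to keep):
> {quote}
> # spark-defaults.conf
> spark.sql.hive.manageFilesourcePartitions  false
> {quote}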
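The alternative being evaluated is to keep date_partition as a STRING so the predicate never casts a partition key to DATE and the Hive metastore's string-only filter restriction does not apply. A minimal sketch of building such a predicate; the helper name and table are hypothetical, not from the pipeline's code:

```python
# Hypothetical helper: build the partition predicate as plain string
# comparisons. With date_partition stored as a STRING partition column,
# no CAST(... AS DATE) appears in the filter, so the metastore can
# evaluate it (it only supports filtering on string-typed partition keys).
def partition_filter(dataset: str, date_partition: str) -> str:
    return f"`dataset` = '{dataset}' AND `date_partition` = '{date_partition}'"


# Example: the same raw-layer query, rewritten with string comparisons only.
query = (
    "select * from db.raw_layer_table where ("
    + partition_filter("20220305185000_4d", "2022-03-05")
    + ")"
)
print(query)
```

The key property is simply that the generated predicate contains no CAST on a partition key, which is the construct the MetaException rejects.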
>
> We have changed the field to a STRING in our +development+ environment and have had 78 consecutive successful task runs. We have paused that test in favor of filtering only on dataset, which we have just started running.
>
> Is our assessment reasonable that changing the data type of date_partition to STRING will give us higher reliability?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)