[ https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583785#comment-17583785 ]
Brendan Morin edited comment on SPARK-34631 at 8/23/22 6:52 PM:
----------------------------------------------------------------

For other users encountering this issue suddenly: make sure you check all partitions, not just the partition you want to query. I'm encountering, and am able to reproduce, this error in the following scenario: trying to query an external Hive table that is partitioned on date, e.g.:
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()
>>> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition...
>>> Caused by: MetaException(message:Filtering is supported only on partition keys of type string){code}
Confirm the data types of the columns:
{code:java}
spark.sql("select * from my_db.my_table").printSchema()
>>> root
>>> |-- a: string (nullable = true)
>>> |-- b: string (nullable = true)
>>> |-- date: date (nullable = true){code}
Check the partitions:
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False)
>>> +-------------------+
>>> |partition          |
>>> +-------------------+
>>> |date=2022-08-07    |
>>> |date=2022-08-08    |
>>> |date=2022-08-08_tmp|  # Note the malformed partition
>>> |date=2022-08-09    |
>>> |date=2022-08-10    |
>>> |date=2022-08-11    |
>>> |date=2022-08-12    |
>>> +-------------------+{code}
This was the problem in my case. There was a date partition (note: the problem partition was not the one I was querying for) that was malformed in the HDFS directory where the Hive external table data was located; its string value could not be parsed into the partition column's data type. Removing this partition from HDFS, then dropping and recreating the table and running MSCK REPAIR, solved the issue.

For additional context, my_db.my_table was managed as an external table. Table updates were done by writing parquet files as partitions and then running drop table, create table, and MSCK repair on the table. For some reason, this write/update process did not fail on the malformed partition, so additional partitions continued to be added; the problem only manifested on read.

I think the root cause in my case is actually an overly broad catch by Spark, and the error handling logic could be refined to identify this root cause, or at least to clue users in that the issue may be a malformed partition name that does not parse into the expected data type (date in this case). The specific error:
{code:java}
Caused by: MetaException(message:Filtering is supported only on partition keys of type string){code}
is a bit of a red herring, as it is not true, and searching for this error will lead you down a rabbit hole of incorrect root causes and unrelated issues.
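As a diagnostic aid, here is a minimal sketch for spotting such partitions programmatically. It assumes an active PySpark session named spark and the example table above; the loop and the date format are illustrative assumptions, not part of the reporter's workflow:
{code:python}
from datetime import datetime

# Flag SHOW PARTITIONS values that do not parse as ISO dates.
for row in spark.sql("show partitions my_db.my_table").collect():
    value = row.partition.split("=", 1)[1]  # e.g. "2022-08-08_tmp"
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        print(f"Malformed partition value: {row.partition}")
{code}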
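And a sketch of the cleanup described above (remove the bad partition directory, recreate the table, re-run MSCK REPAIR). The HDFS path and the DDL are assumptions for illustration, not the reporter's exact commands:
{code:python}
# 1. Remove the malformed partition directory from HDFS first (shell):
#      hdfs dfs -rm -r /warehouse/my_db.db/my_table/date=2022-08-08_tmp
# 2. Recreate the table metadata; dropping an EXTERNAL table leaves the data in place.
spark.sql("DROP TABLE IF EXISTS my_db.my_table")
spark.sql("""
    CREATE EXTERNAL TABLE my_db.my_table (a STRING, b STRING)
    PARTITIONED BY (`date` DATE)
    STORED AS PARQUET
    LOCATION '/warehouse/my_db.db/my_table'
""")
# 3. Rediscover the remaining, well-formed partitions.
spark.sql("MSCK REPAIR TABLE my_db.my_table")
{code}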
> Caught Hive MetaException when querying by partition (partition col starts with '$')
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-34631
>                 URL: https://issues.apache.org/jira/browse/SPARK-34631
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Java API
>    Affects Versions: 2.4.4
>            Reporter: zhouyuan
>            Priority: Critical
>
> Create a table, set the location to the parquet data, and run MSCK REPAIR TABLE to load it.
> But when querying with the partition column, errors occur (adding backticks does not help):
> {code:java}
> select count(*) from some_table where `$partition_date` = '2015-01-01'
> {code}
> {panel:title=error:}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance.
> Please report a bug: https://issues.apache.org/jira/browse/SPARK
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:962)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:174)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:166)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2470)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:191)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
>   at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
>   ... 49 elided
> Caused by: java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:0 no viable alternative at character '$'
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
>   ... 92 more
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:0 no viable alternative at character '$'
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_filter_result$get_partitions_by_filter_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_filter_result$get_partitions_by_filter_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_filter_result.read(ThriftHiveMetastore.java)
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_filter(ThriftHiveMetastore.java:2216)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_filter(ThriftHiveMetastore.java:2200)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1103)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
>   at com.sun.proxy.$Proxy15.listPartitionsByFilter(Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2254)
>   ... 97 more
> {panel}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org