[ https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583785#comment-17583785 ]
Brendan Morin edited comment on SPARK-34631 at 8/23/22 6:52 PM:
----------------------------------------------------------------

For other users encountering this issue suddenly: make sure you check all partitions, not just the partition you want to query. I'm encountering, and am able to reproduce, this error in the following scenario: trying to query an external Hive table that is partitioned on date, e.g.:
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()
>>> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition...
>>> Caused by: MetaException(message:Filtering is supported only on partition keys of type string){code}
Confirm the data types of the columns:
{code:java}
spark.sql("select * from my_db.my_table").printSchema()
>>> root
>>> |-- a: string (nullable = true)
>>> |-- b: string (nullable = true)
>>> |-- date: date (nullable = true){code}
Check the partitions:
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False)
>>> +-------------------+
>>> |partition          |
>>> +-------------------+
>>> |date=2022-08-07    |
>>> |date=2022-08-08    |
>>> |date=2022-08-08_tmp|  # Note the malformed partition
>>> |date=2022-08-09    |
>>> |date=2022-08-10    |
>>> |date=2022-08-11    |
>>> |date=2022-08-12    |
>>> +-------------------+{code}
This was the problem in my case. There was a date partition (note: the problem partition was not the one I was querying for) that was malformed in the HDFS directory where the Hive external table data was located; its string value could not be parsed into the partition column's data type. Removing this partition from HDFS, then dropping and recreating the table and running MSCK REPAIR, solved the issue.

For additional context, my_db.my_table was managed as an external table. Table updates were done by writing parquet files as partitions and then running drop table, create table, and MSCK repair on the table. For some reason, this write/update process did not fail on the malformed partition, so additional partitions continued to be added; the problem only manifested on read.

I think the root cause in my case is actually an overly broad catch by Spark, and the error handling logic could be refined to identify this root cause, or at least to clue users in that the issue may be a malformed partition name that does not parse into the expected data type (date in this case). The specific error:
{code:java}
Caused by: MetaException(message:Filtering is supported only on partition keys of type string){code}
is a bit of a red herring, as it is not true, and searching for this error will lead you down a rabbit hole of incorrect root causes and unrelated issues.
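As a diagnostic aid, here is a minimal sketch for spotting such partitions programmatically. It assumes an active PySpark session named spark and the example table above; the loop and the date format are illustrative assumptions, not part of the reporter's workflow:
{code:python}
from datetime import datetime

# Flag SHOW PARTITIONS values that do not parse as ISO dates.
for row in spark.sql("show partitions my_db.my_table").collect():
    value = row.partition.split("=", 1)[1]  # e.g. "2022-08-08_tmp"
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        print(f"Malformed partition value: {row.partition}")
{code}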
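And a sketch of the cleanup described above (remove the bad partition directory, recreate the table, re-run MSCK REPAIR). The HDFS path and the DDL are assumptions for illustration, not the reporter's exact commands:
{code:python}
# 1. Remove the malformed partition directory from HDFS first (shell):
#      hdfs dfs -rm -r /warehouse/my_db.db/my_table/date=2022-08-08_tmp
# 2. Recreate the table metadata; dropping an EXTERNAL table leaves the data in place.
spark.sql("DROP TABLE IF EXISTS my_db.my_table")
spark.sql("""
    CREATE EXTERNAL TABLE my_db.my_table (a STRING, b STRING)
    PARTITIONED BY (`date` DATE)
    STORED AS PARQUET
    LOCATION '/warehouse/my_db.db/my_table'
""")
# 3. Rediscover the remaining, well-formed partitions.
spark.sql("MSCK REPAIR TABLE my_db.my_table")
{code}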
> Caught Hive MetaException when querying by partition (partition col starts with '$')
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-34631
>                 URL: https://issues.apache.org/jira/browse/SPARK-34631
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Java API
>    Affects Versions: 2.4.4
>            Reporter: zhouyuan
>            Priority: Critical
>
> Create a table, set the location to the parquet data, and run MSCK REPAIR TABLE to load it.
> But when querying with the partition column, errors occur (adding backticks does not help):
> {code:java}
> select count(*) from some_table where `$partition_date` = '2015-01-01'
> {code}
> {panel:title=error:}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance.
> Please report a bug: https://issues.apache.org/jira/browse/SPARK
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:962)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:174)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:166)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2470)
>   at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:191)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
>   at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
>   ... 49 elided
> Caused by: java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:0 no viable alternative at character '$'
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
>   ... 92 more
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:0 no viable alternative at character '$'
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_filter_result$get_partitions_by_filter_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_filter_result$get_partitions_by_filter_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_filter_result.read(ThriftHiveMetastore.java)
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_filter(ThriftHiveMetastore.java:2216)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_filter(ThriftHiveMetastore.java:2200)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1103)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
>   at com.sun.proxy.$Proxy15.listPartitionsByFilter(Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2254)
>   ... 97 more
> {panel}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org