Miklos Szurap created SPARK-47444:
-------------------------------------

             Summary: Empty numRows table stats should not break Hive tables
                 Key: SPARK-47444
                 URL: https://issues.apache.org/jira/browse/SPARK-47444
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.8
            Reporter: Miklos Szurap
A Hive table cannot be accessed, queried, or updated from Spark (it is completely "broken") if the "numRows" table property (a table stat) is populated with a non-numeric value, such as an empty string. Accessing the table from Spark results in a NumberFormatException:
{code}
scala> spark.sql("select * from t1p").show()
java.lang.NumberFormatException: Zero length BigInteger
  at java.math.BigInteger.<init>(BigInteger.java:420)
  ...
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
  ...
  at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
  ...
{code}
or similarly just with
{code}
java.lang.NumberFormatException: For input string: "Foo"
{code}
Currently the table stats can be broken through Spark with:
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
Spark should:
1. Validate Spark SQL "alter table" statements and not allow non-numeric values in the "totalSize", "numRows", and "rawDataSize" table properties, as those are the properties checked in [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28].
2. Make HiveClientImpl#readHiveStats tolerate malformed "totalSize", "numRows", and "rawDataSize" table properties and treat them as zero, instead of failing with a cryptic NumberFormatException. At the very least, the error message should indicate which table property is incorrect.

Note: beeline/Hive validates alter table statements; however, Impala can break the table in the same way, so item #1 needs to be fixed there too. I have checked only the Spark 2.4.x behavior; the same issue probably exists in Spark 3.x as well.
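The lenient parsing suggested in item #2 could be sketched as follows. This is only an illustration, not the actual Spark code: {{parseStatLeniently}} is a hypothetical helper showing how readHiveStats-style code could treat empty or non-numeric stat values as absent rather than throwing:

{code}
import scala.util.Try

object HiveStatsSketch {
  // Hypothetical helper: parse a Hive stat property ("numRows",
  // "totalSize", "rawDataSize") leniently. Empty, non-numeric, or
  // negative values yield None instead of a NumberFormatException.
  def parseStatLeniently(props: Map[String, String], key: String): Option[BigInt] =
    props.get(key)
      .map(_.trim)
      .filter(_.nonEmpty)                        // tolerate '' (the reported case)
      .flatMap(s => Try(BigInt(s)).toOption)     // tolerate "Foo" etc.
      .filter(_ >= 0)

  def main(args: Array[String]): Unit = {
    val props = Map("numRows" -> "", "totalSize" -> "1024", "rawDataSize" -> "Foo")
    println(parseStatLeniently(props, "numRows"))     // empty string tolerated
    println(parseStatLeniently(props, "totalSize"))   // valid value parsed
    println(parseStatLeniently(props, "rawDataSize")) // non-numeric tolerated
  }
}
{code}

With today's code, {{BigInt("")}} is what produces the "Zero length BigInteger" message seen in the stack trace above; wrapping the conversion in {{Try}} converts both failure modes into a missing stat.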
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org