Miklos Szurap created SPARK-47444:
-------------------------------------

             Summary: Empty numRows table stats should not break Hive tables
                 Key: SPARK-47444
                 URL: https://issues.apache.org/jira/browse/SPARK-47444
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.8
            Reporter: Miklos Szurap


A Hive table cannot be accessed, queried, or updated from Spark (it is 
completely "broken") if the "numRows" table property (a table statistic) is 
populated with a non-numeric value, such as an empty string. Accessing the 
table from Spark results in a "NumberFormatException":
{code}
scala> spark.sql("select * from t1p").show()
java.lang.NumberFormatException: Zero length BigInteger
  at java.math.BigInteger.<init>(BigInteger.java:420)
...
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
...
  at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
...
{code}
or, similarly, just with:
{code}
java.lang.NumberFormatException: For input string: "Foo"
{code}
Currently the table stats can be broken through Spark with
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
 
Spark should:
1. Validate Spark SQL "alter table" statements and reject non-numeric values 
in the "totalSize", "numRows", and "rawDataSize" table properties, as those 
are the properties checked in 
[HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28].
2. HiveClientImpl#readHiveStats should probably tolerate such broken 
"totalSize", "numRows", and "rawDataSize" table properties and treat them as 
zero instead of failing with a cryptic NumberFormatException. At the very 
least, the error message should indicate which table property is incorrect.

Note: beeline/Hive validates alter table statements; however, Impala can 
similarly break the table, so item #1 above needs to be addressed on the 
Impala side as well.
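The validation in item #1 could be sketched roughly as below. The 
`validateStatProperties` helper and its placement are assumptions for 
illustration; the real check would live in Spark's ALTER TABLE ... SET 
TBLPROPERTIES handling:
{code}
import scala.util.Try

// Stats properties read by HiveClientImpl#readHiveStats.
val statKeys = Set("totalSize", "numRows", "rawDataSize")

// Hypothetical check: reject non-numeric values for the stats properties
// before the ALTER TABLE ... SET TBLPROPERTIES change is applied.
def validateStatProperties(props: Map[String, String]): Unit =
  props.foreach { case (key, value) =>
    if (statKeys.contains(key) && Try(BigInt(value.trim)).isFailure) {
      throw new IllegalArgumentException(
        s"Table property '$key' must be a numeric value, got '$value'")
    }
  }

validateStatProperties(Map("numRows" -> "42"))  // passes
// validateStatProperties(Map("numRows" -> ""))  // would throw
{code}
This would turn the silent corruption into an immediate, descriptive error at 
the point where the bad value is set.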

I have checked only the Spark 2.4.x behavior; the same issue probably exists 
in Spark 3.x too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
