[ 
https://issues.apache.org/jira/browse/SPARK-47444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szurap updated SPARK-47444:
----------------------------------
    Description: 
SPARK-30262 resolved/avoided the NumberFormatException in Spark when the 
"totalSize", "numRows", "rawDataSize" table properties are empty. However, the 
table stats can still be set (intentionally or by mistake) to an invalid/empty 
value through Spark SQL with an ALTER TABLE statement:
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
 
Spark should validate Spark SQL "alter table" statements and reject 
non-numeric values in the "totalSize", "numRows", "rawDataSize" table 
properties.
Although the NumberFormatException no longer occurs in Spark 3.x, these 
table stats are expected to be numeric, and non-numeric values may cause 
problems in other applications.
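
A minimal sketch of the kind of check proposed above, written as standalone Scala rather than as a patch against Spark. The object and method names (StatsPropertyValidator, invalidStats) are hypothetical and do not exist in Spark; only the three property names come from this issue. It treats a value as valid if it parses as an integer:

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark: flags stats properties whose
// values cannot be parsed as integers.
object StatsPropertyValidator {
  // Property names taken from the issue description.
  val StatsKeys: Set[String] = Set("totalSize", "numRows", "rawDataSize")

  // Returns the subset of props that are stats keys with non-numeric values.
  def invalidStats(props: Map[String, String]): Map[String, String] =
    props.filter { case (key, value) =>
      StatsKeys.contains(key) && !isInteger(value)
    }

  // BigInt throws NumberFormatException for "" and for non-digit input;
  // Try also absorbs a null value.
  private def isInteger(value: String): Boolean =
    Try(BigInt(value.trim)).isSuccess
}
```

With the properties from the statement above, invalidStats would return only the 'numRows' entry, which a caller could use to reject the ALTER TABLE with a message naming the offending property.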

Note: beeline/Hive validates alter table statements.

  was:
A Hive table cannot be accessed / queried / updated from Spark (it is 
completely "broken") if the "numRows" table property (table stat) is populated 
with a non-numeric value (like an empty string). Accessing the table from Spark 
results in a "NumberFormatException":
{code}
scala> spark.sql("select * from t1p").show()
java.lang.NumberFormatException: Zero length BigInteger
  at java.math.BigInteger.<init>(BigInteger.java:420)
...
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
...
  at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
...
{code}
or, similarly, with a non-empty invalid value:
{code}
java.lang.NumberFormatException: For input string: "Foo"
{code}
Currently, the table stats can be broken through Spark with:
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
 
Spark should:
1. Validate Spark SQL "alter table" statements and not allow non-numeric values 
in the "totalSize", "numRows", "rawDataSize" table properties, as those are 
checked in 
[HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28]
2. Make HiveClientImpl#readHiveStats tolerate invalid "totalSize", "numRows", 
"rawDataSize" table properties and treat them as zero instead of failing with a 
cryptic NumberFormatException. At the very least, the error message should 
indicate which table property is incorrect.
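
The two options in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the actual readHiveStats code: TolerantStats, parseStat and statOrZero are hypothetical names, and the error message wording is invented for the example.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of Spark.
object TolerantStats {
  // Strict variant: on a malformed value, returns Left with a message that
  // names the table and the offending property.
  def parseStat(table: String, key: String,
                props: Map[String, String]): Either[String, Option[BigInt]] =
    props.get(key) match {
      case None => Right(None) // property not set at all
      case Some(raw) =>
        Try(BigInt(raw.trim)) match {
          case Success(n) => Right(Some(n))
          case Failure(_) =>
            Left(s"Table $table: property '$key' has non-numeric value '$raw'")
        }
    }

  // Lenient variant: a missing or malformed value is treated as zero.
  def statOrZero(key: String, props: Map[String, String]): BigInt =
    props.get(key).flatMap(v => Try(BigInt(v.trim)).toOption).getOrElse(BigInt(0))
}
```

The strict variant keeps the failure visible but makes it actionable; the lenient variant keeps the table readable at the cost of silently reporting a zero statistic.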

Note: beeline/Hive validates alter table statements. However, Impala can 
similarly break the table, so item #1 above needs to be fixed there too.

I have checked only the Spark 2.4.x behavior; the same problem probably exists 
in Spark 3.x too.


> Empty numRows table stats should not break Hive tables
> ------------------------------------------------------
>
>                 Key: SPARK-47444
>                 URL: https://issues.apache.org/jira/browse/SPARK-47444
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8
>            Reporter: Miklos Szurap
>            Priority: Major
>              Labels: Hive, HiveMetaStoreClient, SQL


