[
https://issues.apache.org/jira/browse/SPARK-47444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Miklos Szurap updated SPARK-47444:
----------------------------------
Description:
SPARK-30262 resolved/avoided the NumberFormatException in Spark when the
"totalSize", "numRows", or "rawDataSize" table properties are empty. However,
the table stats can still be set (intentionally or by mistake) to an
invalid/empty value through Spark SQL with an ALTER TABLE statement:
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
Spark should validate Spark SQL "alter table" statements and not allow
non-numeric values in the "totalSize", "numRows", "rawDataSize" table
properties.
Although the NumberFormatException no longer occurs in Spark 3.x, these
table stats should have numeric values; non-numeric values may cause problems
in other applications.
Note: beeline/Hive validates alter table statements.
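The requested validation could be sketched as below. This is only an illustration, not Spark code: the object StatsPropertyValidator and the method invalidStats are hypothetical names, and the check assumes that "numeric" means anything scala.math.BigInt can parse.

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark: flags non-numeric values for the
// statistics properties named in this issue.
object StatsPropertyValidator {
  // Property names taken from the issue description.
  val NumericStatsKeys: Set[String] = Set("totalSize", "numRows", "rawDataSize")

  // Assumption: a value is valid if BigInt can parse it.
  private def isNumeric(value: String): Boolean = Try(BigInt(value)).isSuccess

  /** Returns one error message per statistics property with a non-numeric value. */
  def invalidStats(props: Map[String, String]): Seq[String] =
    props.toSeq.collect {
      case (key, value) if NumericStatsKeys.contains(key) && !isNumeric(value) =>
        s"Table property '$key' must be numeric, got '$value'"
    }
}
```

An ALTER TABLE ... SET TBLPROPERTIES handler could call invalidStats on the requested properties and reject the statement when the result is non-empty. Properties outside the three statistics keys are left unchecked.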
was:
A Hive table cannot be accessed / queried / updated from Spark (it is
completely "broken") if the "numRows" table property (table stat) is populated
with a non-numeric value (like an empty string). Accessing the table from Spark
results in a "NumberFormatException":
{code}
scala> spark.sql("select * from t1p").show()
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:420)
...
at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243)
...
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91)
...
{code}
or, similarly, with a non-numeric value such as "Foo":
{code}
java.lang.NumberFormatException: For input string: "Foo"
{code}
Currently the table stats can be broken through Spark with
{code}
scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show()
{code}
Spark should:
1. Validate sparkSQL "alter table" statements and not allow non-numeric values
in the "totalSize", "numRows", "rawDataSize" table properties, as those are
checked in the
[HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28]
2. Probably tolerate these wrong "totalSize", "numRows", "rawDataSize" table
properties in HiveClientImpl#readHiveStats and treat them as zero instead of
failing with a cryptic NumberFormatException. At the very least, the error
message should indicate which table property is incorrect.
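The two alternatives in item 2 might look roughly like this. It is a sketch under stated assumptions, not the actual readHiveStats code: TolerantStats, readStatOrZero and readStatOrExplain are hypothetical names, and the table properties are modeled as a plain Map[String, String].

```scala
import scala.util.Try

// Hypothetical sketch, not the real HiveClientImpl#readHiveStats.
object TolerantStats {
  /** Lenient variant: a missing or non-numeric value is treated as zero. */
  def readStatOrZero(props: Map[String, String], key: String): BigInt =
    props.get(key).flatMap(v => Try(BigInt(v)).toOption).getOrElse(BigInt(0))

  /** Strict variant: still reports a failure, but names the offending property. */
  def readStatOrExplain(props: Map[String, String], key: String): Either[String, Option[BigInt]] =
    props.get(key) match {
      case None => Right(None) // property not set at all
      case Some(v) =>
        Try(BigInt(v)).toOption match {
          case Some(n) => Right(Some(n))
          case None    => Left(s"Table property '$key' has non-numeric value '$v'")
        }
    }
}
```

The lenient variant keeps the table readable at the cost of hiding the bad value. The strict variant preserves the failure but replaces "Zero length BigInteger" with a message that identifies the property.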
Note: beeline/Hive validates alter table statements. However, Impala can
similarly break the table, so item #1 above needs to be fixed there too.
I have checked only the Spark 2.4.x behavior; the same probably exists in Spark
3.x too.
> Empty numRows table stats should not break Hive tables
> ------------------------------------------------------
>
> Key: SPARK-47444
> URL: https://issues.apache.org/jira/browse/SPARK-47444
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8
> Reporter: Miklos Szurap
> Priority: Major
> Labels: Hive, HiveMetaStoreClient, SQL