sarutak commented on PR #55550: URL: https://github.com/apache/spark/pull/55550#issuecomment-4349946083
Hi @shrirangmhalgi,

`numRows`, `totalSize`, and `rawDataSize` are Hive Metastore's internal statistics properties, populated by `ANALYZE TABLE` and not intended to be set by users via `SET TBLPROPERTIES`.

Also, Spark manages its own statistics under the `spark.sql.statistics.*` prefix (users are blocked from setting `spark.sql.*` keys by `HiveExternalCatalog.verifyTableProperties()`). The prefix-less Hive keys (`numRows` etc.) are a separate namespace that Spark does not use for its optimizer when Spark's own statistics are present.

SPARK-30262 (cited as motivation) was a read-side issue caused by Hive Metastore's internal behavior, not by users writing invalid values. The `.filter(_.nonEmpty)` fix was the appropriate approach.

The JIRA argues that Hive validates these properties, but the context is different. In Hive, `SET TBLPROPERTIES` with stats keys is a *specified operation*. This means validation is paired with a `STATS_GENERATED = USER` marker that tells the Metastore to treat the update as a user-initiated statistics change and update `COLUMN_STATS_ACCURATE` accordingly (see `AbstractAlterTablePropertiesAnalyzer` and `AlterTableSetPropertiesOperation` in Hive). Spark has none of this machinery. Adding validation alone would validate input for an operation that Spark doesn't actually support as a stats update mechanism.
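
For readers unfamiliar with the read-side approach mentioned above, here is a minimal sketch of the idea behind `.filter(_.nonEmpty)`, written in Python for brevity. It is not Spark's actual code: the function name `read_stat` and the sample property map are hypothetical, and it assumes the problematic values were empty strings, as the filter suggests. The point is that the reader tolerates an absent or empty value instead of the writer validating input.

```python
from typing import Mapping, Optional


def read_stat(props: Mapping[str, str], key: str) -> Optional[int]:
    """Return the statistic stored under `key`, or None if it is missing or empty.

    Hypothetical helper illustrating read-side tolerance; not a Spark API.
    """
    raw = props.get(key)
    if not raw:  # covers both a missing key and an empty string
        return None
    return int(raw)


# Hypothetical table properties as they might come back from a metastore.
props = {"numRows": "", "totalSize": "1024"}

print(read_stat(props, "numRows"))      # empty value is treated as "no statistic"
print(read_stat(props, "totalSize"))    # well-formed value is parsed
print(read_stat(props, "rawDataSize"))  # missing key is treated as "no statistic"
```

Under this design, an empty value is treated as "no statistic available", so nothing has to be rejected at write time.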
