Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3676#issuecomment-66719303 This LGTM, but would like to share some findings related to semantics of `COUNT(expr)`. It seems that Hive has a bug here, and Spark SQL behaves differently from Hive. The Hive language manual says [1] [1]: > count(expr) - Returns the number of rows for which the supplied expression is non-NULL but this doesn't conform to the following results (tested under Hive 0.13.1): ```sql -- The test table `src1(key INT, value STRING)` is the one we used in Spark SQL `TestHiveContext`. -- The table consists of 25 rows, among which 10 `key`s are `NULL`. CREATE TABLE src1(key INT, value STRING); LOAD DATA LOCAL INPATH 'data/files/kv3.txt' INTO TABLE src1; SELECT COUNT(key) FROM src1 WHERE key IS NOT NULL; -- => 15, reasonable SELECT COUNT(NULL) FROM src1; -- => 0, reasonable SELECT COUNT(1) FROM src1; -- => 25, reasonable, 1 is never `NULL` SELECT COUNT(key + 1) FROM src1; -- => 15, reasonable since `NULL + 1` is `NULL`. SELECT COUNT(key) FROM src1; -- => 25, huh? CREATE TABLE tmp AS SELECT CAST(key AS STRING), value FROM src1; SELECT COUNT(key) FROM tmp; -- => 15, hm... ``` I'm not sure whether Hive has something equivalent to the `StructField.nullable` field in Spark SQL, but it seems that it always assumes `INT` as not nullable even if the underlying data may contain `NULL`. And `COUNT(expr)` doesn't check the actual data for null when `expr` is a single column whose data type is not nullable. On the other hand, Spark SQL looks good. Here is a sample `hive/console` session: ```scala scala> sql("SELECT COUNT(key) FROM src1").collect() ... res2: Array[org.apache.spark.sql.Row] = Array([15]) // <- Reasonable scala> table("src1").printSchema root |-- key: integer (nullable = true) |-- value: string (nullable = true) ``` Notice that we consider all fields read from Hive Metastore nullable since data can be randomly dumped in without any validation. [1]: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org