[GitHub] spark pull request: [SPARK-4829] [SQL] add rule to fold count(expr...

liancheng Thu, 11 Dec 2014 17:27:06 -0800

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/3676#issuecomment-66719303
  
    This LGTM, but I would like to share some findings about the semantics of 
`COUNT(expr)`. It seems that Hive has a bug here, and Spark SQL behaves 
differently from Hive.
    
    The Hive language manual [1] says:
    
    > count(expr) - Returns the number of rows for which the supplied 
expression is non-NULL
    
    but this doesn't conform to the following results (tested under Hive 
0.13.1):
    
    ```sql
    -- The test table `src1(key INT, value STRING)` is the one used by Spark SQL's
    -- `TestHiveContext`. It has 25 rows, 10 of which have a `NULL` key.
    
    CREATE TABLE src1(key INT, value STRING);
    LOAD DATA LOCAL INPATH 'data/files/kv3.txt' INTO TABLE src1;
    
    SELECT COUNT(key) FROM src1
    WHERE key IS NOT NULL;              -- => 15, reasonable
    
    SELECT COUNT(NULL) FROM src1;       -- => 0, reasonable
    
    SELECT COUNT(1) FROM src1;          -- => 25, reasonable, 1 is never `NULL`
    
    SELECT COUNT(key + 1) FROM src1;    -- => 15, reasonable since `NULL + 1` is `NULL`
    
    SELECT COUNT(key) FROM src1;        -- => 25, huh?
    
    CREATE TABLE tmp AS
    SELECT CAST(key AS STRING) AS key, value
    FROM src1;
    
    SELECT COUNT(key) FROM tmp;         -- => 15, hm...
    ```
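    
    For reference, the documented semantics can be modelled in a few lines of plain Scala. This is only a sketch: `CountSemantics`, `countExpr` and `keys` are hypothetical names, and `keys` is synthetic data with the same shape as `src1.key` (15 non-NULL values, 10 `NULL`s), not the contents of `kv3.txt`.
    
    ```scala
    object CountSemantics {
      // COUNT(expr): the number of rows for which the evaluated expression is non-NULL.
      // None stands in for SQL NULL.
      def countExpr[A, B](rows: Seq[A])(expr: A => Option[B]): Int =
        rows.count(row => expr(row).isDefined)
    
      // Hypothetical stand-in for src1.key: 15 non-NULL keys followed by 10 NULLs.
      val keys: Seq[Option[Int]] =
        (1 to 15).map(i => Option(i)) ++ Seq.fill(10)(Option.empty[Int])
    
      val countKey: Int        = countExpr(keys)(k => k)                   // COUNT(key)
      val countKeyPlusOne: Int = countExpr(keys)(k => k.map(_ + 1))        // COUNT(key + 1)
      val countOne: Int        = countExpr(keys)(_ => Option(1))           // COUNT(1)
      val countNull: Int       = countExpr(keys)(_ => Option.empty[Int])   // COUNT(NULL)
    }
    ```
    
    Under this model `COUNT(key)` and `COUNT(key + 1)` both give 15, `COUNT(1)` gives 25 and `COUNT(NULL)` gives 0, so the 25 that Hive returns for `COUNT(key)` is the only result that disagrees with the manual.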
    
    I'm not sure whether Hive has something equivalent to the 
`StructField.nullable` field in Spark SQL, but it seems to always treat `INT` 
columns as non-nullable, even if the underlying data may contain `NULL`. As a 
result, `COUNT(expr)` doesn't check the actual data for `NULL` when `expr` is a 
single column whose data type is considered non-nullable.
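    
    To make the hypothesis above concrete, here is a toy model of a count that trusts nullability metadata instead of looking at the data. It is a sketch under assumptions: `Expr`, `Column`, `Literal`, `count` and `foldedCount` are hypothetical and are neither Catalyst's classes nor Hive's implementation, and the rows are synthetic.
    
    ```scala
    object FoldCountSketch {
      // Minimal, hypothetical expression tree carrying a nullability flag.
      sealed trait Expr { def nullable: Boolean }
      case class Column(name: String, nullable: Boolean) extends Expr
      case class Literal(value: Option[Int]) extends Expr {
        def nullable: Boolean = value.isEmpty
      }
    
      type Row = Map[String, Option[Int]]
    
      def eval(e: Expr, row: Row): Option[Int] = e match {
        case Column(name, _) => row.getOrElse(name, Option.empty[Int])
        case Literal(value)  => value
      }
    
      // Plain COUNT(expr): inspects the actual data.
      def count(e: Expr, rows: Seq[Row]): Int =
        rows.count(r => eval(e, r).isDefined)
    
      // Folded COUNT(expr): when the metadata claims the expression is never NULL,
      // skip the per-row check and count every row.
      def foldedCount(e: Expr, rows: Seq[Row]): Int =
        if (!e.nullable) rows.size else count(e, rows)
    
      // Synthetic data: 15 rows with a key, 10 rows with a NULL key.
      val rows: Seq[Row] =
        (1 to 15).map(i => Map("key" -> Option(i))) ++
          Seq.fill(10)(Map("key" -> Option.empty[Int]))
    
      val withCorrectMetadata: Int = foldedCount(Column("key", nullable = true), rows)
      val withWrongMetadata: Int   = foldedCount(Column("key", nullable = false), rows)
    }
    ```
    
    With correct metadata the folded count is 15; with `key` wrongly flagged as non-nullable it becomes 25, the same number Hive reports. The shortcut is only safe when the nullability flag can be trusted.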
    
    On the other hand, Spark SQL returns the expected result. Here is a sample 
`hive/console` session:
    
    ```scala
    scala> sql("SELECT COUNT(key) FROM src1").collect()
    ...
    res2: Array[org.apache.spark.sql.Row] = Array([15])     // <- Reasonable
    
    scala> table("src1").printSchema
    root
     |-- key: integer (nullable = true)
     |-- value: string (nullable = true)
    ```
    
    Note that we treat all fields read from the Hive Metastore as nullable, 
since data can be loaded into a table without any validation.
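    
    The idea of ignoring the nullability reported by an external catalog can be sketched as follows. `Field` is a hypothetical stand-in that only loosely mirrors the shape of `StructField`, and `asNullable` is an illustrative helper, not the code Spark SQL actually uses.
    
    ```scala
    object MetastoreSchemaSketch {
      // Hypothetical schema field: name, type name and nullability flag.
      case class Field(name: String, dataType: String, nullable: Boolean)
    
      // Mark every field nullable, regardless of what the catalog reported.
      def asNullable(schema: Seq[Field]): Seq[Field] =
        schema.map(_.copy(nullable = true))
    
      // Assumed input: a catalog that claims `key` can never be NULL.
      val reported: Seq[Field] = Seq(
        Field("key", "integer", nullable = false),
        Field("value", "string", nullable = true))
    
      val effective: Seq[Field] = asNullable(reported)
    }
    ```
    
    Names and types are left untouched; only the flag changes, so any optimization that depends on non-nullability is simply never triggered for such columns.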
    
    [1]: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



