GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19707

    [SPARK-22472][SQL] add null check for top-level primitive values

    ## What changes were proposed in this pull request?
    
    One powerful feature of `Dataset` is that we can easily map SQL rows to 
Scala/Java objects and have null checks performed automatically at runtime.
    
    For example, let's say we have a parquet file with schema `<a: int, b: 
string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily 
read this parquet file into `Data` objects, and Spark will throw an NPE if column 
`a` contains null values, because a Scala `Int` cannot hold null.
    
    However, this null check is missing for top-level primitive values. 
For example, let's say we have a parquet file with schema `<a: int>`, and we 
read it into Scala `Int`. If column `a` has null values, no error is raised and 
the nulls silently turn into arbitrary values:
    ```
    scala> val ds = spark.read.parquet(...).as[Int]
    
    scala> ds.show()
    +----+
    |v   |
    +----+
    |null|
    |1   |
    +----+
    
    scala> ds.collect
    res0: Array[Int] = Array(0, 1)
    
    scala> ds.map(_ * 2).show
    +-----+
    |value|
    +-----+
    |-2   |
    |2    |
    +-----+
    ```
    
    This is because Spark internally uses special default values for 
primitive types, but never expects users to see or operate on these default 
values directly.
    
    This PR adds a null check for top-level primitive values.
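    
    The idea can be illustrated with a small, Spark-free sketch. Everything in it 
    is hypothetical (`IntColumnSlot`, `TopLevelNullCheck`, the `-1` placeholder and 
    the error message); it is a toy model of a primitive slot with a null flag, not 
    Spark's internal row format or the code in this patch. It only shows why an 
    unchecked read leaks a placeholder value and how a checked read fails fast instead.
    
    ```scala
    import scala.util.Try

    // Toy model (an assumption, not Spark's row format): a primitive slot plus a null flag.
    final case class IntColumnSlot(rawValue: Int, isNull: Boolean)

    object TopLevelNullCheck {
      // Hypothetical placeholder stored in the primitive slot when the value is null.
      val Placeholder: Int = -1

      def fromNullable(v: Option[Int]): IntColumnSlot = v match {
        case Some(x) => IntColumnSlot(x, isNull = false)
        case None    => IntColumnSlot(Placeholder, isNull = true)
      }

      // Unchecked read: ignores the null flag, so the placeholder leaks to the caller.
      def readUnchecked(slot: IntColumnSlot): Int = slot.rawValue

      // Checked read: throws instead of exposing the placeholder.
      def readChecked(slot: IntColumnSlot): Int =
        if (slot.isNull)
          throw new NullPointerException("Null value appeared in a top-level primitive field")
        else slot.rawValue

      def main(args: Array[String]): Unit = {
        val column = Seq[Option[Int]](None, Some(1)).map(fromNullable)
        // The null row is silently computed from the placeholder.
        println(column.map(readUnchecked).map(_ * 2)) // List(-2, 2)
        // With the check, the same read fails; Try captures the NullPointerException.
        println(Try(column.map(readChecked)).isFailure) // true
      }
    }
    ```
    
    In this sketch the caller of `readChecked` gets an explicit error for the null 
    row, which mirrors the behavior users already see for primitive fields inside a 
    case class.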
    
    ## How was this patch tested?
    
    new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19707
    
----
commit dad50806b27a40ed1112d8ee29b3bd5c60164170
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-11-09T13:39:10Z

    add null check for top-level primitive values

----


---

