GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19707
[SPARK-22472][SQL] add null check for top-level primitive values

## What changes were proposed in this pull request?

One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and have null checks performed automatically at runtime. For example, say we have a parquet file with schema `<a: int, b: string>` and a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` contains null values.

However, this null check is missing for top-level primitive values. For example, say we have a parquet file with schema `<a: Int>` and we read it into a Scala `Int`. If column `a` contains null values, we get some weird results:

```
scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+
```

This is because Spark internally uses special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.

## How was this patch tested?

new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19707

----
commit dad50806b27a40ed1112d8ee29b3bd5c60164170
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-11-09T13:39:10Z

    add null check for top-level primitive values

----
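
The top-level primitive problem described in this pull request can be illustrated in plain Scala, without Spark. The sketch below is not the patch itself: `TopLevelNullCheck`, `unsafeToInt`, and `checkedToInt` are hypothetical names introduced only for illustration, and the sketch does not reproduce Spark's internal default values (which is why it cannot show the `-2` result above). It only shows how a null silently becomes a primitive default when unboxed, and how an explicit check fails fast instead.

```scala
import scala.util.Try

// Hypothetical illustration only; none of these names exist in Spark.
object TopLevelNullCheck {
  // Unboxing a null reference into a primitive Int silently yields 0 in Scala.
  def unsafeToInt(value: Any): Int = value.asInstanceOf[Int]

  // Fail fast on null, similar in spirit to the NPE Spark throws for
  // null values in primitive case class fields.
  def checkedToInt(value: Any): Int = {
    if (value == null) {
      throw new NullPointerException("null value found for a top-level primitive Int")
    }
    value.asInstanceOf[Int]
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a column that contains a null and a 1.
    val column: Seq[Any] = Seq(null, 1)

    // Without a check, the null is turned into a default value.
    println(column.map(unsafeToInt)) // List(0, 1)

    // With a check, the conversion is rejected instead of producing bogus data.
    println(Try(column.map(checkedToInt)).isFailure) // true
  }
}
```

Failing with an exception keeps the behavior consistent with the case class example: a default value leaking into user code is indistinguishable from real data, while an NPE points directly at the offending null.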