[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19707

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150069049

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --

    Thanks!
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150064768

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --

    It doesn't matter; I just need a DataFrame. I can even use `Seq(Some(1), None).toDF`, but using parquet is more convincing.
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150027756

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --

    Is this PR orthogonal to the data source format? Could you test more data sources, like `JSON`, here?
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150026840

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
    @@ -134,7 +134,13 @@ object ScalaReflection extends ScalaReflection {
         val tpe = localTypeOf[T]
         val clsName = getClassNameFromType(tpe)
         val walkedTypePath = s"""- root class: "$clsName"""" :: Nil
    -    deserializerFor(tpe, None, walkedTypePath)
    +    val expr = deserializerFor(tpe, None, walkedTypePath)
    +    val Schema(_, nullable) = schemaFor(tpe)
    +    if (nullable) {
    +      expr
    +    } else {
    +      AssertNotNull(expr, walkedTypePath)
    +    }
    --- End diff --

    Hi, @cloud-fan. It looks great. Can we add a test case in `ScalaReflectionSuite`, too?
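The shape of the change quoted above can be sketched outside Spark as follows. This is a hypothetical mini-model: `Expr`, `InputExpr`, and `topLevelDeserializer` are stand-ins invented for illustration, and only the `AssertNotNull` name mirrors the real Catalyst expression; the point is just that the deserializer expression is wrapped in a null check when the target type is non-nullable (i.e. a primitive).

```scala
// Hypothetical mini-model of the PR's top-level wrapping logic (not Spark code).
sealed trait Expr { def eval(input: Any): Any }

// Stand-in for the expression returned by deserializerFor.
case object InputExpr extends Expr {
  def eval(input: Any): Any = input
}

// Mirrors the role of Catalyst's AssertNotNull: fail fast on a null child value.
case class AssertNotNull(child: Expr, walkedTypePath: Seq[String]) extends Expr {
  def eval(input: Any): Any = {
    val v = child.eval(input)
    if (v == null) {
      throw new NullPointerException(
        "Null value appeared in non-nullable field:\n" + walkedTypePath.mkString("\n"))
    }
    v
  }
}

// The rule added by the PR: only non-nullable (primitive) targets get the check.
def topLevelDeserializer(nullable: Boolean, walkedTypePath: Seq[String]): Expr = {
  val expr = InputExpr
  if (nullable) expr else AssertNotNull(expr, walkedTypePath)
}
```

For a nullable target (e.g. `java.lang.Integer` or a `case class`), nulls pass through unchanged; for a non-nullable target (e.g. `Int`), evaluating a null input now throws instead of silently producing a default value.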
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150024246

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --

    not a big deal, but `toDF("i")` is more explicit about the column name.
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150015018

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --

    nit: `toDF()` also works.
[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19707

    [SPARK-22472][SQL] add null check for top-level primitive values

    ## What changes were proposed in this pull request?

    One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do runtime null checks automatically. For example, say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values.

    However, the null checking is left out for top-level primitive values. For example, say we have a parquet file with schema `<a: int>`, and we read it into a Scala `Int`. If column `a` has null values, we get some weird results.

    ```
    scala> val ds = spark.read.parquet(...).as[Int]

    scala> ds.show()
    +----+
    |   v|
    +----+
    |null|
    |   1|
    +----+

    scala> ds.collect
    res0: Array[Long] = Array(0, 1)

    scala> ds.map(_ * 2).show
    +-----+
    |value|
    +-----+
    |   -2|
    |    2|
    +-----+
    ```

    This is because internally Spark uses some special default values for primitive types, but never expects users to see/operate on these default values directly.

    This PR adds a null check for top-level primitive values.

    ## How was this patch tested?

    new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19707.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19707

----

commit dad50806b27a40ed1112d8ee29b3bd5c60164170
Author: Wenchen Fan
Date:   2017-11-09T13:39:10Z

    add null check for top-level primitive values
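The "special default values" behavior described in the PR can be illustrated with a hypothetical mini-model. `MiniRow`, `getInt`, and `getIntChecked` below are invented for this sketch and are not Spark's actual `InternalRow` API: reading a null slot as a primitive silently yields the type's default value, while the null check added by this PR makes such a read fail fast with an NPE instead.

```scala
// Hypothetical mini-model (NOT Spark's InternalRow API) of the bug and the fix:
// a null slot read as a primitive silently becomes a default value, unless a
// null check rejects the read up front.
case class MiniRow(values: Array[Any]) {
  def isNullAt(i: Int): Boolean = values(i) == null

  // Pre-fix behavior (sketch): null quietly becomes the Int default, 0.
  def getInt(i: Int): Int =
    if (isNullAt(i)) 0 else values(i).asInstanceOf[Int]

  // Post-fix behavior (sketch): a top-level null check fails fast instead.
  def getIntChecked(i: Int): Int =
    if (isNullAt(i)) {
      throw new NullPointerException("Null value appeared in non-nullable field")
    } else {
      values(i).asInstanceOf[Int]
    }
}

object MiniRowDemo extends App {
  val row = MiniRow(Array(null, 1))
  println(row.getInt(0)) // the "weird result": a default value instead of null
  println(row.getInt(1))
  // row.getIntChecked(0) would throw NullPointerException, surfacing the bad data
}
```

This mirrors the behavior change users observe: before the PR, `ds.collect` on nulls returns defaults like `0`; after it, the read throws so that corrupt results never reach user code.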