[ https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149073#comment-17149073 ]
Jason Moore commented on SPARK-32136:
-------------------------------------

Here is a similar test, and why it's a problem for what I need to do:

{noformat}
case class C(d: Double)
case class B(c: Option[C])
case class A(b: Option[B])
val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(C(1.0)))))
).toDF
val res = df.groupBy("b").agg(count("*"))

> res.show
+-------+--------+
|      b|count(1)|
+-------+--------+
|   [[]]|       2|
|[[1.0]]|       1|
+-------+--------+

> res.as[(Option[B], Long)].collect
java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Double", name: "d")
- option value class: "C"
- field (class: "scala.Option", name: "c")
- option value class: "B"
- field (class: "scala.Option", name: "_1")
- root class: "scala.Tuple2"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
newInstance(class scala.Tuple2)
{noformat}

Interestingly, and possibly useful to know: using an Int instead of a Double above works as expected.

> Spark producing incorrect groupBy results when key is a struct with nullable properties
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-32136
>                 URL: https://issues.apache.org/jira/browse/SPARK-32136
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Moore
>            Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm
> noticing a behaviour change in a particular aggregation we're doing, and I
> think I've tracked it down to how Spark is now treating nullable properties
> within the column being grouped by.
>
> Here's a simple test I've been able to set up to repro it:
>
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
>   A(None),
>   A(Some(B(None))),
>   A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in b as `[null]`; that is, an
> instance of B with a null value for the `c` property instead of a null for
> the overall value itself.
> Is this an intended change?
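
For reference, the grouping the report treats as expected (a missing `b` kept as its own key, separate from a `B` whose `c` is missing) is also what plain Scala collections produce for the same three rows. The sketch below uses no Spark at all, so it says nothing about Spark's internals; it only illustrates the three distinct keys. The object name `GroupingReference` is hypothetical and not part of the original report.

```scala
// Plain Scala collections only; no Spark dependency.
// Case classes mirror the ones in the issue description.
case class B(c: Option[Double])
case class A(b: Option[B])

// Hypothetical wrapper object, introduced only for this illustration.
object GroupingReference {
  val rows: Seq[A] = Seq(
    A(None),
    A(Some(B(None))),
    A(Some(B(Some(1.0))))
  )

  // Group by the optional struct and count the rows under each key.
  val counts: Map[Option[B], Int] =
    rows.groupBy(_.b).map { case (key, group) => key -> group.size }

  def main(args: Array[String]): Unit =
    counts.foreach { case (key, n) => println(s"$key -> $n") }
}
```

Here `None`, `Some(B(None))` and `Some(B(Some(1.0)))` are three separate keys with one row each, which lines up with the Spark 2.4.6 output quoted above rather than the 3.0.0 output.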