[ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970306#comment-15970306 ]
Hyukjin Kwon commented on SPARK-14584:
--------------------------------------

[~joshrosen], it seems it now recognises the non-nullability, as below:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}

Could we resolve this?

> Improve recognition of non-nullability in Dataset transformations
> -----------------------------------------------------------------
>
>                 Key: SPARK-14584
>                 URL: https://issues.apache.org/jira/browse/SPARK-14584
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be
> null. For instance, a field in a case class with a primitive type will never
> be null. However, there are currently several cases in the Dataset API where
> we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when
> constructing serializer expressions in ExpressionEncoders.
> The following assertion will currently fail if added to ExpressionEncoder:
> {code}
> require(schema.size == serializer.size)
> schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
>   require(field.dataType == fieldSerializer.dataType, s"Field ${field.name}'s data type is " +
>     s"${field.dataType} in the schema but ${fieldSerializer.dataType} in its serializer")
>   require(field.nullable == fieldSerializer.nullable, s"Field ${field.name}'s nullability is " +
>     s"${field.nullable} in the schema but ${fieldSerializer.nullable} in its serializer")
> }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder
> allows for nullability, but occasionally we see a mismatch in the data types
> due to disagreements over the nullability of nested structs' fields (or
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a
> struct's schema we consider its fields' nullability to be independent of the
> nullability of the struct itself, whereas in the serializer expressions we
> consider those field extraction expressions to be nullable if the input
> objects themselves can be null.
> I'm not sure what's the simplest way to fix this. One proposal would be to
> leave the serializers unchanged and have ObjectOperator derive its output
> attributes from an explicitly-passed schema rather than using the
> serializers' attributes. However, I worry that this might introduce bugs in
> case the serializer and schema disagree.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
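
The disagreement the description contrasts can be modeled in a few lines of plain Scala. This is only a hypothetical sketch, not Spark's actual classes: {{Field}}, {{schemaNullable}}, and {{serializerNullable}} are invented names standing in for the two derivation strategies (schema reasoning treats a field's nullability as independent of its parent struct; serializer reasoning marks a field extraction nullable whenever the parent object may be null):

```scala
// Hypothetical minimal model of the two nullability-derivation strategies.
// None of these names exist in Spark; they only illustrate the mismatch.
case class Field(name: String, nullable: Boolean)

// Schema view: a field's nullability is independent of its parent struct.
def schemaNullable(field: Field, parentNullable: Boolean): Boolean =
  field.nullable

// Serializer view: extracting from a possibly-null parent is itself nullable.
def serializerNullable(field: Field, parentNullable: Boolean): Boolean =
  field.nullable || parentNullable

// A primitive-typed case-class field like `foo: Int` is non-nullable.
val foo = Field("foo", nullable = false)

// The two views agree when the parent cannot be null...
println(schemaNullable(foo, parentNullable = false))     // false
println(serializerNullable(foo, parentNullable = false)) // false

// ...but disagree as soon as the parent object may be null,
// which is exactly the mismatch the proposed assertion would catch.
println(schemaNullable(foo, parentNullable = true))      // false
println(serializerNullable(foo, parentNullable = true))  // true
```

Under this model, deriving output attributes from the schema (as the proposal suggests for ObjectOperator) and deriving them from the serializer expressions can yield different nullability for the same field, which is why the two need to be reconciled rather than trusted independently.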