[ https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan updated SPARK-19980:
--------------------------------
    Fix Version/s: 2.1.1

> Basic Dataset transformation on POJOs does not preserve nulls.
> ---------------------------------------------------------------
>
>                 Key: SPARK-19980
>                 URL: https://issues.apache.org/jira/browse/SPARK-19980
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Michel Lemay
>            Assignee: Takeshi Yamamuro
>             Fix For: 2.1.1, 2.2.0
>
> Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result.
> Given POJOs:
> {code}
> public class Stuff implements Serializable {
>   private String name;
>
>   public void setName(String name) { this.name = name; }
>   public String getName() { return name; }
> }
>
> public class Outer implements Serializable {
>   private String name;
>   private Stuff stuff;
>
>   public void setName(String name) { this.name = name; }
>   public String getName() { return name; }
>
>   public void setStuff(Stuff stuff) { this.stuff = stuff; }
>   public Stuff getStuff() { return stuff; }
> }
> {code}
> Produces the result:
> {code}
> scala> val encoder = Encoders.bean(classOf[Outer])
> encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct<name:string>]
>
> scala> val schema = encoder.schema
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true))
>
> scala> schema.printTreeString
> root
>  |-- name: string (nullable = true)
>  |-- stuff: struct (nullable = true)
>  |    |-- name: string (nullable = true)
>
> scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
> df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct<name: string>]
>
> scala> df.show()
> +----+-----+
> |name|stuff|
> +----+-----+
> |  v1| null|
> +----+-----+
>
> scala> df.map(x => x)(encoder).show()
> +----+------+
> |name| stuff|
> +----+------+
> |  v1|[null]|
> +----+------+
> {code}
> After the identity transformation, `stuff` becomes an object with null values inside it instead of staying null itself.
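> The same discrepancy can be checked programmatically. This is a sketch (not verified output), reusing the `schema`, `encoder`, and `stuff.json` from above; based on the `show()` results, the first assertion should pass and the second should fail on 2.1.0:
> {code}
> val before = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
> assert(before.head.getStuff == null)   // passes: the JSON null survives the read
>
> val after = before.map(x => x)(encoder)
> assert(after.head.getStuff == null)    // fails: getStuff returns a Stuff whose name is null
> {code}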
> Doing the same with case classes preserves the nulls:
> {code}
> scala> case class ScalaStuff(name: String)
> defined class ScalaStuff
>
> scala> case class ScalaOuter(name: String, stuff: ScalaStuff)
> defined class ScalaOuter
>
> scala> val encoder2 = Encoders.product[ScalaOuter]
> encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, stuff[0]: struct<name:string>]
>
> scala> val schema2 = encoder2.schema
> schema2: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true))
>
> scala> schema2.printTreeString
> root
>  |-- name: string (nullable = true)
>  |-- stuff: struct (nullable = true)
>  |    |-- name: string (nullable = true)
>
> scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter]
> df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: struct<name: string>]
>
> scala> df2.show()
> +----+-----+
> |name|stuff|
> +----+-----+
> |  v1| null|
> +----+-----+
>
> scala> df2.map(x => x).show()
> +----+-----+
> |name|stuff|
> +----+-----+
> |  v1| null|
> +----+-----+
> {code}
> stuff.json:
> {code}
> {"name":"v1", "stuff":null }
> {code}
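> To make the reproduction self-contained, the input file can be written from the same shell (a convenience sketch; any path works as long as it matches the reads above):
> {code}
> import java.nio.file.{Files, Paths}
>
> // write the one-line test input used by both repros
> Files.write(Paths.get("stuff.json"), """{"name":"v1", "stuff":null }""".getBytes("UTF-8"))
> {code}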