Hello Michael,

Thank you for looking into this query. In my case there seems to be an issue when I union a DataFrame read from a Parquet file on disk with another DataFrame that I construct in memory. The only difference I see between the two schemas is the containsNull flag (true on one side, false on the other). In fact, I do not see any errors with union on the simple schema of "col1" through "col4" above. The problem seems to exist only on the "some_histogram" column, which contains the mixed containsNull = true/false. Let me know if this helps.
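To make the schema mismatch concrete, here is a minimal sketch of one way to align the two sides before the union, assuming it is acceptable to relax the in-memory DataFrame to nullable = true / containsNull = true (the safer direction, since the Parquet side already reports true). The names NullabilitySketch, relax and withRelaxedSchema are hypothetical and not part of Spark; only the types in org.apache.spark.sql.types and SparkSession.createDataFrame(rowRDD, schema) are existing API. Note that this goes through df.rdd, so it is not free and may not satisfy the "no mapping" wish in the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object NullabilitySketch {
  // Hypothetical helper (not a Spark API): recursively mark every struct
  // field, array element and map value as nullable.
  def relax(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = relax(f.dataType), nullable = true)))
    case ArrayType(elementType, _) =>
      ArrayType(relax(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(relax(keyType), relax(valueType), valueContainsNull = true)
    case other => other // atomic types carry no nested nullability flags
  }

  // Re-applies the relaxed schema to the same rows. This converts to an
  // RDD[Row] and back, so it has a cost.
  def withRelaxedSchema(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, relax(df.schema).asInstanceOf[StructType])
}
```

With this, something like NullabilitySketch.withRelaxedSchema(spark, inMemoryDf).union(spark.read.parquet("/tmp/sample.parquet")) would give both sides the same nullability flags; inMemoryDf is a placeholder for the DataFrame built in memory.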

Thanks,
Muthu

On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Nullable is just a hint to the optimizer that it's impossible for there to
> be a null value in this column, so that it can avoid generating code for
> null-checks. When in doubt, we set nullable=true since it is always safer
> to check.
>
> Why in particular are you trying to change the nullability of the column?
>
> On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com>
> wrote:
>
>> Hello there,
>>
>> I am trying to understand how and when a DataFrame (or Dataset) sets
>> nullable = true vs false on a schema.
>>
>> Here is my observation from some sample code I tried...
>>
>>
>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
>> 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
>> lit("bla")).printSchema()
>> root
>>  |-- col1: integer (nullable = false)
>>  |-- col2: string (nullable = true)
>>  |-- col3: double (nullable = false)
>>  |-- col4: string (nullable = false)
>>
>>
>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
>> 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
>> lit("bla")).write.parquet("/tmp/sample.parquet")
>>
>> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
>> root
>>  |-- col1: integer (nullable = true)
>>  |-- col2: string (nullable = true)
>>  |-- col3: double (nullable = true)
>>  |-- col4: string (nullable = true)
>>
>>
>> The place where this seems to get me into trouble is when I try to union
>> one data structure from in-memory (notice that in the schema below the
>> highlighted element is represented as 'false' for the in-memory created
>> schema) and one from a file that starts out with a schema like the one
>> below...
>>
>>  |-- some_histogram: struct (nullable = true)
>>  |    |-- values: array (nullable = true)
>>  |    |    |-- element: double (containsNull = true)
>>  |    |-- freq: array (nullable = true)
>>  |    |    |-- element: long (containsNull = true)
>>
>> Is there a way to convert this attribute from true to false without
>> running any mapping / udf on that column?
>>
>> Please advise,
>> Muthu