What is the issue you see when unioning?

On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
> Hello Michael,
>
> Thank you for looking into this query. In my case there seems to be an
> issue when I union a parquet file read from disk with another dataframe
> that I construct in memory. The only difference I see is the containsNull =
> true. In fact, I do not see any errors with union on the simple schema of
> "col1 thru col4" above. The problem seems to exist only on that
> "some_histogram" column, which contains the mixed containsNull = true/false.
> Let me know if this helps.
>
> Thanks,
> Muthu
>
> On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Nullable is just a hint to the optimizer that it's impossible for there to
>> be a null value in this column, so that it can avoid generating code for
>> null-checks. When in doubt, we set nullable=true since it is always safer
>> to check.
>>
>> Why in particular are you trying to change the nullability of the column?
>>
>> On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com>
>> wrote:
>>
>>> Hello there,
>>>
>>> I am trying to understand how and when a DataFrame (or Dataset) sets
>>> nullable = true vs. false on a schema.
>>>
>>> Here is my observation from some sample code I tried...
>>>
>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).printSchema()
>>> root
>>>  |-- col1: integer (nullable = false)
>>>  |-- col2: string (nullable = true)
>>>  |-- col3: double (nullable = false)
>>>  |-- col4: string (nullable = false)
>>>
>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).write.parquet("/tmp/sample.parquet")
>>>
>>> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
>>> root
>>>  |-- col1: integer (nullable = true)
>>>  |-- col2: string (nullable = true)
>>>  |-- col3: double (nullable = true)
>>>  |-- col4: string (nullable = true)
>>>
>>> The place where this seems to get me into trouble is when I try to union
>>> one data structure from memory (notice that in the schema below the
>>> highlighted element is represented as 'false' for the in-memory schema)
>>> and one from a file that starts out with a schema like the one below...
>>>
>>>  |-- some_histogram: struct (nullable = true)
>>>  |    |-- values: array (nullable = true)
>>>  |    |    |-- element: double (containsNull = true)
>>>  |    |-- freq: array (nullable = true)
>>>  |    |    |-- element: long (containsNull = true)
>>>
>>> Is there a way to convert this attribute from true to false without
>>> running any mapping / udf on that column?
>>>
>>> Please advise,
>>> Muthu
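
A possible workaround for the mismatch described in the original message, offered as a sketch rather than something confirmed in this thread: instead of forcing the file-backed schema from true to false, relax the in-memory schema so that every field and array element is nullable, which matches what the parquet reader reports. The helper names relaxNullability and withRelaxedSchema are hypothetical (they are not part of Spark). The sketch assumes the union is rejected only because of the nullable / containsNull difference. It avoids a per-column udf, but it does round-trip through df.rdd, which has a cost.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper: recursively mark every struct field, array element,
// and map value as nullable. Leaf types are returned unchanged.
def relaxNullability(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map { f =>
      f.copy(dataType = relaxNullability(f.dataType), nullable = true)
    })
  case ArrayType(elementType, _) =>
    ArrayType(relaxNullability(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(relaxNullability(keyType), relaxNullability(valueType),
      valueContainsNull = true)
  case other => other
}

// Hypothetical helper: rebuild the DataFrame over the same rows with the
// relaxed schema. Relaxing (false -> true) is the safe direction; declaring
// a column non-nullable when it actually holds nulls is not.
def withRelaxedSchema(spark: SparkSession, df: DataFrame): DataFrame = {
  val relaxed = relaxNullability(df.schema).asInstanceOf[StructType]
  spark.createDataFrame(df.rdd, relaxed)
}
```

With both sides passed through withRelaxedSchema, the two schemas should agree on nullability, so something like withRelaxedSchema(spark, inMemoryDf).union(fromDiskDf) can be tried (inMemoryDf and fromDiskDf are placeholder names).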