Nullable is just a hint to the optimizer: when it is false, it is impossible
for there to be a null value in that column, so the optimizer can avoid
generating code for null checks.  When in doubt, we set nullable = true,
since it is always safer to check.

Why in particular are you trying to change the nullability of the column?
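If the goal is just to get the union to line up, one common workaround (a
sketch, not an official API; it assumes `df` is your in-memory DataFrame and
`spark` is your SparkSession) is to rebuild the schema with every field marked
nullable and re-apply it, rather than mapping over the data:

```scala
import org.apache.spark.sql.types.StructType

// Copy the existing schema, flipping every top-level field to nullable = true.
// (StructField is a case class, so copy() works; note this does not recurse
// into nested struct fields or array containsNull flags.)
val relaxedSchema = StructType(df.schema.map(_.copy(nullable = true)))

// Re-create the DataFrame with the relaxed schema, without running any UDF.
val relaxedDf = spark.createDataFrame(df.rdd, relaxedSchema)
```

Relaxing toward nullable = true is the safe direction: the only cost is a
missed optimization.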

On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:

> Hello there,
>
> I am trying to understand how and when a DataFrame (or Dataset) sets
> nullable = true vs. false on a schema.
>
> Here is my observation from a sample code I tried...
>
>
> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
> 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
> lit("bla")).printSchema()
> root
>  |-- col1: integer (nullable = false)
>  |-- col2: string (nullable = true)
>  |-- col3: double (nullable = false)
>  |-- col4: string (nullable = false)
>
>
> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
> 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
> lit("bla")).write.parquet("/tmp/sample.parquet")
>
> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
> root
>  |-- col1: integer (nullable = true)
>  |-- col2: string (nullable = true)
>  |-- col3: double (nullable = true)
>  |-- col4: string (nullable = true)
>
>
> The place where this seems to get me into trouble is when I try to union
> one data-structure created in memory (notice that in the schema above the
> corresponding element is marked 'false' for the in-memory created schema)
> with one from a file that starts out with a schema like below...
>
>  |-- some_histogram: struct (nullable = true)
>  |    |-- values: array (nullable = true)
>  |    |    |-- element: double (containsNull = true)
>  |    |-- freq: array (nullable = true)
>  |    |    |-- element: long (containsNull = true)
>
> Is there a way to convert this attribute from true to false without
> running any mapping / udf on that column?
>
> Please advise,
> Muthu
>
