Hi.

I’ve spent the last couple of hours trying to chase down an issue with 
writing and reading parquet files.  I was trying to save (and then read back 
in) a parquet file with a schema that correctly records which of my columns 
are non-nullable.  After having no success for some time, I posted to Stack 
Overflow about it:

https://stackoverflow.com/q/72877780/1212960

If you read the question, you’ll see that I can trace the schema being written 
with the correct (non-)nullability information, and I can even see that the 
correct nullability information is found when the parquet file is read back 
in.  However, by the time the data frame is handed back to me, the 
(non-)nullability information has been thrown away.
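
For reference, here’s a minimal Scala sketch of the behaviour (simplified, 
with a made-up /tmp path rather than my real code and data):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // A schema that explicitly marks the column as non-nullable.
    val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1L), Row(2L))),
      schema
    )
    df.printSchema()                 // id: long (nullable = false)

    df.write.mode("overwrite").parquet("/tmp/nullability-test")

    // The file records the column as "required", but the DataFrame I get
    // back reports it as nullable again.
    val readBack = spark.read.parquet("/tmp/nullability-test")
    readBack.printSchema()           // id: long (nullable = true)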

Only after posting the question did I find this sentence in the Parquet Files 
documentation 
(https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html): 

When reading Parquet files, all columns are automatically converted to be 
nullable for compatibility reasons.

It seems odd to me that:

 - Spark tracks nullable and non-nullable columns
 - Parquet tracks nullable and non-nullable columns
 - Spark can write a parquet file with correctly annotated non-nullable columns
 - Spark can correctly read a parquet file and identify non-nullable columns

… and yet, Spark deliberately discards this information.  

I understand (from this article: 
https://medium.com/@weshoffman/apache-spark-parquet-and-troublesome-nulls-28712b06f836#:~:text=A%20column%E2%80%99s%20nullable%20characteristic%20is%20a%20contract%20with%20the%20Catalyst%20Optimizer%20that%20null%20data%20will%20not%20be%20produced)
 that Spark itself doesn’t enforce the non-nullability; it’s simply 
information used by the Catalyst optimiser.  And I also understand that Bad 
Things™ would happen if null values were inserted into a column the schema 
says is non-nullable.

But I don’t understand why I can’t accept those caveats and have Spark retain 
my schema properly, with the non-nullability information maintained.  What are 
these compatibility reasons that the documentation alludes to?

Is this behaviour configurable?  How can I create a dataframe from a parquet 
file that uses the schema as-is (i.e. honouring the nullability information)?  
I want the Catalyst optimiser to have this information available, and I control 
my data to ensure that it meets the nullability constraints.
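
The only workaround I can see so far is to rebuild the DataFrame against my 
own schema after the read, along the lines of the sketch below (continuing 
from the snippet above):

    // Re-apply the schema I originally wrote, which restores nullable = false,
    // but forces a detour through the RDD API just to put back information
    // that was in the file all along.
    val restored = spark.createDataFrame(readBack.rdd, schema)
    restored.printSchema()           // id: long (nullable = false)

Is there a cleaner way to do this, ideally one that just trusts the schema 
stored in the file?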

Thanks for your time.

Kindest regards,

—
Greg

