Hi. I’ve spent the last couple of hours chasing down an issue with writing and reading parquet files. I was trying to save (and then read back in) a parquet file with a schema that correctly marks certain columns as non-nullable. After having no success for some time, I posted a question to Stack Overflow:
https://stackoverflow.com/q/72877780/1212960

If you read the question, you’ll see that I can trace writing a schema with the correct (non-)nullability information. I can even see that, when the parquet file is read back in, the correct nullability information is found. However, by the time the data frame is handed back to me, the (non-)nullability information has been thrown away.

Only after posting the question did I find this sentence in the Parquet Files documentation (https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html):

    When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

It seems odd to me that:

- Spark tracks nullable and non-nullable columns
- Parquet tracks nullable and non-nullable columns
- Spark can write a parquet file with correctly annotated non-nullable columns
- Spark can correctly read a parquet file and identify non-nullable columns

… and yet Spark deliberately discards this information.

I understand (from this article: https://medium.com/@weshoffman/apache-spark-parquet-and-troublesome-nulls-28712b06f836#:~:text=A%20column%E2%80%99s%20nullable%20characteristic%20is%20a%20contract%20with%20the%20Catalyst%20Optimizer%20that%20null%20data%20will%20not%20be%20produced) that Spark itself doesn’t enforce non-nullability; it is simply information used by the Catalyst optimiser. And I also understand that Bad Things™ would happen if you inserted null values where the schema says nulls aren’t allowed. But I don’t understand why I can’t accept those caveats and have Spark retain my schema as written, with the non-nullability information intact.

So my questions are:

- What are these compatibility reasons that the documentation alludes to?
- Is this behaviour configurable?
- How can I create a dataframe from a parquet file that uses the schema as-is (i.e. honouring the nullability information)?

I want the Catalyst optimiser to have this information available, and I control my data, so I can ensure it meets the nullability constraints. (I’ve put a minimal repro, and the closest thing I have to a workaround, in the P.S. below.)

Thanks for your time.

Kindest regards,
— Greg
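
P.S. For concreteness, here’s a minimal sketch of what I’m seeing (Scala; the path and column names are just placeholders):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("nullability-repro").getOrCreate()

    // One explicitly non-nullable column, one nullable column.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("label", StringType, nullable = true)
    ))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1L, "a"), Row(2L, null))),
      schema
    )
    df.schema.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
    // id: nullable=false   <- as declared

    df.write.mode("overwrite").parquet("/tmp/nullability-repro")

    val back = spark.read.parquet("/tmp/nullability-repro")
    back.schema.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
    // id: nullable=true    <- the non-nullability has been discarded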
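
The best I’ve come up with so far is to forcibly re-apply my schema after the read. As far as I can tell, spark.read.schema(…) doesn’t help, because file sources appear to force a user-supplied schema to nullable anyway; rebuilding the DataFrame from the underlying RDD does keep the flags, but at the cost of a DataFrame -> RDD -> DataFrame round trip:

    // Re-assert the original schema on the freshly read DataFrame.
    // createDataFrame(RDD[Row], schema) uses the schema exactly as given,
    // including nullable = false, unlike the parquet reader itself.
    val fixed = spark.createDataFrame(back.rdd, schema)
    fixed.schema.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
    // id: nullable=false   <- restored, but only via the RDD round trip

Is that really the intended way to do this, or is there something more direct that I’m missing?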