Hi.

I’ve spent the last couple of hours trying to chase down an issue with 
writing and reading parquet files.  I was trying to save (and then read back 
in) a parquet file with a schema that correctly records which of my columns 
are non-nullable.  After having no success for some time, I posted to Stack 
Overflow about it:

https://stackoverflow.com/q/72877780/1212960

If you read the question, you’ll see that I can trace the schema being written 
with the correct (non-)nullability information, and I can even see that the 
correct nullability information is found when the parquet file is read back 
in.  However, by the time the data frame is handed back to me, the 
(non-)nullability information has been thrown away.
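
For reference, here’s a minimal Scala sketch of the behaviour (simplified, 
with a made-up /tmp path rather than my real code and data):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // A schema that explicitly marks the column as non-nullable.
    val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1L), Row(2L))),
      schema
    )
    df.printSchema()                 // id: long (nullable = false)

    df.write.mode("overwrite").parquet("/tmp/nullability-test")

    // The file records the column as "required", but the DataFrame I get
    // back reports it as nullable again.
    val readBack = spark.read.parquet("/tmp/nullability-test")
    readBack.printSchema()           // id: long (nullable = true)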

Only after posting the question did I find this sentence in the Parquet Files 
documentation 
(https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html): 

When reading Parquet files, all columns are automatically converted to be 
nullable for compatibility reasons.

It seems odd to me that:

 - Spark tracks nullable and non-nullable columns
 - Parquet tracks nullable and non-nullable columns
 - Spark can write a parquet file with correctly annotated non-nullable columns
 - Spark can correctly read a parquet file and identify non-nullable columns

… and yet, Spark deliberately discards this information.  

I understand (from this article: 
https://medium.com/@weshoffman/apache-spark-parquet-and-troublesome-nulls-28712b06f836#:~:text=A%20column%E2%80%99s%20nullable%20characteristic%20is%20a%20contract%20with%20the%20Catalyst%20Optimizer%20that%20null%20data%20will%20not%20be%20produced)
 that Spark itself doesn’t enforce the non-nullability; it’s simply 
information used by the Catalyst optimiser.  And I also understand that Bad 
Things™ would happen if null values were inserted into a column the schema 
says is non-nullable.

But I don’t understand why I can’t accept those caveats and have Spark retain 
my schema properly, with the non-nullability information maintained.  What are 
these compatibility reasons that the documentation alludes to?

Is this behaviour configurable?  How can I create a dataframe from a parquet 
file that uses the schema as-is (i.e. honouring the nullability information)?  
I want the Catalyst optimiser to have this information available, and I control 
my data to ensure that it meets the nullability constraints.
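
The only workaround I can see so far is to rebuild the DataFrame against my 
own schema after the read, along the lines of the sketch below (continuing 
from the snippet above):

    // Re-apply the schema I originally wrote, which restores nullable = false,
    // but forces a detour through the RDD API just to put back information
    // that was in the file all along.
    val restored = spark.createDataFrame(readBack.rdd, schema)
    restored.printSchema()           // id: long (nullable = false)

Is there a cleaner way to do this, ideally one that just trusts the schema 
stored in the file?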

Thanks for your time.

Kindest regards,

—
Greg

