Github user JasonMWhite commented on the issue:

    https://github.com/apache/spark/pull/17293
  
    Reliable nullability information matters to us for far more than 
non-nullable optimizations. I would happily opt in to any performance penalty 
that validated that non-nullable columns were actually non-nullable, with a 
hard fail on encountering an unexpected null. This is a real problem we face 
running a reasonably large data warehouse.
    
    In fact, we already do this in another project through the use of a custom 
function called `assert_not_null` that throws an exception if it encounters a 
null in a specific field. This is awkward for us because:
    - it requires sideband storage of the schema, or the use of another 
library to read the actual schema of the Parquet files to identify the columns 
that should be non-nullable
    - UDFs can't be marked non-nullable AFAIK (at least they couldn't be when 
I last looked; please LMK if this is no longer the case), so we have to reach 
into the protected Spark namespace to add this function


