[ https://issues.apache.org/jira/browse/SPARK-55716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao resolved SPARK-55716.
------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

Issue resolved by pull request 54517
[https://github.com/apache/spark/pull/54517]

> V1 file-based DataSource writes silently accept null values into NOT NULL 
> columns
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-55716
>                 URL: https://issues.apache.org/jira/browse/SPARK-55716
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> V1 file-based DataSource writes (parquet/orc/json) silently accept null 
> values into NOT NULL columns. The root cause has two parts:
> 1. `DataSource.resolveRelation()` calls `dataSchema.asNullable` at line 439, 
> which recursively strips NOT NULL constraints, including from nested types. 
> This was added in SPARK-13738 (2016) for read safety, since files may contain 
> nulls regardless of the declared schema, but it also applies to the write 
> path.
> 2. `CreateDataSourceTableCommand` stores `dataSource.schema` 
> (post-asNullable) in the catalog at line 111, permanently losing NOT NULL 
> information.
> As a result, `PreprocessTableInsertion` never injects `AssertNotNull` for V1 
> file source tables because the schema it sees is all-nullable.
> Note that `InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have 
> this problem because it preserves the original schema (SPARK-24583).
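To illustrate the first root cause, here is a hypothetical sketch (plain Python with a toy dict-based schema encoding, not Spark's actual Scala implementation of `StructType.asNullable`) of what a recursive as-nullable transform does: every nullability flag is forced to true, at every nesting level, so NOT NULL information is lost.

```python
# Illustration only: a toy recursive asNullable, not Spark's implementation.
# A schema node is a dict with "type" and "nullable"; structs carry "fields",
# arrays carry "element", maps carry "value".

def as_nullable(node):
    """Return a copy of a schema node with every nullability flag forced to True."""
    out = dict(node, nullable=True)
    if node["type"] == "struct":
        out["fields"] = [as_nullable(f) for f in node["fields"]]
    elif node["type"] == "array":
        out["element"] = as_nullable(node["element"])
    elif node["type"] == "map":
        out["value"] = as_nullable(node["value"])
    return out

schema = {"type": "struct", "nullable": False, "fields": [
    {"name": "id", "type": "long", "nullable": False},
    {"name": "tags", "type": "array", "nullable": False,
     "element": {"type": "string", "nullable": False}},
]}

stripped = as_nullable(schema)
# Every NOT NULL constraint is gone, including the array element's.
```

Once this stripped schema is stored in the catalog, no later phase can tell which columns the user declared NOT NULL.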
> **Fix:**
> - Fix `CreateDataSourceTableCommand` to preserve user-specified nullability 
> using recursive nullability merging (the resolved `dataSource.schema` may 
> have CharVarchar normalization and metadata that must be kept).
> - Fix `PreprocessTableInsertion` to restore nullability flags from the 
> catalog schema before null checks.
> - Add a legacy config `spark.sql.legacy.allowNullInsertForFileSourceTables` 
> (default false) to gate the write-side enforcement for backward compatibility.
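The first fix bullet could be sketched as follows (a hypothetical plain-Python illustration using a toy dict-based schema encoding; Spark's actual fix operates on `StructType` in Scala): nullability flags come from the user-specified schema, while names, types, and metadata (e.g. CharVarchar normalization) come from the resolved schema.

```python
# Illustration only: toy recursive nullability merging, not Spark's code.
# Keep everything from the resolved schema except the nullable flags, which
# are restored from the user-specified schema.

def merge_nullability(resolved, user):
    """Copy `user`'s nullability into `resolved`, recursing into nested types."""
    out = dict(resolved, nullable=user["nullable"])
    if resolved["type"] == "struct":
        by_name = {f["name"]: f for f in user["fields"]}
        out["fields"] = [
            merge_nullability(f, by_name.get(f["name"], f))
            for f in resolved["fields"]
        ]
    elif resolved["type"] == "array":
        out["element"] = merge_nullability(resolved["element"], user["element"])
    elif resolved["type"] == "map":
        out["value"] = merge_nullability(resolved["value"], user["value"])
    return out

# The resolved schema lost NOT NULL but carries metadata; the user schema
# still has the original nullability.
resolved = {"type": "struct", "nullable": True, "fields": [
    {"name": "id", "type": "long", "nullable": True,
     "metadata": {"note": "from resolved schema"}},
]}
user = {"type": "struct", "nullable": False, "fields": [
    {"name": "id", "type": "long", "nullable": False},
]}

merged = merge_nullability(resolved, user)
# merged keeps the resolved metadata but the user's NOT NULL flags.
```

Falling back to the resolved field when a name is missing from the user schema keeps the merge total: unmatched fields simply stay nullable.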
> **Scope:**
> - This fix covers catalog-based table writes (INSERT INTO, INSERT OVERWRITE).
> - DataFrame `df.write.format().save()` without a catalog table is NOT 
> affected (no catalog schema to reference).
> - Both top-level and nested type nullability (array elements, struct fields, 
> map values) are enforced.
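The nested enforcement in the last scope bullet could be sketched as follows (again a hypothetical plain-Python illustration with a toy dict-based schema encoding; Spark's actual enforcement injects `AssertNotNull` expressions into the query plan): a recursive check that rejects nulls in any non-nullable position, covering array elements, struct fields, and map values.

```python
# Illustration only: a toy recursive null check, not Spark's AssertNotNull.

def assert_not_null(value, node, path="root"):
    """Raise ValueError if `value` violates a NOT NULL constraint anywhere."""
    if value is None:
        if not node["nullable"]:
            raise ValueError(f"NULL value in NOT NULL position: {path}")
        return
    if node["type"] == "struct":
        for f in node["fields"]:
            assert_not_null(value.get(f["name"]), f, f"{path}.{f['name']}")
    elif node["type"] == "array":
        for i, v in enumerate(value):
            assert_not_null(v, node["element"], f"{path}[{i}]")
    elif node["type"] == "map":
        for k, v in value.items():
            assert_not_null(v, node["value"], f"{path}[{k!r}]")

schema = {"type": "array", "nullable": False,
          "element": {"type": "long", "nullable": False}}

assert_not_null([1, 2, 3], schema)  # passes
# assert_not_null([1, None], schema) would raise ValueError
```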



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]