[PR] [SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources [spark]

via GitHub Tue, 26 May 2026 00:46:47 -0700


LuciferYang opened a new pull request, #56111:
URL: https://github.com/apache/spark/pull/56111


   ### What changes were proposed in this pull request?
   
   In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite`
   branch calls `catalog.loadTable(ident)` without catching
   `NoSuchTableException` when the V2 source implements
   `SupportsCatalogOptions`. The exception propagates straight to the user,
   even though `SaveMode.ErrorIfExists` and `SaveMode.Ignore` on the same call
   succeed by routing to `CreateTableAsSelect`.
   
   This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and
   routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the
   `createMode` arm immediately below. `SaveMode.Append` on a non-existent
   identifier intentionally continues to throw, because Append explicitly
   expects an existing table and silently creating one would mask user
   mistakes.
   
   A new internal SQL conf,
   `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows`,
   restores the pre-fix behavior for users who depend on it.
   
   The `CreateTableAsSelect` construction shared between the new fall-back path
   and the existing `createMode` arm is extracted into a private helper
   `createTableAsSelectForCatalogOptions` to keep both sites in sync.
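The dispatch described above can be sketched as a small, dependency-free model. This is an illustration only, not the Spark implementation: `plan_save`, `load_table`, `NoSuchTableError`, the set-based catalog, and the returned action strings are all hypothetical stand-ins, and `legacy_overwrite_throws` stands in for the new legacy conf under the assumption that it is a boolean where true restores the old behavior.

```python
from enum import Enum


class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"


class NoSuchTableError(Exception):
    """Hypothetical stand-in for Spark's NoSuchTableException."""


def load_table(catalog, ident):
    # Toy version of catalog.loadTable(ident): the catalog is a set of names.
    if ident not in catalog:
        raise NoSuchTableError(ident)
    return ident


def plan_save(mode, catalog, ident, legacy_overwrite_throws=False):
    """Return the action the writer would take for this mode (toy model)."""
    if mode in (SaveMode.APPEND, SaveMode.OVERWRITE):
        try:
            load_table(catalog, ident)
        except NoSuchTableError:
            # New fallback: Overwrite only, and only when the legacy flag is off.
            if mode is SaveMode.OVERWRITE and not legacy_overwrite_throws:
                return "create"  # models CreateTableAsSelect(ignoreIfExists = false)
            raise  # Append, or Overwrite under the legacy flag, still throws.
        return "append" if mode is SaveMode.APPEND else "overwrite"
    # createMode arm: ErrorIfExists and Ignore route to a create-table plan.
    # What happens when the table already exists is not modeled here.
    return "create"
```

With an empty catalog, `plan_save(SaveMode.OVERWRITE, set(), "t")` returns `"create"`, while `plan_save(SaveMode.APPEND, set(), "t")` raises `NoSuchTableError`, matching the intentional Append behavior.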
   
   ### Why are the changes needed?
   
   The most idiomatic write call for any V2 connector,
   
       df.write.format(provider).mode("overwrite").save(newPath)
   
   fails with `NoSuchTableException` when `newPath` does not yet exist, whereas
   the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the
   table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance,
   and custom connectors) all hit this asymmetry. The fix aligns V2
   `SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the
   table, overwrite-on-existing truncates and writes.
   
   Behavior matrix after this change:
   
   | Mode × Target          | V1            | V2 before    | V2 after   |
   |------------------------|---------------|--------------|------------|
   | Overwrite, missing     | creates       | **throws**   | creates    |
   | Overwrite, existing    | truncate+write| overwrite    | unchanged  |
   | Append, missing        | creates       | throws       | throws*    |
   | Append, existing       | append        | append       | unchanged  |
   | ErrorIfExists, missing | creates       | creates      | unchanged  |
   | ErrorIfExists, existing| throws        | throws       | unchanged  |
   | Ignore, missing        | creates       | creates      | unchanged  |
   | Ignore, existing       | no-op         | no-op        | unchanged  |
   
   \* Intentional divergence from V1: Append expects an existing table, so a
   missing one still throws.
   
   There is an inherent race window between `loadTable` (throws) and
   `CreateTableAsSelect`: a concurrent writer creating the table in between
   will cause `TableAlreadyExistsException` rather than overwriting. This is
   acceptable; V1's filesystem-atomic path doesn't expose it because V1 never
   consults a catalog. Users can retry the write.
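The race window and the retry can be illustrated with a toy in-memory catalog. This is a sketch under stated assumptions, not Spark code: `ToyCatalog`, `overwrite`, `overwrite_with_retry`, the two exception classes, and the `between_load_and_create` hook (which simulates a concurrent writer) are all hypothetical names introduced for illustration.

```python
class NoSuchTableError(Exception):
    """Hypothetical stand-in for NoSuchTableException."""


class TableAlreadyExistsError(Exception):
    """Hypothetical stand-in for TableAlreadyExistsException."""


class ToyCatalog:
    def __init__(self):
        self.tables = {}

    def load_table(self, ident):
        if ident not in self.tables:
            raise NoSuchTableError(ident)
        return self.tables[ident]

    def create_table(self, ident, rows):
        if ident in self.tables:
            raise TableAlreadyExistsError(ident)
        self.tables[ident] = list(rows)


def overwrite(catalog, ident, rows, between_load_and_create=None):
    """Overwrite with create-on-missing; the hook simulates a concurrent writer."""
    try:
        catalog.load_table(ident)
    except NoSuchTableError:
        if between_load_and_create is not None:
            between_load_and_create()  # another writer may create the table here
        catalog.create_table(ident, rows)  # fails if the race was lost
        return "created"
    catalog.tables[ident] = list(rows)
    return "overwritten"


def overwrite_with_retry(catalog, ident, rows, attempts=2, first_attempt_hook=None):
    """Retry once the table exists; the second attempt takes the overwrite path."""
    for attempt in range(attempts):
        hook = first_attempt_hook if attempt == 0 else None
        try:
            return overwrite(catalog, ident, rows, hook)
        except TableAlreadyExistsError:
            if attempt == attempts - 1:
                raise
```

In this model, a writer that loses the race on its first attempt sees `TableAlreadyExistsError`, and the retry finds the table and overwrites it, which is the outcome the user asked for in the first place.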
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite")
   .save(<new identifier>)` now creates the table instead of throwing
   `NoSuchTableException`. No behavior change for paths that already exist. The
   migration guide has been updated. The legacy flag
   `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the
   prior behavior.
   
   ### How was this patch tested?
   
   New tests in `SupportsCatalogOptionsSuite`:
   - `save works with Overwrite - no table, no partitioning, session catalog`
   - `save works with Overwrite - no table, with partitioning, session catalog`
   - `save works with Overwrite - no table, no partitioning, testcat catalog`
   - `save works with Overwrite - no table, with partitioning, testcat catalog`
   
   These reuse the existing `testCreateAndRead` helper, which verifies catalog
   state (table identity, partitioning, columns) in addition to data.
   
   Plus three behavior-pinning tests:
   - `Append mode still fails when table is missing - testcat catalog` (pins
     the intentional Append divergence)
   - `legacy flag restores throw on Overwrite-missing` (verifies the new conf)
   - `Overwrite + withSchemaEvolution on missing table is rejected` (verifies
     the schema-evolution gate fires with the expected error class)
   
   Existing tests continue to pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

