LuciferYang opened a new pull request, #56111:
URL: https://github.com/apache/spark/pull/56111
### What changes were proposed in this pull request?
In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite`
branch calls `catalog.loadTable(ident)` without catching
`NoSuchTableException`
when the V2 source implements `SupportsCatalogOptions`. The exception
propagates straight to the user, even though `SaveMode.ErrorIfExists` and
`SaveMode.Ignore` on the same call succeed by routing to
`CreateTableAsSelect`.
This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and
routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the
`createMode` arm immediately below. `SaveMode.Append` on a non-existent
identifier intentionally continues to throw, because Append explicitly
expects
an existing table and silently creating would mask user mistakes.
A new internal SQL conf
`spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows`
restores the pre-fix behavior for users who depend on it.
The `CreateTableAsSelect` construction shared between the new fall-back path
and the existing `createMode` arm is extracted into a private helper
`createTableAsSelectForCatalogOptions` to keep both sites in sync.
### Why are the changes needed?
The most idiomatic write call for any V2 connector,
df.write.format(provider).mode("overwrite").save(newPath)
fails with `NoSuchTableException` when `newPath` does not yet exist, whereas
the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the
table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance,
and
custom connectors) all hit this asymmetry. The fix aligns V2
`SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the
table, overwrite-on-existing truncates and writes.
Behavior matrix after this change:
| Mode × Target | V1 | V2 before | V2 after |
|------------------------|---------------|--------------|------------|
| Overwrite, missing | creates | **throws** | creates |
| Overwrite, existing | truncate+write| overwrite | unchanged |
| Append, missing | creates | throws | throws* |
| Append, existing | append | append | unchanged |
| ErrorIfExists, missing | creates | creates | unchanged |
| ErrorIfExists, existing| throws | throws | unchanged |
| Ignore, missing | creates | creates | unchanged |
| Ignore, existing | no-op | no-op | unchanged |
\* Intentional V1 divergence — see PR description.
There is an inherent race window between `loadTable` (throws) and
`CreateTableAsSelect`: a concurrent writer creating the table in between
will cause `TableAlreadyExistsException` rather than overwriting. This is
acceptable; V1's filesystem-atomic path doesn't expose it because V1 never
consults a catalog. Users retry.
### Does this PR introduce _any_ user-facing change?
Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite")
.save(<new identifier>)` now creates the table instead of throwing
`NoSuchTableException`. No behavior change for paths that already exist. The
migration guide has been updated. The legacy flag
`spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the
prior behavior.
### How was this patch tested?
New tests in `SupportsCatalogOptionsSuite`:
- `save works with Overwrite - no table, no partitioning, session catalog`
- `save works with Overwrite - no table, with partitioning, session catalog`
- `save works with Overwrite - no table, no partitioning, testcat catalog`
- `save works with Overwrite - no table, with partitioning, testcat catalog`
These reuse the existing `testCreateAndRead` helper, which verifies catalog
state (table identity, partitioning, columns) in addition to data.
Plus three behavior-pinning tests:
- `Append mode still fails when table is missing - testcat catalog` (pins
the intentional Append divergence)
- `legacy flag restores throw on Overwrite-missing` (verifies the new conf)
- `Overwrite + withSchemaEvolution on missing table is rejected` (verifies
the schema-evolution gate fires with the expected error class)
Existing tests continue to pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]