Juliusz Sompolski created SPARK-54462:
-----------------------------------------
Summary: Delta DataFrameWriter saveAsTable with Overwrite mode
broken in connect mode
Key: SPARK-54462
URL: https://issues.apache.org/jira/browse/SPARK-54462
Project: Spark
Issue Type: Improvement
Components: Connect
Affects Versions: 4.1.0
Reporter: Juliusz Sompolski
Spark's SaveMode.Overwrite is documented as:
```
* if data/table already exists, existing data is expected to be
overwritten
* by the contents of the DataFrame.
```
It does not define the behaviour of overwriting the table metadata (schema,
etc.). The Delta datasource's interpretation of this DataFrameWriter V1 API
documentation is to not replace the table schema unless the Delta-specific
option "overwriteSchema" is set to true.
However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the
same plan that the DataFrameWriterV2 createOrReplace API creates, and that API
is documented as:
```
* The output table's schema, partition layout, properties, and other
configuration
* will be based on the contents of the data frame and the configuration
set on this
* writer. If the table exists, its configuration and data will be
replaced.
```
Therefore, for calls via the DataFrameWriter V2 createOrReplace API, the
metadata must always be replaced, and the Delta datasource does not consult the
overwriteSchema option.
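Under these documented semantics, the schema-replacement decision a datasource
has to make can be modeled roughly as follows. This is a minimal pure-Python
sketch of the contract, not Delta's actual implementation; the function name
and the API-path labels are illustrative, and only "overwriteSchema" comes from
the issue text:

```python
def should_replace_schema(api: str, options: dict) -> bool:
    """Model of the documented semantics:
    - DataFrameWriterV2 createOrReplace always replaces the table's
      schema and configuration.
    - DataFrameWriter V1 saveAsTable with SaveMode.Overwrite replaces
      only the data, unless the Delta-specific option
      "overwriteSchema" is set to "true".
    """
    if api == "v2_createOrReplace":
        return True
    if api == "v1_saveAsTable_overwrite":
        return options.get("overwriteSchema", "false").lower() == "true"
    raise ValueError(f"unknown API path: {api}")
```

The problem described in this issue is that, because both APIs produce the same
ReplaceTableAsSelect plan, the datasource cannot tell which branch of this
decision it should be taking.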
Since the created plan is exactly the same, Delta has used a very ugly hack:
detecting which API the call came from by inspecting the stack trace of the
call. In Spark 4.1 in Connect mode, this stopped working because planning and
execution of commands were decoupled, so the stack trace no longer contains
the point where the plan was created.
To retain compatibility of the Delta datasource with Spark 4.1 in Connect
mode, Spark provides an explicit storage option indicating to the Delta
datasource that the call is coming from DataFrameWriter V1.
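The difference between the two detection strategies can be sketched as follows.
All names here are hypothetical (the actual storage option introduced by Spark
and Delta's real Scala code are not shown); the sketch only illustrates why
stack inspection breaks once planning is decoupled from the API call, while an
explicit option survives:

```python
import traceback

def came_from_v1_via_stack() -> bool:
    # Old approach: walk the current call stack looking for the frame
    # where DataFrameWriter V1 built the plan ("saveAsTable" is a
    # hypothetical frame name). In Spark Connect, the plan is executed
    # in a separate step, so that frame is absent and this returns a
    # wrong answer.
    return any(frame.name == "saveAsTable"
               for frame in traceback.extract_stack())

# New approach: the V1 writer path records an explicit marker in the
# storage options, so the datasource no longer depends on the shape of
# the call stack. The option name below is illustrative only.
V1_MARKER = "__dataframe_writer_v1"

def came_from_v1_via_option(storage_options: dict) -> bool:
    return storage_options.get(V1_MARKER, "false") == "true"
```

With the option-based check, the information travels with the plan itself
rather than with the thread that happened to create it.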
Followup: Since the documented semantics of Spark's DataFrameWriter V1
saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark
should not reuse the exact same logical plan for these APIs.
Existing datasources that were implemented following Spark's documentation of
these APIs should have a way to differentiate between them.
However, at this point, releasing Spark 4.1 as is would cause data corruption
issues with Delta in DataFrameWriter saveAsTable in overwrite mode, because
Delta would not correctly interpret its overwriteSchema option.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)