[
https://issues.apache.org/jira/browse/SPARK-54462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-54462.
---------------------------------
Fix Version/s: 4.1.0
Resolution: Fixed
Issue resolved by pull request 53215
[https://github.com/apache/spark/pull/53215]
> Delta DataFrameWriter saveAsTable with Overwrite mode broken in connect mode
> ----------------------------------------------------------------------------
>
> Key: SPARK-54462
> URL: https://issues.apache.org/jira/browse/SPARK-54462
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.1.0
> Reporter: Juliusz Sompolski
> Assignee: Juliusz Sompolski
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Spark's SaveMode.Overwrite is documented as:
> ```
> * if data/table already exists, existing data is expected to be
> overwritten
> * by the contents of the DataFrame.
> ```
> It does not define the behaviour of overwriting the table metadata (schema,
> etc.). The Delta datasource's interpretation of this DataFrameWriter V1
> documentation is to not replace the table schema unless the Delta-specific
> option "overwriteSchema" is set to true.
> However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is
> the same plan that the DataFrameWriterV2 createOrReplace API creates. That
> API is documented as:
> ```
> * The output table's schema, partition layout, properties, and other
> configuration
> * will be based on the contents of the data frame and the
> configuration set on this
> * writer. If the table exists, its configuration and data will be
> replaced.
> ```
> Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata
> always needs to be replaced, and the Delta datasource does not consult the
> overwriteSchema option.
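> The two documented behaviours amount to a simple decision rule. The sketch below is purely illustrative plain Python, not Delta's actual code; all names in it are assumptions.

```python
# Illustrative sketch (not Delta's implementation): how a datasource could
# interpret the two documented overwrite semantics. All names are assumed.

def should_replace_schema(api, overwrite_schema_option=False):
    """Decide whether an overwrite also replaces table metadata (schema etc.).

    api: "v1_saveAsTable"      -- DataFrameWriter V1 with SaveMode.Overwrite
         "v2_createOrReplace"  -- DataFrameWriterV2 createOrReplace
    """
    if api == "v2_createOrReplace":
        # createOrReplace is documented to replace configuration and data.
        return True
    if api == "v1_saveAsTable":
        # V1 overwrite only replaces data, unless the Delta-specific
        # overwriteSchema option is explicitly set.
        return bool(overwrite_schema_option)
    raise ValueError(f"unknown API: {api}")
```

> The point of the sketch is that the decision depends on which API was called, which is exactly the information the shared ReplaceTableAsSelect plan fails to carry.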
> Since the created plan is exactly the same, Delta used a fragile hack:
> inspecting the stack trace of the call to detect which API the call came
> from.
> In Spark 4.1 in connect mode, this stopped working: planning and execution
> of commands are decoupled, and the stack trace no longer contains the point
> where the plan was created.
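> A minimal sketch of what such a stack-trace hack looks like, and why it is fragile, in plain Python using the standard traceback module. This is purely illustrative, not Delta's actual code.

```python
import traceback

# Illustrative only: detect the calling API by scanning the current stack
# for a frame whose function name matches the assumed V1 entry point.
# This breaks as soon as that entry point no longer appears on the stack,
# e.g. when plan creation and execution are decoupled (Spark Connect).
def called_from_save_as_table():
    return any(frame.name == "saveAsTable"
               for frame in traceback.extract_stack())
```

> When executed directly the check returns False; it only returns True while a function literally named saveAsTable is on the stack, which is exactly the coupling that connect mode removes.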
> To retain compatibility of the Delta datasource with Spark 4.1 in connect
> mode, Spark provides an explicit storage option indicating to the Delta
> datasource that the call is coming from DataFrameWriter V1.
> Followup: Since the documented semantics of Spark's DataFrameWriter V1
> saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark
> should not reuse the exact same logical plan for these APIs.
> Existing datasources that have been implemented following Spark's
> documentation of these APIs should have a way to differentiate between
> them.
> However, at this point releasing Spark 4.1 as is would cause data
> corruption issues with Delta in DataFrameWriter saveAsTable in overwrite
> mode, as Delta would not correctly interpret its overwriteSchema option.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)