[ 
https://issues.apache.org/jira/browse/SPARK-54462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-54462.
---------------------------------
    Fix Version/s: 4.1.0
       Resolution: Fixed

Issue resolved by pull request 53215
[https://github.com/apache/spark/pull/53215]

> Delta DataFrameWriter saveAsTable with Overwrite mode broken in connect mode
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-54462
>                 URL: https://issues.apache.org/jira/browse/SPARK-54462
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 4.1.0
>            Reporter: Juliusz Sompolski
>            Assignee: Juliusz Sompolski
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0
>
>
> Spark's SaveMode.Overwrite is documented as:
> ```
> if data/table already exists, existing data is expected to be overwritten
> by the contents of the DataFrame.
> ```
> It does not define the behaviour of overwriting the table metadata (schema, 
> etc.). The Delta datasource's interpretation of this DataFrameWriter V1 API 
> documentation is to not replace the table schema, unless the Delta-specific 
> option "overwriteSchema" is set to true.
> However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the 
> same plan created by the DataFrameWriterV2 createOrReplace API, documented as:
> ```
> The output table's schema, partition layout, properties, and other
> configuration will be based on the contents of the data frame and the
> configuration set on this writer. If the table exists, its configuration
> and data will be replaced.
> ```
> Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata 
> always needs to be replaced, and the Delta datasource does not use the 
> overwriteSchema option. Since the created plan is exactly the same for both 
> APIs, Delta resorted to a very ugly hack: it detected which API the call 
> came from by inspecting the stack trace.
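The two documented behaviours that get conflated into one plan can be sketched with a toy in-memory catalog (plain Python, no Spark; all names here are illustrative, not real Spark or Delta APIs):

```python
# Toy model of the two documented semantics; not real Spark/Delta code.

class Table:
    def __init__(self, schema, rows):
        self.schema = schema  # e.g. ["id"]
        self.rows = rows

catalog = {}

def save_as_table_v1_overwrite(name, schema, rows, overwrite_schema=False):
    # DataFrameWriter V1, per Delta's reading: replace the data, but keep
    # the existing schema unless overwriteSchema is explicitly set.
    existing = catalog.get(name)
    if existing is not None and not overwrite_schema:
        if schema != existing.schema:
            raise ValueError("set overwriteSchema=true to change the schema")
        catalog[name] = Table(existing.schema, rows)
    else:
        catalog[name] = Table(schema, rows)

def create_or_replace_v2(name, schema, rows):
    # DataFrameWriterV2 createOrReplace: configuration and data are both
    # replaced, unconditionally.
    catalog[name] = Table(schema, rows)

catalog["t"] = Table(["id"], [[1], [2]])
create_or_replace_v2("t", ["id", "name"], [[1, "a"]])
assert catalog["t"].schema == ["id", "name"]  # V2 replaced the schema

catalog["t"] = Table(["id"], [[1], [2]])
try:
    save_as_table_v1_overwrite("t", ["id", "name"], [[1, "a"]])
except ValueError:
    pass  # V1 overwrite refuses a schema change by default
assert catalog["t"].schema == ["id"]  # schema preserved
```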
> In Spark 4.1 in connect mode, this stopped working: planning and execution 
> of commands are decoupled, so the stack trace no longer contains the point 
> where the plan was created.
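The essence of that hack can be illustrated in plain Python (Delta's actual code is Scala and walks JVM stack frames; the function names below are made up for the sketch):

```python
import inspect

def called_from(function_name):
    # Walk the current call stack looking for a frame with the given
    # function name -- the core of the stack-trace hack.
    return any(frame.function == function_name for frame in inspect.stack())

def write_table():
    # Decide semantics based on who called us. Fragile by design: when
    # planning and execution are decoupled (as in Spark Connect), the
    # interesting frame is no longer on this stack at execution time.
    if called_from("save_as_table_v1"):
        return "V1: keep schema unless overwriteSchema is set"
    return "V2: replace schema"

def save_as_table_v1():
    return write_table()

print(save_as_table_v1())  # the V1 frame is on the stack
print(write_table())       # no V1 frame: treated as V2
```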
> To retain compatibility of the Delta datasource with Spark 4.1 in connect 
> mode, Spark now provides an explicit storage option indicating to the Delta 
> datasource that the call is coming from DataFrameWriter V1.
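A sketch of the option-based replacement for the hack (the option name `__dataframe_writer_v1` is hypothetical; the real name is whatever the Spark patch defines):

```python
# Hypothetical marker option name -- the real one is defined by the fix.
V1_MARKER = "__dataframe_writer_v1"

def delta_overwrite(storage_options, overwrite_schema=False):
    # The datasource decides semantics from an explicit option rather than
    # from the caller's stack trace, so the decision survives the
    # planning/execution split of Spark Connect.
    if storage_options.get(V1_MARKER) == "true" and not overwrite_schema:
        return "keep schema"
    return "replace schema"

# V1 path: Spark sets the marker while building the plan.
assert delta_overwrite({V1_MARKER: "true"}) == "keep schema"
# The Delta-specific overwriteSchema option still opts in to replacement.
assert delta_overwrite({V1_MARKER: "true"}, overwrite_schema=True) == "replace schema"
# V2 createOrReplace: no marker, schema is always replaced.
assert delta_overwrite({}) == "replace schema"
```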
> Followup: since the documented semantics of Spark's DataFrameWriter V1 
> saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark 
> should not reuse the exact same logical plan for these APIs. Existing 
> datasources that were implemented following Spark's documentation of these 
> APIs should have a way to differentiate between them.
> However, releasing Spark 4.1 as is would at this point cause data corruption 
> issues with Delta in DataFrameWriter saveAsTable in overwrite mode, as Delta 
> would not correctly interpret its overwriteSchema option.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
