Juliusz Sompolski created SPARK-54462:
-----------------------------------------

             Summary: Delta DataFrameWriter saveAsTable with Overwrite mode 
broken in connect mode
                 Key: SPARK-54462
                 URL: https://issues.apache.org/jira/browse/SPARK-54462
             Project: Spark
          Issue Type: Improvement
          Components: Connect
    Affects Versions: 4.1.0
            Reporter: Juliusz Sompolski


Spark's SaveMode.Overwrite is documented as:

```
 * if data/table already exists, existing data is expected to be overwritten
 * by the contents of the DataFrame.
```
It does not define the behaviour of overwriting the table metadata (schema, etc.). The Delta datasource interprets this DataFrameWriter V1 API documentation as not replacing the table schema unless the Delta-specific option "overwriteSchema" is set to true.

However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the same plan produced by the DataFrameWriterV2 createOrReplace API, documented as:

```
 * The output table's schema, partition layout, properties, and other configuration
 * will be based on the contents of the data frame and the configuration set on this
 * writer. If the table exists, its configuration and data will be replaced.
```

Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata always needs to be replaced, and the Delta datasource does not consult the overwriteSchema option.
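
By contrast, a minimal sketch of the V2 path (same hypothetical "events" table), which produces the same ReplaceTableAsSelect plan but is documented to always replace the table's schema and configuration:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("v2-createOrReplace-sketch").getOrCreate()
val df = spark.range(10).withColumnRenamed("id", "value")

// V2 API: createOrReplace replaces the table's schema, partition layout,
// properties, and data based on the DataFrame and this writer's configuration.
df.writeTo("events")
  .using("delta")
  .createOrReplace()
```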
Since the created plan is exactly the same, Delta has been using a very ugly hack: detecting which API the call came from based on the stack trace of the call.
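
The detection was roughly of the following shape (an illustrative sketch, not Delta's actual code; the class and method names checked here are only indicative):

```
// Illustrative sketch only: decide whether the current command was initiated
// through the V1 DataFrameWriter by scanning the calling thread's stack trace.
// This breaks as soon as planning and execution no longer share a call stack.
def calledFromDataFrameWriterV1(): Boolean = {
  Thread.currentThread().getStackTrace.exists { frame =>
    frame.getClassName == "org.apache.spark.sql.DataFrameWriter" &&
      frame.getMethodName == "saveAsTable"
  }
}
```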

In Spark 4.1 in connect mode, this stopped working because planning and execution of the commands are decoupled, and the stack trace no longer contains the point where the plan was created.

To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark provides an explicit storage option to indicate to the Delta datasource that the call is coming from DataFrameWriter V1.
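
On the datasource side, this could look roughly as follows (a sketch under assumptions: the option name "__v1_writer__" is a placeholder, not the actual option name, and the helper is hypothetical):

```
// Sketch only: "__v1_writer__" is a placeholder option name, not the real one.
// The idea is that Spark's V1 saveAsTable path attaches an extra storage option
// to the generated ReplaceTableAsSelect, so the datasource can tell the two
// entry points apart without inspecting the stack trace.
def isV1SaveAsTable(storageOptions: Map[String, String]): Boolean =
  storageOptions.get("__v1_writer__").exists(_.equalsIgnoreCase("true"))
```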

Followup: Since the documented semantics of Spark's DataFrameWriter V1 saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark should not reuse the exact same logical plan for these APIs.
Existing datasources that have been implemented following Spark's documentation of these APIs should have a way to differentiate between them.

However, at this point releasing Spark 4.1 as-is would cause data corruption issues with Delta for DataFrameWriter saveAsTable in overwrite mode, as Delta would not be able to correctly interpret its overwriteSchema option.


