[ https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589086#comment-16589086 ]

Ryan Blue commented on SPARK-25188:
-----------------------------------

Here's the original proposal for adding a write config:

The read side has a {{ScanConfig}}, but the write side has no equivalent object 
that tracks a particular write. Introducing one would make the API more 
symmetric between the read and write sides, and it would give us a better API 
for overwrite operations. I propose adding a {{WriteConfig}} object and passing 
it like this:

{code:lang=java}
interface BatchWriteSupport {
  // Creates the configuration object that tracks one particular write.
  WriteConfig newWriteConfig(Map<String, String> writeOptions);

  DataWriterFactory createWriterFactory(WriteConfig config);

  void commit(WriteConfig config, WriterCommitMessage[] messages);
}
{code}

That lets us pass per-write options that affect how the {{DataWriterFactory}} 
operates. For example, in Iceberg I could request ORC as the underlying file 
format instead of Parquet. (I also suggested a similar addition for the read 
side.)
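
To make that concrete, here is a minimal sketch of what a source-side 
implementation might look like. {{FormatAwareWriteConfig}}, 
{{ExampleWriteSupport}}, and the {{write-format}} option key are hypothetical 
names for illustration, not part of the proposal:

{code:lang=java}
import java.util.Map;

// Hypothetical WriteConfig implementation that carries one write option.
class FormatAwareWriteConfig implements WriteConfig {
  private final String format;

  FormatAwareWriteConfig(String format) {
    this.format = format;
  }

  String format() {
    return format;
  }
}

class ExampleWriteSupport implements BatchWriteSupport {
  @Override
  public WriteConfig newWriteConfig(Map<String, String> writeOptions) {
    // Each write can choose the underlying file format; default to Parquet.
    return new FormatAwareWriteConfig(
        writeOptions.getOrDefault("write-format", "parquet"));
  }

  @Override
  public DataWriterFactory createWriterFactory(WriteConfig config) {
    String format = ((FormatAwareWriteConfig) config).format();
    // A real source would build writers for the chosen format here.
    throw new UnsupportedOperationException("sketch only, format=" + format);
  }

  @Override
  public void commit(WriteConfig config, WriterCommitMessage[] messages) {
    // Make the files described by the commit messages visible atomically.
  }
}
{code}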

The other benefit of adding {{WriteConfig}} is that it provides a clean way of 
adding the ReplaceData operations. The ones I'm currently working on are 
ReplaceDynamicPartitions and ReplaceData. The first removes any data in the 
partitions being written to, and the second replaces data matching a filter: 
e.g. {{df.writeTo(t).overwrite($"day" === "2018-08-15")}}. The information 
about what to replace could be carried by {{WriteConfig}} to {{commit}} and 
would be created through a support interface:

{code:lang=java}
interface BatchOverwriteSupport extends BatchWriteSupport {
  // Replace the rows matching the filters with the rows being written.
  WriteConfig newOverwrite(Map<String, String> writeOptions, Filter[] filters);

  // Replace the contents of each partition the write touches.
  WriteConfig newDynamicOverwrite(Map<String, String> writeOptions);
}
{code}

This is much cleaner than staging a delete and then running a write to complete 
the operation: all of the information about what to overwrite is passed to a 
single {{commit}} call that can handle it at once. It is especially valuable 
for dynamic partition replacement, where Spark doesn't even know which 
partitions will be replaced until the write has run.
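
As a sketch of how that information could flow (the names {{OverwriteConfig}} 
and {{ExampleOverwriteSupport}} below are hypothetical, not part of the 
proposal), the filters or the dynamic-partition flag would simply ride along in 
the {{WriteConfig}} until {{commit}}:

{code:lang=java}
import java.util.Map;

// Hypothetical WriteConfig that records what the commit should replace.
class OverwriteConfig implements WriteConfig {
  final Filter[] deleteFilters;    // rows matching these filters are replaced
  final boolean dynamicPartitions; // replace every partition the write touches

  OverwriteConfig(Filter[] deleteFilters, boolean dynamicPartitions) {
    this.deleteFilters = deleteFilters;
    this.dynamicPartitions = dynamicPartitions;
  }
}

// Abstract so the sketch can omit newWriteConfig and createWriterFactory,
// which would look the same as in the plain write case.
abstract class ExampleOverwriteSupport implements BatchOverwriteSupport {
  @Override
  public WriteConfig newOverwrite(Map<String, String> writeOptions, Filter[] filters) {
    return new OverwriteConfig(filters, false);
  }

  @Override
  public WriteConfig newDynamicOverwrite(Map<String, String> writeOptions) {
    return new OverwriteConfig(new Filter[0], true);
  }

  @Override
  public void commit(WriteConfig config, WriterCommitMessage[] messages) {
    OverwriteConfig overwrite = (OverwriteConfig) config;
    // In a single atomic operation: drop the rows matching the filters
    // (or the partitions that were written, in the dynamic case) and add
    // the new files described by the commit messages.
  }
}
{code}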

Last, this adds a place for write life-cycle operations that matches the 
{{ScanConfig}} read life-cycle. For example, it could be used to acquire a 
write lock on a Hive table if someone wanted to support Hive's locking 
mechanism in the future.
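
Purely as an illustration of the life-cycle idea (the names below are 
hypothetical, and a real implementation would go through Hive's lock manager 
rather than an in-process semaphore), a lock's lifetime could be tied to the 
{{WriteConfig}}:

{code:lang=java}
import java.util.Map;
import java.util.concurrent.Semaphore;

// Hypothetical WriteConfig whose lifetime brackets a table-level lock.
class LockedWriteConfig implements WriteConfig {
  final Semaphore tableLock; // stand-in for a Hive metastore lock

  LockedWriteConfig(Semaphore tableLock) {
    this.tableLock = tableLock;
  }
}

// Abstract so the sketch can omit createWriterFactory.
abstract class LockingWriteSupport implements BatchWriteSupport {
  private final Semaphore tableLock = new Semaphore(1); // one writer at a time

  @Override
  public WriteConfig newWriteConfig(Map<String, String> writeOptions) {
    // Acquire the table's write lock when the write is configured...
    tableLock.acquireUninterruptibly();
    return new LockedWriteConfig(tableLock);
  }

  @Override
  public void commit(WriteConfig config, WriterCommitMessage[] messages) {
    // ...and release it when the write's life-cycle ends.
    ((LockedWriteConfig) config).tableLock.release();
  }
}
{code}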

> Add WriteConfig
> ---------------
>
>                 Key: SPARK-25188
>                 URL: https://issues.apache.org/jira/browse/SPARK-25188
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.


