[ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-23202:
-----------------------------------
    Description: 
The current DataSourceWriter API makes it hard to implement 
{{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}.
In general, on receiving a commit message, the driver can start processing 
it (e.g. persisting messages into files) before all the messages are 
collected.

The proposal is to add a new API:
{{add(WriterCommitMessage message)}}: Handles a commit message as soon as it 
is received from a successful data writer.

This should make the whole API of DataSourceWriter compatible with 
{{FileCommitProtocol}}, and more flexible.

A more radical change was attempted in 
[#20386|https://github.com/apache/spark/pull/20386]. Adding a new API is the 
more reasonable approach.
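
A minimal sketch of the intended driver-side flow, assuming a final commit step still runs once every writer has reported. All names here ({{IncrementalCommitSketch}}, {{CommitMessage}}, {{DriverSideWriter}}, {{stagedCount}}, {{committedEntries}}) are hypothetical stand-ins for illustration, not the actual Spark DataSourceWriter API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; this is NOT the Spark DataSourceWriter API.
final class IncrementalCommitSketch {

    /** Stand-in for a commit message sent by one successful data writer. */
    static final class CommitMessage {
        final int partitionId;
        final String outputFile;

        CommitMessage(int partitionId, String outputFile) {
            this.partitionId = partitionId;
            this.outputFile = outputFile;
        }
    }

    /** Driver-side writer that handles commit messages one at a time. */
    static final class DriverSideWriter {
        private final List<String> staged = new ArrayList<>();
        private List<String> committed = Collections.emptyList();

        /** Called as each message arrives, before all tasks have finished. */
        void add(CommitMessage message) {
            // A real implementation might persist the message to a file here.
            staged.add(message.partitionId + ":" + message.outputFile);
        }

        /** Called once after every writer has reported; finalizes the job. */
        void commit() {
            committed = Collections.unmodifiableList(new ArrayList<>(staged));
            staged.clear();
        }

        int stagedCount() {
            return staged.size();
        }

        List<String> committedEntries() {
            return committed;
        }
    }

    public static void main(String[] args) {
        DriverSideWriter writer = new DriverSideWriter();
        writer.add(new CommitMessage(0, "part-00000"));
        writer.add(new CommitMessage(1, "part-00001"));
        writer.commit();
        System.out.println(writer.committedEntries());
    }
}
```

The point of the split is that {{add}} does the per-message work incrementally, so the final commit no longer has to process the whole array of messages at once.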

  was:
Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
writing job with a list of commit messages.

It makes sense in some scenarios, e.g. MicroBatchExecution.

However, on receiving a commit message, the driver can start processing it 
(e.g. persisting messages into files) before all the messages are collected.

The proposal is to break down DataSourceV2Writer.commit into two phases:
 # add(WriterCommitMessage message): Handles a commit message produced by 
\{@link DataWriter#commit()}.
 # commit(): Commits the writing job.

This should make the API more flexible, and more reasonable for implementing 
some data sources.


> Add new API in DataSourceWriter: onDataWriterCommit
> ---------------------------------------------------
>
>                 Key: SPARK-23202
>                 URL: https://issues.apache.org/jira/browse/SPARK-23202
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Gengliang Wang
>            Priority: Major


