GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/20454
[SPARK-23202][SQL] Add new DataSourceWriter API: onDataWriterCommit

## What changes were proposed in this pull request?

Currently, the API `DataSourceV2Writer#commit(WriterCommitMessage[])` commits a writing job with a list of commit messages.

It makes sense in some scenarios, e.g. MicroBatchExecution.
However, the API makes it hard to implement `onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected.

The proposal is to add a new API:
`add(WriterCommitMessage message)`: Handles a commit message on receiving it from a successful data writer.

This should make the whole API of DataSourceWriter compatible with `FileCommitProtocol`, and more flexible.

There was another, more radical attempt in #20386. This one should be more reasonable.

## How was this patch tested?

Unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark write_api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20454

----
commit 04edec2221a252ccfbcaf9e505eaae0a0f1664ab
Author: Wang Gengliang <ltnwgl@...>
Date:   2018-01-31T08:21:18Z

    new DataSourceWriter api: onDataWriterCommit

commit 89776eced1b60b1856d6157a30ad1d8be0ba0f81
Author: Wang Gengliang <ltnwgl@...>
Date:   2018-01-31T12:39:13Z

    revise comments and add test case

----
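
For readers who want a concrete picture of the per-message callback described in this pull request, here is a rough Java sketch. It is not the code in the patch and does not use Spark's real classes: the interfaces are simplified stand-ins, the callback is named `onDataWriterCommit` after the PR title (the description refers to it as `add`), and `FileWritten`, `TrackingWriter`, and `runJob` are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalCommitSketch {

    /** Stand-in for Spark's WriterCommitMessage marker interface. */
    interface WriterCommitMessage {}

    /** Simplified, hypothetical slice of the writer API; not Spark's actual interface. */
    interface DataSourceWriter {
        /** Called on the driver each time one data writer commits successfully. */
        default void onDataWriterCommit(WriterCommitMessage message) {}

        /** Called once after every data writer has committed. */
        void commit(WriterCommitMessage[] messages);
    }

    /** Hypothetical commit message carrying the path of a file written by a task. */
    static final class FileWritten implements WriterCommitMessage {
        final String path;

        FileWritten(String path) {
            this.path = path;
        }
    }

    /** Hypothetical writer that records each file as soon as its message arrives. */
    static final class TrackingWriter implements DataSourceWriter {
        final List<String> seenSoFar = new ArrayList<>();
        boolean jobCommitted = false;

        @Override
        public void onDataWriterCommit(WriterCommitMessage message) {
            // Handle the message immediately, before the remaining tasks finish.
            seenSoFar.add(((FileWritten) message).path);
        }

        @Override
        public void commit(WriterCommitMessage[] messages) {
            // Mark the job committed only if every message was already handled.
            jobCommitted = seenSoFar.size() == messages.length;
        }
    }

    /** Simulates the driver: forward each message on arrival, then commit the job. */
    static TrackingWriter runJob(String... paths) {
        TrackingWriter writer = new TrackingWriter();
        WriterCommitMessage[] messages = new WriterCommitMessage[paths.length];
        for (int i = 0; i < paths.length; i++) {
            messages[i] = new FileWritten(paths[i]);
            writer.onDataWriterCommit(messages[i]);
        }
        writer.commit(messages);
        return writer;
    }

    public static void main(String[] args) {
        TrackingWriter writer = runJob("part-00000", "part-00001");
        System.out.println(writer.seenSoFar + " committed=" + writer.jobCommitted);
    }
}
```

In this sketch the callback has an empty default body, so a writer that only cares about the final `commit(WriterCommitMessage[])` call would not need to change. Whether the actual patch takes that approach is an assumption here.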