Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148507385

    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@

       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();

       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    -   * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer
    -   * can commit successfully, or have a way to clean up the data of already-committed writers.
    +   * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
    +   * writer can commit, or have a way to clean up the data of already-committed writers.
        */
       void commit(WriterCommitMessage[] messages);

       /**
        * Aborts this writing job because some data writers are failed to write the records and aborted,
        * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])}
    -   * fails. If this method fails(throw exception), the underlying data source may have garbage that
    -   * need to be cleaned manually, but these garbage should not be visible to data source readers.
    +   * fails.
    +   *
    +   * If this method fails (by throwing an exception), the underlying data source may have garbage
    +   * that need to be cleaned manually.
    --- End diff --

    "may require manual cleanup". It could be more than just "garbage", which implies filesystem temp data... it could be tables in a database or similar.
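
    For illustration only, here is a minimal sketch of the "coordinator" option that the commit() javadoc above mentions for speculative tasks: at most one task attempt per partition is allowed to commit. The class name SketchCommitCoordinator and its methods are hypothetical and are not part of the DataSourceV2 API or of this pull request. It assumes a single in-process coordinator that attempts can query before committing.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical coordinator (not a Spark class): authorizes at most one
 * task attempt per partition to commit its output.
 */
final class SketchCommitCoordinator {
    // partition id -> attempt number currently authorized to commit
    private final ConcurrentMap<Integer, Integer> authorized = new ConcurrentHashMap<>();

    /**
     * Returns true for the first attempt that asks for a partition, and for
     * that same attempt if it asks again; any other attempt is denied.
     */
    boolean canCommit(int partitionId, int attemptNumber) {
        Integer winner = authorized.putIfAbsent(partitionId, attemptNumber);
        return winner == null || winner == attemptNumber;
    }

    /**
     * Releases the authorization when the authorized attempt failed before
     * committing, so that a later attempt can take over. Calls from attempts
     * that do not hold the authorization have no effect.
     */
    void attemptFailed(int partitionId, int attemptNumber) {
        authorized.remove(partitionId, attemptNumber);
    }
}
```

    A data writer would call canCommit(partitionId, attemptNumber) before making its output final and abort itself when the answer is false. This covers only the "only one data writer can commit" branch of the javadoc. It does not address what a failed job-level commit() or abort() leaves behind, which, as noted above, may be more than temporary files.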