GitHub user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148507385
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    -   * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer
    -   * can commit successfully, or have a way to clean up the data of already-committed writers.
    +   * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
    +   * writer can commit, or have a way to clean up the data of already-committed writers.
        */
       void commit(WriterCommitMessage[] messages);
     
       /**
        * Aborts this writing job because some data writers are failed to write the records and aborted,
        * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])}
    -   * fails. If this method fails(throw exception), the underlying data source may have garbage that
    -   * need to be cleaned manually, but these garbage should not be visible to data source readers.
    +   * fails.
    +   *
    +   * If this method fails (by throwing an exception), the underlying data source may have garbage
    +   * that need to be cleaned manually.
    --- End diff --
    
    Suggest "may require manual cleanup" here. What is left behind could be more than just
    "garbage", which implies filesystem temp data; it could be tables in a database or similar.
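
To illustrate the point, here is a minimal sketch, not Spark code. `Destination`, `TableWriter`, and the staging table names are hypothetical stand-ins, and plain strings take the place of `WriterCommitMessage`. It assumes a database-backed sink where each task stages its output in a table, and shows an `abort` that fails partway, leaving tables (not temp files) behind for manual cleanup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these types are Spark's.
class AbortCleanupSketch {

    /** State of a database-like destination: staged and published tables. */
    static final class Destination {
        final List<String> stagingTables = new ArrayList<>();
        final List<String> publishedTables = new ArrayList<>();
    }

    /** Mirrors the commit/abort shape under review, with table names as commit messages. */
    static final class TableWriter {
        private final Destination dest;
        private final boolean dropFails; // simulates a destination that rejects DROP

        TableWriter(Destination dest, boolean dropFails) {
            this.dest = dest;
            this.dropFails = dropFails;
        }

        /** Publishes every staged table named in the commit messages. */
        void commit(String[] messages) {
            for (String table : messages) {
                dest.stagingTables.remove(table);
                dest.publishedTables.add(table);
            }
        }

        /** Tries to drop the staged tables; may itself fail by throwing. */
        void abort(String[] messages) {
            for (String table : messages) {
                if (dropFails) {
                    throw new IllegalStateException("could not drop " + table);
                }
                dest.stagingTables.remove(table);
            }
        }
    }

    /** Runs abort and returns whatever is still left at the destination afterwards. */
    static List<String> leftoversAfterAbort(boolean dropFails) {
        Destination dest = new Destination();
        dest.stagingTables.add("staging_part_0");
        dest.stagingTables.add("staging_part_1");
        TableWriter writer = new TableWriter(dest, dropFails);
        try {
            writer.abort(new String[] {"staging_part_0", "staging_part_1"});
        } catch (IllegalStateException e) {
            // abort failed: the remaining tables now require manual cleanup
        }
        return new ArrayList<>(dest.stagingTables);
    }

    public static void main(String[] args) {
        System.out.println("clean abort leaves:  " + leftoversAfterAbort(false));
        System.out.println("failed abort leaves: " + leftoversAfterAbort(true));
    }
}
```

In the failing case the leftovers are database tables, which is why "may require manual cleanup" describes the outcome better than "garbage".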

