[GitHub] spark pull request #22009: [SPARK-24882][SQL] improve data source v2 API

rdblue Mon, 13 Aug 2018 11:30:35 -0700

Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22009#discussion_r209712363
  
    --- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/BatchWriteSupportProvider.java
 ---
    @@ -21,33 +21,39 @@
     
     import org.apache.spark.annotation.InterfaceStability;
     import org.apache.spark.sql.SaveMode;
    -import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
    +import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
     import org.apache.spark.sql.types.StructType;
     
     /**
      * A mix-in interface for {@link DataSourceV2}. Data sources can implement 
this interface to
    - * provide data writing ability and save the data to the data source.
    + * provide data writing ability for batch processing.
    + *
    + * This interface is used when end users want to use a data source 
implementation directly, e.g.
    + * {@code Dataset.write.format(...).option(...).save()}.
      */
     @InterfaceStability.Evolving
    -public interface WriteSupport extends DataSourceV2 {
    +public interface BatchWriteSupportProvider extends DataSourceV2 {
     
       /**
    -   * Creates an optional {@link DataSourceWriter} to save the data to this 
data source. Data
    +   * Creates an optional {@link BatchWriteSupport} to save the data to 
this data source. Data
        * sources can return None if there is no writing needed to be done 
according to the save mode.
        *
        * If this method fails (by throwing an exception), the action will fail 
and no Spark job will be
        * submitted.
        *
    -   * @param writeUUID A unique string for the writing job. It's possible 
that there are many writing
    -   *                  jobs running at the same time, and the returned 
{@link DataSourceWriter} can
    -   *                  use this job id to distinguish itself from other 
jobs.
    +   * @param queryId A unique string for the writing query. It's possible 
that there are many
    +   *                writing queries running at the same time, and the 
returned
    +   *                {@link BatchWriteSupport} can use this id to 
distinguish itself from others.
        * @param schema the schema of the data to be written.
        * @param mode the save mode which determines what to do when the data 
are already in this data
        *             source, please refer to {@link SaveMode} for more details.
        * @param options the options for the returned data source writer, which 
is an immutable
        *                case-insensitive string-to-string map.
    -   * @return a writer to append data to this data source
    +   * @return a write support to write data to this data source.
        */
    -  Optional<DataSourceWriter> createWriter(
    -      String writeUUID, StructType schema, SaveMode mode, 
DataSourceOptions options);
    +  Optional<BatchWriteSupport> createBatchWriteSupport(
    +      String queryId,
    +      StructType schema,
    +      SaveMode mode,
    --- End diff --
    
    I think that the v2 sources should only use the plans proposed in the SPIP. 
I also think that the v2 data source API should always tell the data source 
exactly what to do: for overwrite, what should be deleted and what should be 
added.
    
    That doesn't block fixing the v2 API here and doesn't prevent anyone from 
using it. But it would prevent people from relying on undefined behavior that 
results from passing an ambiguous `SaveMode` to a source.
    
    The only thing that would not be available by doing this is overwrite 
support by using the SaveMode, which isn't something anyone should rely on 
because it doesn't have defined behavior.
    
    I understand that this may seem like it would block migration from the v1 
API to the v2 API. If so, that's the right thing to do until we clearly define 
how v2 sources behave.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22009: [SPARK-24882][SQL] improve data source v2 API

Reply via email to