GitHub user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23208#discussion_r239596456
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
    @@ -241,32 +241,28 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
     
         assertNotBucketed("save")
     
    -    val cls = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
    -    if (classOf[DataSourceV2].isAssignableFrom(cls)) {
    -      val source = cls.getConstructor().newInstance().asInstanceOf[DataSourceV2]
    -      source match {
    -        case provider: BatchWriteSupportProvider =>
    -          val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
    -            source,
    -            df.sparkSession.sessionState.conf)
    -          val options = sessionOptions ++ extraOptions
    -
    +    val session = df.sparkSession
    +    val cls = DataSource.lookupDataSource(source, session.sessionState.conf)
    +    if (classOf[TableProvider].isAssignableFrom(cls)) {
    +      val provider = cls.getConstructor().newInstance().asInstanceOf[TableProvider]
    +      val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
    +        provider, session.sessionState.conf)
    +      val options = sessionOptions ++ extraOptions
    +      val dsOptions = new DataSourceOptions(options.asJava)
    +      provider.getTable(dsOptions) match {
    +        case table: SupportsBatchWrite =>
    +          val relation = DataSourceV2Relation.create(table, dsOptions)
    +          // TODO: revisit it. We should not create the `AppendData` operator for `SaveMode.Append`.
    +          // We should create new end-users APIs for the `AppendData` operator.
    --- End diff --
    
    The example in the referenced comment is this:
    
    ```
    spark.range(1).write.format("source").save("non-existent-path")
    ```
    
    If the path for a path-based table doesn't exist, then I think the table itself doesn't exist. And if the table doesn't exist, then the operation behind `save` should be CTAS, not AppendData.
    
    Here, I think the right behavior is to check whether the provider returns a 
table. If it doesn't, then the table doesn't exist and the plan should be CTAS. 
If it does, then it must provide the schema used to validate the AppendData 
operation. Since we don't currently have CTAS, this should throw an exception 
stating that the table doesn't exist and can't be created.
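To make the proposed dispatch concrete, here is a minimal sketch in Scala. The types below are invented stand-ins, not the real Spark interfaces (`Schema`, `planSave`, and the single-argument `getTable` are all simplifications), but the branching mirrors the behavior described above: a resolved table that supports batch writes yields AppendData validated against the table's schema, and a missing table fails loudly because CTAS is not yet available.

```scala
// Hypothetical stand-ins for the v2 interfaces; NOT the real Spark classes.
case class Schema(fields: Seq[String])

trait Table { def schema: Schema }
trait SupportsBatchWrite extends Table

// A provider may or may not resolve to an existing table.
trait TableProvider {
  def getTable(options: Map[String, String]): Option[Table]
}

sealed trait Plan
case class AppendData(table: Table, expectedSchema: Schema) extends Plan

// The behavior argued for above: a returned table means append (validated
// against its schema); no table means it doesn't exist, and without CTAS
// support the only correct option is to fail with a clear message.
def planSave(provider: TableProvider, options: Map[String, String]): Plan =
  provider.getTable(options) match {
    case Some(t: SupportsBatchWrite) =>
      AppendData(t, t.schema)
    case Some(_) =>
      throw new UnsupportedOperationException(
        "table exists but does not support batch writes")
    case None =>
      throw new IllegalArgumentException(
        "table does not exist and cannot be created (no CTAS support yet)")
  }
```

Once CTAS lands, only the `None` branch would change: it would plan a CreateTableAsSelect instead of throwing.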
    
    More generally, the meaning of SaveMode with v1 sources is not always reliable. I think the right approach is what @cloud-fan suggests: create a new write API for v2 tables that maps cleanly to the new logical plans (I've proposed one and would be happy to open a PR). Once the logical plans are in place, we can come back to this API and move it over to v2 where the behaviors match.
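The core idea of such an API is that each write method corresponds to exactly one logical plan, so there is no SaveMode to interpret. The sketch below uses invented names (`WriteBuilder`, `writeTo`, the plan case objects) purely to illustrate that one-to-one mapping; it is not the actual proposed interface.

```scala
// Each logical plan is an explicit, unambiguous operation.
sealed trait LogicalWrite
case object Append extends LogicalWrite
case object OverwriteByFilter extends LogicalWrite
case object CreateTableAsSelect extends LogicalWrite

// A builder where the method called determines the plan, with no
// SaveMode flag to reinterpret per source.
class WriteBuilder(table: String) {
  def append(): LogicalWrite = Append
  def overwrite(): LogicalWrite = OverwriteByFilter
  def create(): LogicalWrite = CreateTableAsSelect
}

def writeTo(table: String): WriteBuilder = new WriteBuilder(table)
```

Under this shape, the "non-existent path" case from the example above is unambiguous: `append()` on a missing table is an error, and creating the table requires an explicit `create()` call.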


---
