GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19474

    [SPARK-22252][SQL] FileFormatWriter should respect the input query schema

    ## What changes were proposed in this pull request?
    
    In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` 
to have children in order to fix some UI issues. We then made the `InsertIntoXXX` 
commands take the input `query` as a child. When we do the actual writing, we 
just pass the physical plan to the writer (`FileFormatWriter.write`).
    
    However, this is problematic. In Spark SQL, the optimizer and planner are 
allowed to change the schema names slightly. For example, the `ColumnPruning` rule 
removes no-op `Project`s, like `Project("A", Scan("a"))`, and thus changes the 
output schema from `<A: int>` to `<a: int>`. When it comes to writing, 
especially for self-describing data formats like Parquet, we may write the 
wrong schema to the file and cause null values on the read path.
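
    The effect can be pictured with a small toy model. This is only a sketch in plain Python, not Spark code: `Scan`, `Project`, `remove_noop_project`, `write_records` and `read_column` are hypothetical stand-ins, and the reader is assumed to look up columns case-sensitively.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Scan:
    """Leaf node: exposes columns under the names stored in the source."""
    columns: List[str]

    def output(self) -> List[str]:
        return list(self.columns)


@dataclass(frozen=True)
class Project:
    """Selects columns from its child, possibly with different casing."""
    columns: List[str]
    child: "Plan"

    def output(self) -> List[str]:
        return list(self.columns)


Plan = Union[Scan, Project]


def remove_noop_project(plan: Plan) -> Plan:
    """Toy stand-in for the pruning step: drop a Project whose output
    matches its child's output when names are compared ignoring case."""
    if isinstance(plan, Project):
        child = remove_noop_project(plan.child)
        same = [c.lower() for c in plan.columns] == [c.lower() for c in child.output()]
        return child if same else Project(plan.columns, child)
    return plan


def write_records(names: List[str], rows: List[Tuple[int, ...]]) -> List[Dict[str, int]]:
    """Pretend self-describing file: every record carries its field names."""
    return [dict(zip(names, row)) for row in rows]


def read_column(records: List[Dict[str, int]], name: str) -> List[Optional[int]]:
    """Assumed case-sensitive lookup: a missing field reads back as None."""
    return [record.get(name) for record in records]


analyzed = Project(["A"], Scan(["a"]))      # schema the user asked for
optimized = remove_noop_project(analyzed)   # schema after the toy rule

# Writing with the optimized plan's names stores "a" instead of "A".
written = write_records(optimized.output(), [(1,), (2,)])

print(analyzed.output())          # ['A']
print(optimized.output())         # ['a']
print(read_column(written, "A"))  # [None, None]
```

    In this model the data itself is intact; only the field name stored with it has drifted from what the reader expects.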
    
    Fortunately, in https://github.com/apache/spark/pull/18450, we decided to 
allow nested execution, so one query can map to multiple executions in the UI. 
This removes the major restriction that motivated #18064, and now we don't have to take 
the input `query` as a child of the `InsertIntoXXX` commands.
    
    So the fix is simple: this PR partially reverts #18064 and makes the 
`InsertIntoXXX` commands leaf nodes again.
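
    The intent of "respecting the input query schema" can be sketched as follows. This is a plain-Python illustration under assumptions, not the actual `FileFormatWriter` code: `write_with_query_schema` and its parameters are hypothetical names, and columns are assumed to line up positionally between the query schema and the physical plan output.

```python
from typing import Dict, List, Tuple


def write_with_query_schema(
    query_columns: List[str],
    physical_columns: List[str],
    rows: List[Tuple[int, ...]],
) -> List[Dict[str, int]]:
    """Hypothetical writer: values come from the physical plan's rows,
    but field names are taken from the input query's schema."""
    if len(query_columns) != len(physical_columns):
        raise ValueError("query schema and physical output differ in width")
    # Names from the query, values by position from the executed plan.
    return [dict(zip(query_columns, row)) for row in rows]


query_columns = ["A"]      # names the input query declares
physical_columns = ["a"]   # names left after optimization (assumed)
rows = [(1,), (2,)]

records = write_with_query_schema(query_columns, physical_columns, rows)
print(records)  # [{'A': 1}, {'A': 2}]
```

    Because the names no longer depend on what the optimizer or planner did, a reader that looks up `A` finds the values instead of nulls.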
     
    ## How was this patch tested?
    
    Added a new regression test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19474
    
----
commit 3b1174f7e1ed9caae890936ceeb4fb54e58eadcc
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-10-11T14:30:38Z

    FileFormatWriter should respect the input query schema

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
