[GitHub] spark pull request #22346: [branch-2.3][SPARK-25313][SQL] Fix regression in ...

gengliangwang Wed, 05 Sep 2018 23:43:08 -0700

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/22346


    [branch-2.3][SPARK-25313][SQL] Fix regression in FileFormatWriter output names

    Port https://github.com/apache/spark/pull/22320 to branch-2.3
    ## What changes were proposed in this pull request?
    
    Consider the following example:
    ```
            val location = "/tmp/t"
            val df = spark.range(10).toDF("id")
            df.write.format("parquet").saveAsTable("tbl")
            spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
            spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
            spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
            println(spark.read.parquet(location).schema)
            spark.table("tbl2").show()
    ```
    The output column name in the written schema will be `id` instead of `ID`, so the last query returns nothing from `tbl2`.
    Enabling the debug messages shows that the output name is changed from `ID` to `id`, and that the `outputColumns` in `InsertIntoHadoopFsRelationCommand` are then changed by `RemoveRedundantAliases`.
    
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)
    
    
![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)
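
The symptom described above (files written with `id` while the table declares `ID`) can be modeled in plain Scala. This is only a sketch: `CaseMismatchSketch`, `readColumn`, and `readColumnIgnoreCase` are hypothetical names, not Spark APIs, and it assumes the reader resolves declared column names case-sensitively against the names stored in the file.

```scala
// Hypothetical model, not Spark's reader: a written file is a map from
// column name to values, and the table declares the name it expects.
object CaseMismatchSketch {
  type FileColumns = Map[String, Seq[Long]]

  // Case-sensitive resolution: a declared name matches only an identical file column.
  def readColumn(file: FileColumns, declared: String): Option[Seq[Long]] =
    file.get(declared)

  // Case-insensitive resolution, shown only for contrast.
  def readColumnIgnoreCase(file: FileColumns, declared: String): Option[Seq[Long]] =
    file.collectFirst { case (name, values) if name.equalsIgnoreCase(declared) => values }

  def main(args: Array[String]): Unit = {
    // The file was written with the lower-case name picked up from the view.
    val written: FileColumns = Map("id" -> Seq.range(0L, 10L))
    // The table schema asks for the upper-case name, so the strict lookup finds nothing.
    println(readColumn(written, "ID"))
    println(readColumnIgnoreCase(written, "ID").map(_.size))
  }
}
```

In this model the strict lookup is the one that comes back empty, which is why the name written to disk has to match the declared table schema exactly.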
    
    **To guarantee correctness**, we should change the output columns from `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by the optimizer.
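
A minimal sketch of why holding plain names is safer than holding attributes. Everything here is a simplified, hypothetical model (`Attr`, `rewrite`, `WriteWithAttrs`, `WriteWithNames` are illustrative names, not the Catalyst classes or the code in this patch): an optimizer-like rewrite can replace attributes that share an id, but it has no reason to touch a `Seq[String]`.

```scala
object OutputNamesSketch {
  // Hypothetical stand-in for a resolved column reference: identified by id, labelled by name.
  final case class Attr(name: String, id: Int)

  // Hypothetical stand-in for an optimizer rule: each attribute is replaced by the
  // child's attribute with the same id, which discards the user-facing spelling.
  def rewrite(attrs: Seq[Attr], childOutput: Seq[Attr]): Seq[Attr] =
    attrs.map(a => childOutput.find(_.id == a.id).getOrElse(a))

  // Before: the write command keeps attributes, so the rewrite reaches them.
  final case class WriteWithAttrs(outputColumns: Seq[Attr]) {
    def optimized(childOutput: Seq[Attr]): WriteWithAttrs =
      copy(outputColumns = rewrite(outputColumns, childOutput))
    def writtenNames: Seq[String] = outputColumns.map(_.name)
  }

  // After: the command keeps plain strings, which are not attributes and are left alone.
  final case class WriteWithNames(outputColumnNames: Seq[String]) {
    def optimized(childOutput: Seq[Attr]): WriteWithNames = this
    def writtenNames: Seq[String] = outputColumnNames
  }

  def main(args: Array[String]): Unit = {
    val requested = Seq(Attr("ID", 1)) // spelling declared by the target table
    val child = Seq(Attr("id", 1))     // spelling produced by the view
    println(WriteWithAttrs(requested).optimized(child).writtenNames)
    println(WriteWithNames(requested.map(_.name)).optimized(child).writtenNames)
  }
}
```

In this model the attribute-based command ends up writing the child's spelling, while the string-based command keeps the spelling that was requested.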
    
    I will fix the rules related to project elimination in https://github.com/apache/spark/pull/22311 after this one.
    
    ## How was this patch tested?
    
    Unit test.
    
    Closes #22320 from gengliangwang/fixOutputSchema.
    
    Authored-by: Gengliang Wang <gengliang.w...@databricks.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark portSchemaOutputName2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22346
    
----
commit 470efdc2924553f941165f5fe84d930a5265081c
Author: Gengliang Wang <gengliang.wang@...>
Date:   2018-09-06T02:37:52Z

    [SPARK-25313][SQL] Fix regression in FileFormatWriter output names
    

----


