GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/16326

    [SPARK-18915] [SQL] Automatic Table Repair when Creating a Partitioned Data Source Table with a Specified Path

    ### What changes were proposed in this pull request?
    In Spark 2.1 (where `spark.sql.hive.manageFilesourcePartitions` defaults to `true`), if we create a partitioned data source table with a specified path, querying it returns nothing. To get the data, we have to manually issue a DDL to repair the table.
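
    The manual workaround is a repair DDL like the one below (a minimal sketch; `newTab` refers to the table created in the repro further down):

    ```Scala
    // Minimal sketch of the manual workaround: recover the partitions of a
    // partitioned data source table that was created over an existing path.
    spark.sql("MSCK REPAIR TABLE newTab")
    // equivalently:
    // spark.sql("ALTER TABLE newTab RECOVER PARTITIONS")
    spark.table("newTab").show()  // the data is visible after the repair
    ```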
    
    In Spark 2.0, the same query returns the data stored in the specified path without repairing the table. In Spark 2.1, setting `spark.sql.hive.manageFilesourcePartitions` to `false` restores the Spark 2.0 behavior.
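
    For reference, a minimal sketch of opting out via that flag, set at session-build time:

    ```Scala
    import org.apache.spark.sql.SparkSession

    // Sketch: disable catalog-managed file source partitions so that
    // partitioned data source tables behave as in Spark 2.0.
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("spark.sql.hive.manageFilesourcePartitions", "false")
      .getOrCreate()
    ```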
    
    Below is the output from Spark 2.1:
    ```Scala
    scala> spark.range(5).selectExpr("id as fieldOne", "id as partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")

    scala> spark.sql("desc formatted test").show(50, false)
    +----------------------------+----------------------------------------------------------------------+-------+
    |col_name                    |data_type                                                             |comment|
    +----------------------------+----------------------------------------------------------------------+-------+
    ...
    |Location:                   |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test|       |
    |Table Type:                 |MANAGED                                                               |       |
    ...
    |Partition Provider:         |Catalog                                                               |       |
    +----------------------------+----------------------------------------------------------------------+-------+

    scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path 'file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test') partitioned by (partCol)")
    res3: org.apache.spark.sql.DataFrame = []

    scala> spark.table("newTab").show()
    +--------+-------+
    |fieldOne|partCol|
    +--------+-------+
    +--------+-------+
    ```
    
    This PR makes the behavior consistent with Spark 2.0, no matter whether `spark.sql.hive.manageFilesourcePartitions` is `true` or `false`: the table is repaired automatically when it is created. After the change, the behavior also matches what we already do for CTAS of partitioned data source tables.
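
    For comparison, here is a sketch of the CTAS path (`ctasTab` is a hypothetical table name), which registers its partitions as part of table creation and is queryable immediately:

    ```Scala
    // Sketch: CTAS of a partitioned data source table. Partitions are
    // registered as part of table creation, so no repair is needed.
    spark.sql("""
      CREATE TABLE ctasTab
      USING parquet
      PARTITIONED BY (partCol)
      AS SELECT id AS fieldOne, id AS partCol FROM range(5)
    """)
    spark.table("ctasTab").show()  // returns the five rows right away
    ```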
    
    ### How was this patch tested?
    Modified the existing test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark testtt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16326
    
----
commit 40abcc281344923b33886243c83358f5084c2489
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-12-18T03:55:32Z

    fix.

commit 3942c4ea53b199a476855a3f39087d893a4e900a
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-12-18T03:56:51Z

    fix.

----

