[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

skambha Tue, 14 Nov 2017 09:05:08 -0800

GitHub user skambha opened a pull request:

    https://github.com/apache/spark/pull/19747


    [Spark-22431][SQL]  Ensure that the datatype in the schema for the 
table/view metadata is parseable by Spark before persisting it

    ## What changes were proposed in this pull request?
    * JIRA: [SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431): 
Creating Permanent view with illegal type
    
    **Description:** 
    - It is possible in Spark SQL to create a permanent view that uses a 
nested field with an illegal name.
    - For example, if we create the following view:
    ```create view x as select struct('a' as `$q`, 1 as b) q```
    - A simple select fails with the following exception:
    
    ```
    select * from x;
    
    org.apache.spark.SparkException: Cannot recognize hive type string: 
struct<$q:string,b:int>
      at 
org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
      at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
      at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
    ...
    ```
    **Issue/Analysis**: Right now, we can create a view with a schema that 
cannot be read back by Spark from the Hive metastore.  For more details, please 
see the analysis and the proposed fix options in comment 1 and comment 2 of 
[SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431).
    
    **Proposed changes**: 
    - Fix the Hive table/view code path to check whether the schema datatype is 
parseable by Spark before persisting it in the metastore. This change is 
localized to HiveClientImpl and performs a check similar to the one in 
fromHiveColumn. This is fail-fast and avoids the scenario where we write 
something to the metastore that we are unable to read back.  
    - Added new unit tests
    - Ran the SQL-related unit test suites (hive/test, sql/test, 
catalyst/test): OK
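
    For illustration only, a minimal sketch of the kind of fail-fast check 
described above. It is not the code in this patch: the object and method names 
(`SchemaCheckSketch`, `verifyParseable`) are hypothetical, and it assumes that 
`catalogString` is the type text that ends up in the metastore. It relies on 
`CatalystSqlParser.parseDataType` raising a `ParseException` for a type string 
Spark cannot read.

    ```scala
    import org.apache.spark.SparkException
    import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper for illustration; not the method added by this patch.
    object SchemaCheckSketch {
      // Throws if the type string of any top-level column cannot be parsed by Spark.
      def verifyParseable(schema: StructType): Unit = {
        schema.foreach { field =>
          // Assumption: this is the type text that would be written to the metastore.
          val typeString = field.dataType.catalogString
          try {
            // Round-trip check: try to parse the string we are about to persist.
            CatalystSqlParser.parseDataType(typeString)
          } catch {
            case e: ParseException =>
              throw new SparkException(
                s"Cannot recognize the data type: $typeString", e)
          }
        }
      }
    }
    ```

    Calling such a helper before the create/alter call to the metastore would 
reject a type string like `struct<$q:string,b:int>` at write time instead of at 
the first read.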
    
    With the fix: 
    ```
    create view x as select struct('a' as `$q`, 1 as b) q;
    17/11/14 19:16:03 ERROR SparkSQLDriver: Failed in [create view x as select 
struct('a' as `$q`, 1 as b) q]
    org.apache.spark.SparkException: Cannot recognize the data type: 
struct<$q:string,b:int>
        at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:907)
        at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:901)
        at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    ```
    ## How was this patch tested?
    - New unit tests have been added. 
    
    @hvanhovell, Please review and share your thoughts/comments.  Thank you so 
much.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/skambha/spark spark22431

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19747.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19747
    
----
commit c5824feb40af633ab480b311495ecb7737705c3a
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T12:38:17Z

    Add check to ensure that the schema col datatype is parseable before 
persisting to metastore, and add unit tests

commit ce474b7b028bba45c8bd29c31308503626baafbc
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T16:02:00Z

    Add : in error message

commit d5b553438d8740716e402c0210e3d121a48c2c64
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T16:07:28Z

    Remove empty line

commit 626703310aa269a9351a2cf7b6ce23f8e4ab095a
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T16:20:06Z

    remove empty line

----
