[GitHub] spark pull request: [SPARK-13080] [SQL] Implement new Catalog API ...

rxin Sun, 21 Feb 2016 12:57:06 -0800

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/11293


    [SPARK-13080] [SQL] Implement new Catalog API using Hive

    ## What changes were proposed in this pull request?
    
    This is a step towards merging `SQLContext` and `HiveContext`. A new 
internal Catalog API was introduced in #10982 and extended in #11069. This 
patch introduces an implementation of this API using `HiveClient`, an existing 
interface to Hive. It also extends `HiveClient` with additional calls to Hive 
that are needed to complete the catalog implementation.
    
    *Where should I start reviewing?* The new catalog introduced is 
`HiveCatalog`. This class is relatively simple because it just calls 
`HiveClientImpl`, where most of the new logic is. I would not start with 
`HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly 
because of a refactor.
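
A minimal sketch of the delegation described above, not the actual Spark code. All types are simplified stand-ins: the method signatures are assumptions, and `FakeHiveClient` is a hypothetical in-memory substitute so the example runs without a Hive metastore.

```scala
// Simplified stand-in; the real class carries more fields.
case class CatalogDatabase(name: String, description: String)

// Hypothetical, trimmed-down client interface (signatures are assumptions).
trait HiveClient {
  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def getDatabaseOption(name: String): Option[CatalogDatabase]
}

// In-memory fake standing in for the real Hive-backed client.
class FakeHiveClient extends HiveClient {
  private val dbs = scala.collection.mutable.Map.empty[String, CatalogDatabase]

  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit = {
    if (dbs.contains(db.name)) {
      if (!ignoreIfExists) {
        throw new IllegalStateException(s"Database ${db.name} already exists")
      }
    } else {
      dbs(db.name) = db
    }
  }

  override def getDatabaseOption(name: String): Option[CatalogDatabase] = dbs.get(name)
}

// Trimmed-down catalog API with only two of the database operations.
trait Catalog {
  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def databaseExists(name: String): Boolean
}

// The catalog itself holds no logic; each call is forwarded to the client.
class HiveCatalog(client: HiveClient) extends Catalog {
  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit =
    client.createDatabase(db, ignoreIfExists)

  override def databaseExists(name: String): Boolean =
    client.getDatabaseOption(name).isDefined
}
```

Keeping the catalog this thin means the Hive-specific behavior lives in one place, the client implementation, which is why that is the file to read first.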
    
    *Why is this patch so big?* I had to refactor `HiveClient` to remove an 
intermediate representation of databases, tables, partitions, etc. After this 
refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). 
Otherwise we would have to first convert `CatalogTable` to the intermediate 
representation and then convert that to `HiveTable`, which is messy.
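
A rough sketch of what a direct, single-step conversion could look like. The case classes are simplified stand-ins, `FakeHiveTable` is a hypothetical substitute for Hive's table metadata object, and `toHiveTable` / `fromHiveTable` are illustrative names rather than the methods in this patch.

```scala
// Simplified stand-ins for illustration only; not the real Spark or Hive classes.
case class CatalogColumn(name: String, dataType: String)
case class CatalogTable(database: String, name: String, schema: Seq[CatalogColumn])

// Mimics a mutable, Java-bean-style Hive metadata object.
class FakeHiveTable(val dbName: String, val tableName: String) {
  private var cols: List[(String, String)] = Nil
  def setFields(fields: List[(String, String)]): Unit = { cols = fields }
  def getFields: List[(String, String)] = cols
}

object TableConversions {
  // CatalogTable -> Hive representation in one step, with no intermediate type.
  def toHiveTable(table: CatalogTable): FakeHiveTable = {
    val hiveTable = new FakeHiveTable(table.database, table.name)
    hiveTable.setFields(table.schema.map(c => (c.name, c.dataType)).toList)
    hiveTable
  }

  // Hive representation -> CatalogTable, again in one step.
  def fromHiveTable(hiveTable: FakeHiveTable): CatalogTable =
    CatalogTable(
      hiveTable.dbName,
      hiveTable.tableName,
      hiveTable.getFields.map { case (n, t) => CatalogColumn(n, t) })
}
```

With only two representations there is a single pair of conversion functions to keep in sync, and a round trip through both should return the original table.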
    
    The new class hierarchy is as follows:
    ```
    org.apache.spark.sql.catalyst.catalog.Catalog
      - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
      - org.apache.spark.sql.hive.HiveCatalog
    ```
    
    Note that, as of this patch, none of these classes is used anywhere yet. 
That will come in a later change, before the Spark 2.0 release.
    
    
    ## How was this patch tested?
    All existing unit tests, plus `HiveCatalogSuite`, which extends `CatalogTestCases`.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark hive-catalog

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11293
    
----
commit 3b6660578f23c69abfb59fae6796ee10bf4d482d
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:16:30Z

    Add skeleton for HiveCatalog

commit f3e094ad21bd38d400f90b93898995182a508e9b
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:34:36Z

    Implement createDatabase

commit 4b09a7da8ddcc17a813e494d868a6ea55f01cd2e
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:48:00Z

    Fix style

commit 526f278d78664c49572fd1b48495ca99d12d1896
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:59:02Z

    Implement dropDatabase

commit 4aa6e66b5ee9fa2e5f8e4b9955ed98de5b35a57c
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T22:06:08Z

    Implement alterDatabase

commit 433d180260c57a905e226f0b8686eeb92d5dc938
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T22:14:15Z

    Implement getDatabase, listDatabases and databaseExists

commit ff5c5bea8d4d84ae56acd4caf225e59231b946ba
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:18:53Z

    Implement createTable
    
    This required converting o.a.s.sql.catalyst.catalog.Table to its
    counterpart in o.a.s.sql.hive.client.HiveTable, which in turn required
    making o.a.s.sql.hive.client.TableType an enum because we need
    to create one of these from its name.

commit ff49f0cf6fabc645121b43b5746017c838a3551d
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:22:38Z

    Explicitly mark methods with override in HiveCatalog

commit ca98c00264564717ddd427282bfff301ebdb6c70
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:25:27Z

    Implement dropTable

commit 71f99646cdf30a68a8e592b80ef5a6f40685551b
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:40:37Z

    Implement renameTable, alterTable

commit 13795d83c325a69fb35260c300b379e2e55725aa
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T00:51:36Z

    Remove intermediate representation of tables, columns etc.
    
    Currently there's the catalog table, the Spark table used in the
    hive module, and the Hive table. To avoid converting back and forth
    between these table representations, we kill the intermediate one,
    which is the one currently used throughout HiveClient and friends.

commit af5ffc0ee84f3dc3c2b9249228293ae7285f916e
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T01:34:24Z

    Remove TableType enum
    
    Instead, this commit introduces CatalogTableType that serves
    the same purpose. This adds some type-safety and keeps the code
    clean.

commit d7b18e628374659f0a792d5c5a9154711fc9073b
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T01:48:30Z

    Re-implement all table operations after the refactor

commit a915d01eac651994c4d69b961299b476fe40f77d
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T20:50:39Z

    Implement all partition operations

commit 3ceb88d51a6e6af92cff2e90622ba235d0d107e9
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T22:04:45Z

    Implement all function operations

commit 07332ad6803e578d9a61cc4693d8ce665ad8c29a
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T22:10:33Z

    Simplify alterDatabase
    
    The operation doesn't support renaming anyway, so it doesn't
    make sense to pass in a name AND a CatalogDatabase that always
    has the same name.

commit cdf1f70479a6ac588249cea221b602e07d936892
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T22:15:55Z

    Clean up HiveClientImpl a little

commit bbb81701602f97b5df43f074e33ab2a1d261926c
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T23:06:12Z

    Merge branch 'master' of github.com:apache/spark into hive-catalog

commit 2b720256a319c9f9709801cb690f61cf1dbd0ace
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T23:13:01Z

    Fix tests?

commit 5e2cd3afe77333ee586cb0fdfe962856b1ba2e84
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T23:54:32Z

    Miscellaneous cleanup

commit 6519c2a8bf5e4dc8067bedad86e04a4cef0bc24f
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T19:03:53Z

    Merge branch 'master' of github.com:apache/spark into hive-catalog

commit 7d58fac540694f21279f221b4fae489c6b4d1933
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T22:17:15Z

    Address comments + minor cleanups

commit 1c05b9b3ce677a62062f1d90f861b20398ab42a4
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T22:33:13Z

    Fix wrong Hive TableType issue
    
    We used to pass CatalogTableType#toString into HiveTable, which
    fails later when Hive extracts the Java enum value from the
    string. This was the cause of test failures in a few test suites:
    
    - InsertIntoHiveTableSuite
    - MultiDatabaseSuite
    - ParquetMetastoreSuite
    - ...
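
A small sketch of this failure mode, under assumptions: `HiveTableType` is a stand-in built on `scala.Enumeration` (its `withName` plays the role of a Java enum's by-name lookup), and `CatalogTableType` is reduced to a case class wrapping a name. It is not the code from this commit.

```scala
// Stand-in for Hive's Java enum of table types.
object HiveTableType extends Enumeration {
  val MANAGED_TABLE = Value("MANAGED_TABLE")
  val EXTERNAL_TABLE = Value("EXTERNAL_TABLE")
  val VIRTUAL_VIEW = Value("VIRTUAL_VIEW")
  val INDEX_TABLE = Value("INDEX_TABLE")
}

// Hypothetical simplification of a catalog-side table type that wraps a name.
case class CatalogTableType(name: String)

object TableTypeDemo {
  val managed = CatalogTableType("MANAGED_TABLE")

  // The case-class toString includes the class name, so the lookup fails.
  def lookupViaToString: Option[HiveTableType.Value] =
    scala.util.Try(HiveTableType.withName(managed.toString)).toOption

  // Passing the bare name resolves to the expected constant.
  def lookupViaName: Option[HiveTableType.Value] =
    scala.util.Try(HiveTableType.withName(managed.name)).toOption
}
```

The point is only that a by-name enum lookup needs the exact constant name, so whatever string is handed to Hive has to be the bare name rather than a decorated `toString`.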

commit 4ecc3b1245998d2c9743840d1243ec55770db1a9
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T22:55:25Z

    Fix CREATE TABLE serde setting
    
    Blatant programming mistake. This was caught by
    hive.execution.SQLQuerySuite.

commit 863ebd095e7c36c740ad88ec671522a4550f0273
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T23:22:54Z

    Fix NPE in CREATE VIEW
    
    When we create views using HiveQl we pass in null data types
    because we can't specify these types until later. This caused
    an NPE downstream.

commit 539449215ebfc3df5d7b13fbd4808f7e37d20d77
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-17T21:32:36Z

    Change CatalogColumn#dataType to String
    
    This fixes a failing test in HiveCompatibilitySuite, where Spark
    was ignoring the character limit in varchar but Hive respected it.
    The issue was that we were converting Hive types to and from
    Spark DataType, and in the process losing the limit information.
    
    Instead of doing this conversion, we simply encode the data type
    as a string so we don't lose any information. This means less
    type safety but the real fix is outside the scope of this patch.
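
To illustrate the information loss, here is a toy sketch. `SimpleDataType` and `TypeRoundTrip` are hypothetical stand-ins for a type system with no length-aware varchar; they are not Spark's `DataType` classes.

```scala
// Stand-in type system without a length-aware varchar, for illustration only.
sealed trait SimpleDataType { def typeName: String }
case object SimpleStringType extends SimpleDataType { val typeName = "string" }
case object SimpleIntType extends SimpleDataType { val typeName = "int" }

object TypeRoundTrip {
  // Parsing collapses every varchar(n) to a plain string type, dropping n.
  def parse(hiveType: String): SimpleDataType =
    if (hiveType.startsWith("varchar") || hiveType == "string") SimpleStringType
    else if (hiveType == "int") SimpleIntType
    else throw new IllegalArgumentException(s"Unsupported type in this sketch: $hiveType")

  // Lossy path: Hive type string -> data type -> Hive type string.
  def viaDataType(hiveType: String): String = parse(hiveType).typeName

  // Lossless path: keep the original string untouched.
  def viaString(hiveType: String): String = hiveType
}
```

In this toy model, `varchar(10)` survives only on the string path, which mirrors the trade-off described above: the limit is preserved at the cost of carrying an unchecked string.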

commit fe295fb6899be00eb8a37eceb6c996cf0794ff2c
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-17T22:23:54Z

    Fix style

commit 43e3c66057d37c45db7392c6793baeef05b05039
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-18T00:32:59Z

    Fix MetastoreDataSourcesSuite
    
    I missed one place where the data type was still a DataType
    rather than a string.

commit 27656491561a918e4e5bec7f44ef946ef825dc19
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-18T03:04:57Z

    Add HiveCatalogSuite
    
    This suite extends the existing CatalogTestCases. Many tests
    needed to be modified significantly for Hive to work. Even after
    many hours spent on trying to make this work, there is still one
    that doesn't pass for some reason. In particular, I was not able
    to call "alterPartitions" on an existing Hive table as of this
    commit. That test is ignored for now. The rest of the
    tests added in this commit should pass.

commit 428c3c5cb875d2a160093a5d71f9634c2b0cb6aa
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-18T03:07:17Z

    Merge branch 'master' of github.com:apache/spark into hive-catalog

----


