GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/11293
[SPARK-13080] [SQL] Implement new Catalog API using Hive

## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?*

The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?*

I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```

Note that, as of this patch, none of these classes are used anywhere yet. This will come in the future before the Spark 2.0 release.

## How was this patch tested?

All existing unit tests, and `HiveCatalogSuite`, which extends `CatalogTestCases`.

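For reviewers who want a mental model of the hierarchy described in the pull request, here is a minimal, self-contained sketch. It is not the code in this patch. The method names `createDatabase`, `getDatabase`, `databaseExists` and `listDatabases` come from the commit messages, but their signatures, the fields of `CatalogDatabase`, and the `HiveClientLike` and `StubClient` types are assumptions made purely for illustration.

```scala
import scala.collection.mutable

// Simplified stand-in for database metadata; the fields are assumed.
case class CatalogDatabase(name: String, description: String)

// Pared-down version of the internal catalog interface (signatures assumed).
trait Catalog {
  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def getDatabase(name: String): CatalogDatabase
  def databaseExists(name: String): Boolean
  def listDatabases(): Seq[String]
}

// Keeps all metadata in a local map.
class InMemoryCatalog extends Catalog {
  private val dbs = mutable.LinkedHashMap.empty[String, CatalogDatabase]

  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit = {
    if (!dbs.contains(db.name)) {
      dbs(db.name) = db
    } else if (!ignoreIfExists) {
      throw new IllegalArgumentException(s"Database ${db.name} already exists")
    }
  }

  override def getDatabase(name: String): CatalogDatabase =
    dbs.getOrElse(name, throw new NoSuchElementException(s"No such database: $name"))

  override def databaseExists(name: String): Boolean = dbs.contains(name)

  override def listDatabases(): Seq[String] = dbs.keys.toSeq
}

// Hypothetical client interface standing in for the existing interface to Hive.
trait HiveClientLike {
  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def getDatabaseOption(name: String): Option[CatalogDatabase]
  def listDatabases(): Seq[String]
}

// The Hive-backed catalog stays thin: each call is forwarded to the client.
class HiveCatalog(client: HiveClientLike) extends Catalog {
  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit =
    client.createDatabase(db, ignoreIfExists)

  override def getDatabase(name: String): CatalogDatabase =
    client.getDatabaseOption(name)
      .getOrElse(throw new NoSuchElementException(s"No such database: $name"))

  override def databaseExists(name: String): Boolean =
    client.getDatabaseOption(name).isDefined

  override def listDatabases(): Seq[String] = client.listDatabases()
}

// Stub client so the sketch runs without a metastore; it reuses the in-memory catalog.
class StubClient extends HiveClientLike {
  private val backing = new InMemoryCatalog

  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit =
    backing.createDatabase(db, ignoreIfExists)

  override def getDatabaseOption(name: String): Option[CatalogDatabase] =
    if (backing.databaseExists(name)) Some(backing.getDatabase(name)) else None

  override def listDatabases(): Seq[String] = backing.listDatabases()
}
```

Because both implementations share one trait, a single set of test cases can in principle exercise either of them, which is the idea behind having `HiveCatalogSuite` extend `CatalogTestCases`.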
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark hive-catalog

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11293

----
commit 3b6660578f23c69abfb59fae6796ee10bf4d482d
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:16:30Z

    Add skeleton for HiveCatalog

commit f3e094ad21bd38d400f90b93898995182a508e9b
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:34:36Z

    Implement createDatabase

commit 4b09a7da8ddcc17a813e494d868a6ea55f01cd2e
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:48:00Z

    Fix style

commit 526f278d78664c49572fd1b48495ca99d12d1896
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T21:59:02Z

    Implement dropDatabase

commit 4aa6e66b5ee9fa2e5f8e4b9955ed98de5b35a57c
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T22:06:08Z

    Implement alterDatabase

commit 433d180260c57a905e226f0b8686eeb92d5dc938
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T22:14:15Z

    Implement getDatabase, listDatabases and databaseExists

commit ff5c5bea8d4d84ae56acd4caf225e59231b946ba
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:18:53Z

    Implement createTable

    This required converting o.a.s.sql.catalyst.catalog.Table to its
    counterpart in o.a.s.sql.hive.client.HiveTable. This required
    making o.a.s.sql.hive.client.TableType an enum because we need
    to create one of these from name.

commit ff49f0cf6fabc645121b43b5746017c838a3551d
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:22:38Z

    Explicitly mark methods with override in HiveCatalog

commit ca98c00264564717ddd427282bfff301ebdb6c70
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:25:27Z

    Implement dropTable

commit 71f99646cdf30a68a8e592b80ef5a6f40685551b
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-10T23:40:37Z

    Implement renameTable, alterTable

commit 13795d83c325a69fb35260c300b379e2e55725aa
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T00:51:36Z

    Remove intermediate representation of tables, columns etc.

    Currently there's the catalog table, the Spark table used in the
    hive module, and the Hive table. To avoid converting back and
    forth between these table representations, we kill the
    intermediate one, which is the one currently used throughout
    HiveClient and friends.

commit af5ffc0ee84f3dc3c2b9249228293ae7285f916e
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T01:34:24Z

    Remove TableType enum

    Instead, this commit introduces CatalogTableType that serves
    the same purpose. This adds some type-safety and keeps the code
    clean.

commit d7b18e628374659f0a792d5c5a9154711fc9073b
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T01:48:30Z

    Re-implement all table operations after the refactor

commit a915d01eac651994c4d69b961299b476fe40f77d
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T20:50:39Z

    Implement all partition operations

commit 3ceb88d51a6e6af92cff2e90622ba235d0d107e9
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T22:04:45Z

    Implement all function operations

commit 07332ad6803e578d9a61cc4693d8ce665ad8c29a
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T22:10:33Z

    Simplify alterDatabase

    The operation doesn't support renaming anyway, so it doesn't
    make sense to pass in a name AND a CatalogDatabase that always
    has the same name.

commit cdf1f70479a6ac588249cea221b602e07d936892
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T22:15:55Z

    Clean up HiveClientImpl a little

commit bbb81701602f97b5df43f074e33ab2a1d261926c
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T23:06:12Z

    Merge branch 'master' of github.com:apache/spark into hive-catalog

commit 2b720256a319c9f9709801cb690f61cf1dbd0ace
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T23:13:01Z

    Fix tests?

commit 5e2cd3afe77333ee586cb0fdfe962856b1ba2e84
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-12T23:54:32Z

    Miscellaneous cleanup

commit 6519c2a8bf5e4dc8067bedad86e04a4cef0bc24f
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T19:03:53Z

    Merge branch 'master' of github.com:apache/spark into hive-catalog

commit 7d58fac540694f21279f221b4fae489c6b4d1933
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T22:17:15Z

    Address comments + minor cleanups

commit 1c05b9b3ce677a62062f1d90f861b20398ab42a4
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T22:33:13Z

    Fix wrong Hive TableType issue

    We used to pass CatalogTableType#toString into HiveTable, which
    fails later when Hive extracts the Java enum value from the
    string. This was the cause of test failures in a few test suites:
      - InsertIntoHiveTableSuite
      - MultiDatabaseSuite
      - ParquetMetastoreSuite
      - ...

commit 4ecc3b1245998d2c9743840d1243ec55770db1a9
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T22:55:25Z

    Fix CREATE TABLE serde setting

    Blatant programming mistake. This was caught by
    hive.execution.SQLQuerySuite.

commit 863ebd095e7c36c740ad88ec671522a4550f0273
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-16T23:22:54Z

    Fix NPE in CREATE VIEW

    When we create views using HiveQl we pass in null data types
    because we can't specify these types until later. This caused
    an NPE downstream.

commit 539449215ebfc3df5d7b13fbd4808f7e37d20d77
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-17T21:32:36Z

    Change CatalogColumn#dataType to String

    This fixes a failing test in HiveCompatibilitySuite, where Spark
    was ignoring the character limit in varchar but Hive respected
    it. The issue was that we were converting Hive types to and from
    Spark DataType, and in the process losing the limit information.
    Instead of doing this conversion, we simply encode the data type
    as a string so we don't lose any information. This means less
    type-safety but the real fix is outside the scope of this patch.

commit fe295fb6899be00eb8a37eceb6c996cf0794ff2c
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-17T22:23:54Z

    Fix style

commit 43e3c66057d37c45db7392c6793baeef05b05039
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-18T00:32:59Z

    Fix MetastoreDataSourcesSuite

    I missed one place where the data type was still a DataType
    rather than a string.

commit 27656491561a918e4e5bec7f44ef946ef825dc19
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-18T03:04:57Z

    Add HiveCatalogSuite

    This suite extends the existing CatalogTestCases. Many tests
    needed to be modified significantly for Hive to work. Even after
    many hours spent on trying to make this work, there is still one
    that doesn't pass for some reason. In particular, I was not able
    to call "alterPartitions" on an existing Hive table as of this
    commit. That test is ignored for now. The rest of the tests
    added in this commit should pass.

commit 428c3c5cb875d2a160093a5d71f9634c2b0cb6aa
Author: Andrew Or <and...@databricks.com>
Date:   2016-02-18T03:07:17Z

    Merge branch 'master' of github.com:apache/spark into hive-catalog

----