Quick update: I've updated my PR to add the table catalog API that implements this proposal: https://github.com/apache/spark/pull/21306
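For anyone skimming without opening the PR, here is a rough sketch of the shape a table catalog API could take under this proposal. Everything below (Table, TableCatalog, TableChange, initialize, and the read/write stand-in traits) is illustrative only, not the exact interfaces in the PR or the SPIP:

  // Illustrative sketch only; names and signatures are assumptions, not the PR's API.
  import org.apache.spark.sql.types.StructType

  // Stand-ins for the DSv2 read/write mix-ins a returned table would expose.
  trait ReadSupport
  trait WriteSupport

  // A table handle returned by catalog operations; it exposes the v2 read/write
  // support so the catalog, not reflection on the source, picks the implementation.
  trait Table extends ReadSupport with WriteSupport {
    def name: String
    def schema: StructType
    def properties: Map[String, String]
  }

  // Placeholder for the SPIP's table-change descriptions (add column, set property, ...).
  trait TableChange

  trait TableCatalog {
    // Called once with the catalog name and its spark.sql.catalog.<name>.* options.
    def initialize(name: String, options: Map[String, String]): Unit

    def loadTable(db: String, table: String): Table
    def createTable(db: String, table: String, schema: StructType,
        properties: Map[String, String]): Table
    def alterTable(db: String, table: String, changes: Seq[TableChange]): Table
    def dropTable(db: String, table: String): Boolean
  }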
On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue <rb...@netflix.com> wrote:

> Lately, I’ve been working on implementing the new SQL logical plans. I’m currently blocked on the plans that require table metadata operations. For example, CTAS will be implemented as a create table and a write using DSv2 (and a drop table if anything goes wrong). That requires something to expose the create and drop table actions: a table catalog.
>
> Initially, I opened #21306 <https://github.com/apache/spark/pull/21306> to get a table catalog from the data source, but that’s a bad idea because it conflicts with future multi-catalog support. Sources are an implementation of a read and write API that can be shared between catalogs. For example, you could have prod and test HMS catalogs that both use the Parquet source. The Parquet source shouldn’t determine whether a CTAS statement creates a table in prod or test.
>
> That means that CTAS and other plans for DataSourceV2 need a way to determine the catalog to use.
>
> Proposal
>
> I propose we add support for multiple catalogs now, in support of the DataSourceV2 work, to avoid hacky work-arounds.
>
> First, I think we need to add catalog to TableIdentifier so tables are identified by catalog.db.table, not just db.table. This would make it easy to specify the intended catalog for SQL statements, like CREATE cat.db.table AS ..., and in the DataFrame API: df.write.saveAsTable("cat.db.table") or spark.table("cat.db.table").
>
> Second, we will need an API for catalogs to implement. The SPIP on APIs for Table Metadata <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#> already proposed the API for create/alter/drop table operations. The only part that is missing is how to register catalogs instead of using DataSourceV2 to instantiate them.
>
> I think we should configure catalogs through Spark config properties, like this:
>
>   spark.sql.catalog.<name> = <impl-class>
>   spark.sql.catalog.<name>.<property> = <value>
>
> When a catalog is referenced by name, Spark would instantiate the specified class using a no-arg constructor. The instance would then be configured by passing a map of the remaining pairs in the spark.sql.catalog.<name>.* namespace to a configure method, with the namespace part removed and an extra “name” parameter carrying the catalog name. This would support external sources like JDBC, which have common options like driver or hostname and port.
>
> Backward-compatibility
>
> The current spark.catalog / ExternalCatalog would be used when the catalog element of a TableIdentifier is left blank. That would provide backward-compatibility. We could optionally allow users to control the default table catalog with a property.
>
> Relationship between catalogs and data sources
>
> In the proposed table catalog API, actions return a Table object that exposes the DSv2 ReadSupport and WriteSupport traits. Table catalogs would share data source implementations by returning Table instances that use the correct data source. V2 sources would no longer need to be loaded by reflection; the catalog would be loaded instead.
>
> Tables created using format("source") or USING source in SQL specify the data source implementation directly. This “format” should be passed to the source as a table property.
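To make the registration mechanism described above concrete, catalog lookup could work roughly like the sketch below. This is only an illustration: Catalogs.load is a made-up name, it reuses the hypothetical TableCatalog/initialize sketched earlier in this message, and error handling (missing class, wrong type) is omitted.

  import org.apache.spark.sql.SparkSession

  object Catalogs {
    // Hypothetical loader: resolves "spark.sql.catalog.<name>" to an implementation
    // class, instantiates it with a no-arg constructor, and hands it the remaining
    // options from the same namespace.
    def load(name: String, spark: SparkSession): TableCatalog = {
      val implClass = spark.conf.get(s"spark.sql.catalog.$name")

      // Collect spark.sql.catalog.<name>.* entries with the namespace prefix stripped.
      val prefix = s"spark.sql.catalog.$name."
      val options = spark.conf.getAll.collect {
        case (key, value) if key.startsWith(prefix) => key.stripPrefix(prefix) -> value
      }

      val catalog = Class.forName(implClass)
        .getDeclaredConstructor()
        .newInstance()
        .asInstanceOf[TableCatalog]
      catalog.initialize(name, options)
      catalog
    }
  }

A JDBC-backed catalog might then be configured like this (com.example.JdbcTableCatalog is a hypothetical class):

  spark.sql.catalog.testcat = com.example.JdbcTableCatalog
  spark.sql.catalog.testcat.url = jdbc:postgresql://localhost/metadata
  spark.sql.catalog.testcat.driver = org.postgresql.Driver

and referenced as testcat.db.table from SQL or the DataFrame API.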
> The existing ExternalCatalog will need to implement the new TableCatalog API for v2 sources and would continue to use the property to determine the table’s data source or format implementation. Other table catalog implementations would be free to interpret the format string as they choose, or to use it to choose a data source implementation as in the default catalog.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix