Quick update: I've updated my PR to add the table catalog API that implements
this proposal. Here's the PR: https://github.com/apache/spark/pull/21306

On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue <rb...@netflix.com> wrote:

> Lately, I’ve been working on implementing the new SQL logical plans. I’m
> currently blocked working on the plans that require table metadata
> operations. For example, CTAS will be implemented as a create table and a
> write using DSv2 (and a drop table if anything goes wrong). That requires
> something to expose the create and drop table actions: a table catalog.
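>
> Roughly, the control flow would look something like this (just a sketch of
> the idea, not actual plan code; writeToTable is a placeholder for the DSv2
> write, and createTable/dropTable are the catalog operations described below):
>
>   val table = catalog.createTable(ident, query.schema, properties)
>   try {
>     writeToTable(table, query)   // write the query's results via DSv2
>   } catch {
>     case e: Throwable =>
>       catalog.dropTable(ident)   // roll back the new table if the write fails
>       throw e
>   }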
>
> Initially, I opened #21306 <https://github.com/apache/spark/pull/21306>
> to get a table catalog from the data source, but that’s a bad idea because
> it conflicts with future multi-catalog support. Sources implement a read and
> write API that can be shared between catalogs.
> For example, you could have prod and test HMS catalogs that both use the
> Parquet source. The Parquet source shouldn’t determine whether a CTAS
> statement creates a table in prod or test.
>
> That means that CTAS and other DataSourceV2 plans need a way to determine
> which catalog to use.
> Proposal
>
> I propose we add support for multiple catalogs now in support of the
> DataSourceV2 work, to avoid hacky work-arounds.
>
> First, I think we need to add catalog to TableIdentifier so tables are
> identified by catalog.db.table, not just db.table. This would make it
> easy to specify the intended catalog for SQL statements, like CREATE TABLE
> cat.db.table AS ..., and in the DataFrame API:
> df.write.saveAsTable("cat.db.table") or spark.table("cat.db.table").
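>
> A rough sketch of what the extended identifier could look like (the new
> catalog field is the only change; today's TableIdentifier already carries
> table and database):
>
>   case class TableIdentifier(
>       table: String,
>       database: Option[String] = None,
>       catalog: Option[String] = None)   // new: which catalog owns the table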
>
> Second, we will need an API for catalogs to implement. The SPIP on APIs
> for Table Metadata
> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>
> already proposed the API for create/alter/drop table operations. The only
> part that is missing is how to register catalogs instead of using
> DataSourceV2 to instantiate them.
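>
> To give a feel for the shape of that API, here's a simplified sketch based
> on the SPIP (method names and signatures are illustrative, not final; Table
> is the interface described further down):
>
>   import org.apache.spark.sql.catalyst.TableIdentifier
>   import org.apache.spark.sql.types.StructType
>
>   trait TableCatalog {
>     def loadTable(ident: TableIdentifier): Table
>     def createTable(
>         ident: TableIdentifier,
>         schema: StructType,
>         properties: Map[String, String]): Table
>     def alterTable(
>         ident: TableIdentifier,
>         properties: Map[String, String]): Table
>     def dropTable(ident: TableIdentifier): Boolean
>   }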
>
> I think we should configure catalogs through Spark config properties, like
> this:
>
> spark.sql.catalog.<name> = <impl-class>
> spark.sql.catalog.<name>.<property> = <value>
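>
> For example, to configure the prod and test HMS catalogs mentioned above
> (the class and property names here are made up for illustration):
>
> spark.sql.catalog.prod = com.example.catalog.HMSCatalog
> spark.sql.catalog.prod.uri = thrift://metastore-prod:9083
> spark.sql.catalog.test = com.example.catalog.HMSCatalog
> spark.sql.catalog.test.uri = thrift://metastore-test:9083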
>
> When a catalog is referenced by name, Spark would instantiate the
> specified class using a no-arg constructor. The instance would then be
> configured by calling a configure method, passing the catalog’s name and a
> map of the remaining spark.sql.catalog.<name>.* properties with the
> namespace prefix removed from the keys.
> This would support external sources like JDBC, which have common options
> like driver or hostname and port.
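>
> As a sketch, the contract for catalog implementations might look like this
> (the trait and method names are illustrative):
>
>   trait CatalogProvider {
>     /**
>      * Called after the no-arg constructor, with the catalog's name and its
>      * spark.sql.catalog.<name>.* properties (prefix removed from the keys).
>      */
>     def configure(name: String, options: Map[String, String]): Unit
>   }
>
>   // Spark would do roughly:
>   //   val catalog = Class.forName(implClass).newInstance().asInstanceOf[CatalogProvider]
>   //   catalog.configure(name, optionsFor(name))
>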
> Backward-compatibility
>
> The current spark.catalog / ExternalCatalog would be used when the
> catalog element of a TableIdentifier is left blank. That would provide
> backward-compatibility. We could optionally allow users to control the
> default table catalog with a property.
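>
> For example, something like this (the property name is just a placeholder):
>
> spark.sql.default-catalog = prod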
> Relationship between catalogs and data sources
>
> In the proposed table catalog API, actions return a Table object that
> exposes the DSv2 ReadSupport and WriteSupport traits. Table catalogs
> would share data source implementations by returning Table instances that
> use the correct data source. V2 sources would no longer need to be loaded
> by reflection; the catalog would be loaded instead.
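>
> One way the returned Table could look (again, just a sketch; the metadata
> methods are illustrative, while ReadSupport and WriteSupport are the
> existing DSv2 mix-ins):
>
>   import org.apache.spark.sql.sources.v2.{ReadSupport, WriteSupport}
>   import org.apache.spark.sql.types.StructType
>
>   trait Table extends ReadSupport with WriteSupport {
>     def name: String
>     def schema: StructType
>     def properties: Map[String, String]
>   }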
>
> Tables created using format("source") or USING source in SQL specify the
> data source implementation directly. This “format” should be passed to the
> source as a table property. The existing ExternalCatalog will need to
> implement the new TableCatalog API for v2 sources and would continue to
> use the property to determine the table’s data source or format
> implementation. Other table catalog implementations would be free to
> interpret the format string as they choose or to use it to choose a data
> source implementation as in the default catalog.
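>
> For example (the property key used here is illustrative):
>
>   df.write.format("parquet").saveAsTable("cat.db.table")
>
> would store something like "provider" -> "parquet" in the table's
> properties, and the catalog handling cat would use that property to pick
> the Parquet implementation of the read/write API.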
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
