I think the major issue is that users now have two ways to create a table
for a specific data source: 1) use the USING syntax, or 2) create the table
in that source's catalog. It can be very confusing if users create a table
in the Cassandra catalog but specify the HBase data source with USING. Also,
we can't drop the USING syntax, as data source v1 still needs it.
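
For example, assuming "cassandra" is registered as a catalog and "hbase" is
a v2 data source name (both names are placeholders here), it is unclear
what this statement should do:

  // in spark-shell, where `spark` is the active SparkSession
  spark.sql("CREATE TABLE cassandra.db1.tbl1 (id INT) USING hbase")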

I have 2 proposals.

1. A v2 data source can always be plugged into Spark via USING, and
optionally via a catalog. If it has catalog support, supporting the USING
syntax becomes optional. For example, the Hive data source must support
USING, and users are allowed to write CREATE TABLE hive.db1.tbl1 ... USING
hbase. The Cassandra data source would not support the USING syntax, so
users are not allowed to write CREATE TABLE cassandra.db1.tbl1 ... USING
hbase. It's a little confusing if users write CREATE TABLE hive.db1.tbl1
... USING hive.

2. Exclude the catalog functionality from data source v2 and create a new
API for catalog plugins, so that USING is still the only way to create a
data source table. However, since we still need to read table metadata from
the data source, we would have to provide a TableSupport API in data source
v2 so that a data source can create/look up/drop/alter tables. One problem
is the end-user API. Users can query a table from a data source directly
via SELECT ... FROM parquet.`/data/path` or FROM iceberg.db1.tbl1. It may
be ambiguous to interpret a 3-part name like `abc.db1.tbl1`: `abc` can be
either a catalog name or a data source name, so we need to define the
lookup precedence.
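
To make proposal 2 more concrete, here is a rough sketch of what such a
TableSupport mix-in could look like. All names and signatures below are
placeholders for discussion, not a finished design:

  import org.apache.spark.sql.catalyst.TableIdentifier
  import org.apache.spark.sql.types.StructType

  // placeholders for whatever table/change abstractions DSv2 ends up with
  trait Table
  trait TableChange

  // hypothetical mix-in for a v2 data source that manages its own metadata
  trait TableSupport {
    def createTable(ident: TableIdentifier, schema: StructType,
        properties: Map[String, String]): Table
    def lookupTable(ident: TableIdentifier): Option[Table]
    def alterTable(ident: TableIdentifier, changes: Seq[TableChange]): Table
    def dropTable(ident: TableIdentifier): Boolean
  }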

More proposals are welcome!

On Thu, Jul 26, 2018 at 3:44 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Quick update: I've updated my PR to add the table catalog API to implement
> this proposal. Here's the PR: https://github.com/apache/spark/pull/21306
>
> On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Lately, I’ve been working on implementing the new SQL logical plans. I’m
>> currently blocked working on the plans that require table metadata
>> operations. For example, CTAS will be implemented as a create table and a
>> write using DSv2 (and a drop table if anything goes wrong). That requires
>> something to expose the create and drop table actions: a table catalog.
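>>
>> A rough sketch of that flow, using placeholder names throughout since
>> this catalog API is exactly what is being proposed:
>>
>>   // hypothetical: `catalog` implements the proposed table catalog API
>>   val table = catalog.createTable(ident, query.schema, properties)
>>   try {
>>     writeWithV2(table, query)   // the DSv2 write for the new table
>>   } catch {
>>     case e: Throwable =>
>>       catalog.dropTable(ident)  // roll back the table we just created
>>       throw e
>>   }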
>>
>> Initially, I opened #21306 <https://github.com/apache/spark/pull/21306>
>> to get a table catalog from the data source, but that’s a bad idea because
>> it conflicts with future multi-catalog support. Sources are an
>> implementation of a read and write API that can be shared between catalogs.
>> For example, you could have prod and test HMS catalogs that both use the
>> Parquet source. The Parquet source shouldn’t determine whether a CTAS
>> statement creates a table in prod or test.
>>
>> That means that CTAS and other plans for DataSourceV2 need a solution to
>> determine the catalog to use.
>> Proposal
>>
>> I propose we add support for multiple catalogs now in support of the
>> DataSourceV2 work, to avoid hacky work-arounds.
>>
>> First, I think we need to add catalog to TableIdentifier so tables are
>> identified by catalog.db.table, not just db.table. This would make it
>> easy to specify the intended catalog for SQL statements, like CREATE
>> cat.db.table AS ..., and in the DataFrame API:
>> df.write.saveAsTable("cat.db.table") or spark.table("cat.db.table").
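>>
>> A minimal sketch of that change (the field name and default are only a
>> guess at what it could look like):
>>
>>   // today: TableIdentifier(table: String, database: Option[String])
>>   case class TableIdentifier(
>>       table: String,
>>       database: Option[String] = None,
>>       catalog: Option[String] = None)  // None = use the default catalog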
>>
>> Second, we will need an API for catalogs to implement. The SPIP on APIs
>> for Table Metadata
>> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>
>> already proposed the API for create/alter/drop table operations. The only
>> part that is missing is how to register catalogs instead of using
>> DataSourceV2 to instantiate them.
>>
>> I think we should configure catalogs through Spark config properties,
>> like this:
>>
>> spark.sql.catalog.<name> = <impl-class>
>> spark.sql.catalog.<name>.<property> = <value>
>>
>> When a catalog is referenced by name, Spark would instantiate the
>> specified class using a no-arg constructor. The instance would then be
>> configured by passing a map of the remaining pairs in the
>> spark.sql.catalog.<name>.* namespace to a configure method with the
>> namespace part removed and an extra “name” parameter with the catalog name.
>> This would support external sources like JDBC, which have common options
>> like driver or hostname and port.
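>>
>> For illustration, here is a sketch of the plugin contract this implies.
>> The trait and method names are placeholders, not a proposed final API:
>>
>>   // hypothetical plugin interface: no-arg constructor plus a configure call
>>   trait CatalogPlugin {
>>     def configure(name: String, options: Map[String, String]): Unit
>>   }
>>
>>   // how Spark could instantiate and configure a catalog referenced by name
>>   def loadCatalog(name: String, conf: Map[String, String]): CatalogPlugin = {
>>     val prefix = s"spark.sql.catalog.$name"
>>     val options = conf.collect {
>>       case (k, v) if k.startsWith(prefix + ".") =>
>>         k.stripPrefix(prefix + ".") -> v
>>     }
>>     val catalog = Class.forName(conf(prefix))
>>       .getConstructor().newInstance().asInstanceOf[CatalogPlugin]
>>     catalog.configure(name, options)
>>     catalog
>>   }
>>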
>> Backward-compatibility
>>
>> The current spark.catalog / ExternalCatalog would be used when the
>> catalog element of a TableIdentifier is left blank. That would provide
>> backward-compatibility. We could optionally allow users to control the
>> default table catalog with a property.
>> Relationship between catalogs and data sources
>>
>> In the proposed table catalog API, actions return a Table object that
>> exposes the DSv2 ReadSupport and WriteSupport traits. Table catalogs
>> would share data source implementations by returning Table instances
>> that use the correct data source. V2 sources would no longer need to be
>> loaded by reflection; the catalog would be loaded instead.
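>>
>> In code form, roughly (a sketch only; the exact trait shape is still open):
>>
>>   import org.apache.spark.sql.sources.v2.{ReadSupport, WriteSupport}
>>   import org.apache.spark.sql.types.StructType
>>
>>   // hypothetical: the catalog hands back a Table that carries the DSv2
>>   // read/write capabilities of whichever source backs it
>>   trait Table extends ReadSupport with WriteSupport {
>>     def schema: StructType
>>     def properties: Map[String, String]
>>   }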
>>
>> Tables created using format("source") or USING source in SQL specify the
>> data source implementation directly. This “format” should be passed to the
>> source as a table property. The existing ExternalCatalog will need to
>> implement the new TableCatalog API for v2 sources and would continue to
>> use the property to determine the table’s data source or format
>> implementation. Other table catalog implementations would be free to
>> interpret the format string as they choose or to use it to choose a data
>> source implementation as in the default catalog.
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
