Here is my interpretation of your proposal; please correct me if I got
something wrong.

End users can read/write a data source by its name and some options, e.g.
`spark.read.format("xyz").option(...).load()`. This is currently the only
end-user API for data source v2, and it is widely used by Spark users to
read/write data source v1 and file sources, so we should keep supporting it.
We will add more end-user APIs in the future, once we standardize the DDL
logical plans.
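
A minimal sketch of that round trip, assuming a running SparkSession and
using "xyz" and the paths as placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()

// Read from a source by name, with options; "xyz" and the option values are
// placeholders for whatever the source actually accepts.
val df = spark.read
  .format("xyz")
  .option("path", "/data/input")
  .load()

// Write back out through the same style of API.
df.write
  .format("xyz")
  .option("path", "/data/output")
  .save()
```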

If a data source wants to be used with tables, then it must implement some
catalog functionality. At a minimum it needs to support
create/lookup/alter/drop table, and optionally more features like managing
functions/views and supporting the USING syntax. This means that, to use a
file source with tables, we need another data source that has full catalog
functionality. We can implement a Hive data source with all catalog
functionality backed by the Hive Metastore (HMS), or a Glue data source
backed by AWS Glue. Both should support the USING syntax and thus support
file sources. If USING is not specified, the default storage (Hive tables)
should be used.
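
As a rough sketch (these names are illustrative, not an actual Spark API),
the minimal catalog functionality above could look something like this:

```scala
import org.apache.spark.sql.types.StructType

// Placeholder types for the sketch; not real Spark classes.
trait Table
trait TableChange

// Hypothetical catalog interface covering create/lookup/alter/drop table.
trait TableCatalog {
  def createTable(name: String, schema: StructType,
                  properties: Map[String, String]): Table
  def loadTable(name: String): Table
  def alterTable(name: String, changes: Seq[TableChange]): Table
  def dropTable(name: String): Boolean
}
```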

For path-based tables, we can create a special API and define a rule to
resolve ambiguity when looking up tables.
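
For reference, the existing path-based syntax that rule has to cover looks
like this (the path is just an example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Today's "source.`path`" form. The ambiguity to define is whether the
// leading name ("parquet" here) resolves to a data source or, once catalogs
// exist, to a catalog.
val byPath = spark.sql("SELECT * FROM parquet.`/data/events`")
```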

If we go in this direction, one problem is that "data source" may not be a
good name anymore, since a data source can now provide catalog functionality.

Under the hood, I feel this proposal is very similar to my second proposal,
except that a catalog implementation must provide a default data
source/storage, and the rule for looking up tables is different.


On Sun, Jul 29, 2018 at 11:43 PM Ryan Blue <rb...@netflix.com> wrote:

> Wenchen, what I'm suggesting is a bit of both of your proposals.
>
> I think that USING should be optional, like your first option. USING (or
> format(...) on the DataFrame side) should configure the source or
> implementation, while the catalog should be part of the table identifier.
> They serve two different purposes: configuring the storage within the
> catalog, and choosing which catalog to pass create or other calls to. I
> think that's pretty much what you suggest in #1. The USING syntax would
> continue to be used to configure storage within a catalog.
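>
> As an illustration (the catalog, database, and table names are invented),
> the two roles would show up in a statement like this:
>
> ```scala
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().getOrCreate()
>
> // "prod" in the identifier picks the catalog that receives the create call;
> // USING parquet only configures how that catalog stores the table.
> spark.sql("CREATE TABLE prod.db.events (id BIGINT, ts TIMESTAMP) USING parquet")
> ```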
>
> (Side note: I don't think this needs to be tied to a particular
> implementation. We currently use 'parquet' to tell the Spark catalog to use
> the Parquet source, but another catalog could also use 'parquet' to store
> data in Parquet format without using the Spark built-in source.)
>
> The second option suggests separating the catalog API from the data source.
> In #21306 <https://github.com/apache/spark/pull/21306>, I added the proposed
> catalog API and a reflection-based loader like the one the v1 sources use (and v2
> sources have used so far). I think that it makes much more sense to start
> with a catalog and then get the data source for operations like CTAS. This
> is compatible with the behavior from your point #1: the catalog chooses the
> source implementation and USING is optional.
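>
> A rough sketch of that ordering (all names invented, not the API from the
> PR):
>
> ```scala
> // Resolve the catalog from the identifier first; the catalog then chooses
> // the source implementation, which is why USING can stay optional.
> case class Identifier(catalog: Option[String], table: String)
> trait Table
> trait Catalog {
>   def createTable(ident: Identifier, properties: Map[String, String]): Table
> }
>
> def createForCtas(catalogs: Map[String, Catalog],
>                   defaultCatalog: Catalog,
>                   ident: Identifier,
>                   properties: Map[String, String]): Table = {
>   val catalog = ident.catalog.flatMap(catalogs.get).getOrElse(defaultCatalog)
>   catalog.createTable(ident, properties)
> }
> ```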
>
> The reason we considered an API to get a catalog from the source is that we
> defined the source API first, but it doesn't make sense to get a
> catalog from the data source. Catalogs can share data sources (e.g. prod
> and test environments). Plus, it makes more sense to determine the catalog
> and then have it return the source implementation because it may require a
> specific one, like JDBC or Iceberg would. With standard logical plans we
> always know the catalog when creating the plan: either the table identifier
> includes an explicit one, or the default catalog is used.
>
> In the PR I mentioned above, the catalog implementation's class is
> determined by Spark config properties, so there's no need to use
> ServiceLoader and we can use the same implementation class for multiple
> catalogs with different configs (e.g. prod and test environments).
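>
> For example, two catalogs could share one implementation class and differ
> only in configuration (the property names and classes here are
> illustrative, not the final ones):
>
> ```scala
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().getOrCreate()
>
> // Same implementation class for both catalogs, different configs.
> spark.conf.set("spark.sql.catalog.prod", "com.example.MetastoreCatalog")
> spark.conf.set("spark.sql.catalog.prod.uri", "thrift://prod-metastore:9083")
> spark.conf.set("spark.sql.catalog.test", "com.example.MetastoreCatalog")
> spark.conf.set("spark.sql.catalog.test.uri", "thrift://test-metastore:9083")
> ```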
>
> Your last point about path-based tables deserves some attention, but we also
> need to define the behavior of path-based tables. Part of what we want to
> preserve is flexibility, like how you don't need to alter the schema of JSON
> tables; you just write different data. For the path-based syntax, I suggest
> looking up the source first and using the source if there is one. If not,
> then look up the catalog. That way existing tables work, but we can migrate
> to catalogs with names that don't conflict.
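>
> In pseudocode, that lookup rule might be (types and helpers are invented):
>
> ```scala
> trait Relation
>
> // Try the registered sources first so existing path-based tables keep
> // working; fall back to a catalog only when no source has that name.
> def resolvePathTable(name: String,
>                      path: String,
>                      sources: Map[String, String => Relation],
>                      catalogs: Map[String, String => Relation]): Option[Relation] =
>   sources.get(name).orElse(catalogs.get(name)).map(load => load(path))
> ```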
>
> rb
>
