Thanks for reviewing this! I'll create an SPIP doc and a JIRA issue for it and call a vote.
On Tue, Jan 22, 2019 at 11:41 AM Matt Cheah <mch...@palantir.com> wrote:

> +1 for n-part namespace as proposed. Agree that a short SPIP would be
> appropriate for this. Perhaps also a JIRA ticket?
>
> -Matt Cheah
>
> *From: *Felix Cheung <felixcheun...@hotmail.com>
> *Date: *Sunday, January 20, 2019 at 4:48 PM
> *To: *"rb...@netflix.com" <rb...@netflix.com>, Spark Dev List <dev@spark.apache.org>
> *Subject: *Re: [DISCUSS] Identifiers with multi-catalog support
>
> +1. I like Ryan's last mail. Thank you for putting it clearly (it should be
> a spec/SPIP!)
>
> I agree with and understand the need for a 3-part id. However, I don't
> think we should assume that it must be, or can only be, 3 parts. Once the
> catalog is identified (i.e. the first part), the catalog should be
> responsible for resolving the namespace, schema, etc. I also agree that a
> path is a good idea to add, to support file-based variants. Should the
> separator be optional (perhaps in *space) to keep this extensible? It
> might not always be '.'.
>
> Also, this whole scheme will need to play nice with column identifiers as
> well.
>
> ------------------------------
>
> *From:* Ryan Blue <rb...@netflix.com.invalid>
> *Sent:* Thursday, January 17, 2019 11:38 AM
> *To:* Spark Dev List
> *Subject:* Re: [DISCUSS] Identifiers with multi-catalog support
>
> Any discussion on how Spark should manage identifiers when multiple
> catalogs are supported?
>
> I know this is an area where a lot of people are interested in making
> progress, and it is a blocker for both multi-catalog support and CTAS in
> DSv2.
>
> On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue <rb...@netflix.com> wrote:
>
> I think that the solution to this problem is to mix the two approaches by
> supporting 3 identifier parts: catalog, namespace, and name, where the
> namespace can be an n-part identifier:
>
>     type Namespace = Seq[String]
>
>     case class CatalogIdentifier(space: Namespace, name: String)
>
> This allows catalogs to work with the hierarchy of the external store, but
> the catalog API only requires a few discovery methods to list namespaces
> and to list each type of object in a namespace:
>
>     def listNamespaces(): Seq[Namespace]
>     def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
>     def listTables(space: Namespace): Seq[CatalogIdentifier]
>     def listViews(space: Namespace): Seq[CatalogIdentifier]
>     def listFunctions(space: Namespace): Seq[CatalogIdentifier]
>
> The methods to list tables, views, or functions would return only
> identifiers of the type queried, not namespaces or the other objects.
>
> The SQL parser would be updated so that identifiers are parsed to
> UnresolvedIdentifier(parts: Seq[String]), and resolution would work like
> this pseudo-code:
>
>     def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
>       val maybeCatalog = sparkSession.catalog(ident.parts.head)
>       ident.parts match {
>         case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
>           (maybeCatalog.get, CatalogIdentifier(space, name))
>         case Seq(*space, name) =>
>           (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
>       }
>     }
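> A compilable version of that sketch, for reference. (Scala has no
> mid-sequence *space pattern, so this splits the parts explicitly. The
> CatalogPlugin case class, the catalogs registry, and defaultCatalog are
> stand-ins for illustration, not Spark's actual API.)
>
>     object IdentifierResolution {
>       type Namespace = Seq[String]
>       case class CatalogIdentifier(space: Namespace, name: String)
>
>       // Stand-in for a catalog plugin; the real plugin API is not modeled here.
>       case class CatalogPlugin(name: String)
>
>       // Hypothetical registry of configured catalogs, plus a default catalog.
>       val catalogs = Map("prod" -> CatalogPlugin("prod"))
>       val defaultCatalog = CatalogPlugin("default")
>
>       def resolveIdentifier(parts: Seq[String]): (CatalogPlugin, CatalogIdentifier) = {
>         require(parts.nonEmpty, "identifier must have at least one part")
>         catalogs.get(parts.head) match {
>           // First part names a configured catalog: the rest is namespace + name.
>           case Some(catalog) if parts.length > 1 =>
>             (catalog, CatalogIdentifier(parts.tail.init, parts.tail.last))
>           // Otherwise the whole identifier resolves in the default catalog.
>           case _ =>
>             (defaultCatalog, CatalogIdentifier(parts.init, parts.last))
>         }
>       }
>
>       def main(args: Array[String]): Unit = {
>         println(resolveIdentifier(Seq("prod", "db", "t")))  // (prod, ([db], t))
>         println(resolveIdentifier(Seq("db", "t")))          // (default, ([db], t))
>       }
>     }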
> I think this is a good approach because it allows Spark users to reference
> or discover any name in the hierarchy of an external store, it uses a few
> well-defined methods for discovery, and it makes name hierarchy a user
> concern.
>
> · SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of listNamespaces()
> · SHOW NAMESPACES LIKE a.b% would return the result of listNamespaces(Seq("a"), "b")
> · USE a.b would set the current namespace to Seq("a", "b")
> · SHOW TABLES would return the result of listTables(currentNamespace)
>
> Also, I think that we could generalize this a little more to support
> path-based tables by adding a path to CatalogIdentifier, either as a
> namespace or as a separate optional string. Then, the identifier passed to
> a catalog would work for either a path-based table or a catalog table,
> without needing a path-based catalog API.
>
> Thoughts?
>
> On Sun, Jan 13, 2019 at 1:38 PM Ryan Blue <rb...@netflix.com> wrote:
>
> In the DSv2 sync-up, we tried to discuss the table metadata proposal but
> were side-tracked by its use of TableIdentifier. There were good points
> about how Spark should identify tables, views, functions, etc., and I want
> to start a discussion here.
>
> Identifiers are orthogonal to the TableCatalog proposal, which can be
> updated to use whatever identifier class we choose. That proposal is
> concerned with what information should be passed to define a table, and
> with how to pass that information.
>
> The main question for *this* discussion is: *how should Spark identify
> tables, views, and functions when it supports multiple catalogs?*
>
> There are two main approaches:
>
> 1. Use a 3-part identifier, catalog.database.table
> 2. Use an identifier with an arbitrary number of parts
>
> *Option 1: use 3-part identifiers*
>
> The argument for option #1 is that it is simple. If an external data store
> has additional logical hierarchy layers, then that hierarchy would be
> mapped to multiple catalogs in Spark. Spark can support SHOW TABLES and
> SHOW DATABASES without much trouble. This is the approach used by Presto,
> so there is some precedent for it.
>
> The drawback is that mapping a more complex hierarchy into Spark requires
> more configuration. If an external DB has a 3-level hierarchy, say
> schema.database.table, then option #1 requires users to configure a
> catalog for each top-level structure, i.e. for each schema. When a new
> schema is added, it is not automatically accessible.
>
> Catalog implementations could use session options to provide a rough
> work-around by changing the plugin's "current" schema. I think this is an
> anti-pattern, so another strike against this option is that it encourages
> bad practices.
>
> *Option 2: use n-part identifiers*
>
> That drawback of option #1 is the main argument for option #2: Spark
> should allow users to easily interact with the native structure of an
> external store. For option #2, a full identifier would be an
> arbitrary-length list of name parts. For the example above,
> catalog.schema.database.table would be allowed. An identifier would be
> something like this:
>
>     case class CatalogIdentifier(parts: Seq[String])
>
> The problem with option #2 is how to implement a listing and discovery
> API for operations like SHOW TABLES. If the catalog API requires a
> list(ident: CatalogIdentifier), what does it return? There is no guarantee
> that the listed objects are tables and not nested namespaces. How would
> Spark handle arbitrary nesting that differs across catalogs?
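> To make the discovery problem concrete, a single untyped list() forces
> callers to handle a mixed result type, roughly like this toy sketch (the
> names and data here are made up for illustration):
>
>     object UntypedListing {
>       // Under option #2, a namespace can contain both nested namespaces and
>       // tables, so a single untyped list() must return a mixed result.
>       sealed trait CatalogObject
>       case class NamespaceRef(parts: Seq[String]) extends CatalogObject
>       case class TableRef(parts: Seq[String]) extends CatalogObject
>
>       // Toy data: namespace "a" holds a table t1 and a nested namespace a.b.
>       val contents: Map[Seq[String], Seq[CatalogObject]] = Map(
>         Seq("a") -> Seq(TableRef(Seq("a", "t1")), NamespaceRef(Seq("a", "b"))),
>         Seq("a", "b") -> Seq(TableRef(Seq("a", "b", "t2"))))
>
>       def list(ident: Seq[String]): Seq[CatalogObject] =
>         contents.getOrElse(ident, Seq.empty)
>
>       def main(args: Array[String]): Unit = {
>         // SHOW TABLES IN a cannot just return list(Seq("a")): the caller must
>         // filter out NamespaceRef results, and nesting depth varies per catalog.
>         list(Seq("a")).collect { case t: TableRef => t }.foreach(println)
>       }
>     }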
> Hopefully, I've captured the design question well enough for a productive
> discussion. Thanks!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix