Any discussion on how Spark should manage identifiers when multiple catalogs are supported?
I know this is an area where a lot of people are interested in making
progress, and it is a blocker for both multi-catalog support and CTAS in
DSv2.

On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue <rb...@netflix.com> wrote:

> I think that the solution to this problem is to mix the two approaches by
> supporting 3 identifier parts: catalog, namespace, and name, where
> namespace can be an n-part identifier:
>
>     type Namespace = Seq[String]
>     case class CatalogIdentifier(space: Namespace, name: String)
>
> This allows catalogs to work with the hierarchy of the external store, but
> the catalog API only requires a few discovery methods to list namespaces
> and to list each type of object in a namespace.
>
>     def listNamespaces(): Seq[Namespace]
>     def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
>     def listTables(space: Namespace): Seq[CatalogIdentifier]
>     def listViews(space: Namespace): Seq[CatalogIdentifier]
>     def listFunctions(space: Namespace): Seq[CatalogIdentifier]
>
> The methods to list tables, views, or functions would only return
> identifiers for the type queried, not namespaces or the other objects.
>
> The SQL parser would be updated so that identifiers are parsed to
> UnresolvedIdentifier(parts: Seq[String]), and resolution would work like
> this pseudo-code:
>
>     def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
>       val maybeCatalog = sparkSession.catalog(ident.parts.head)
>       ident.parts match {
>         case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
>           (maybeCatalog.get, CatalogIdentifier(space, name))
>         case Seq(*space, name) =>
>           (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
>       }
>     }
>
> I think this is a good approach because it allows Spark users to reference
> or discover any name in the hierarchy of an external store, it uses a few
> well-defined methods for discovery, and it makes the name hierarchy a user
> concern.
>
>   - SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of
>     listNamespaces()
>   - SHOW NAMESPACES LIKE a.b% would return the result of
>     listNamespaces(Seq("a"), "b")
>   - USE a.b would set the current namespace to Seq("a", "b")
>   - SHOW TABLES would return the result of listTables(currentNamespace)
>
> Also, I think that we could generalize this a little more to support
> path-based tables by adding a path to CatalogIdentifier, either as a
> namespace or as a separate optional string. Then, the identifier passed to
> a catalog would work for either a path-based table or a catalog table,
> without needing a path-based catalog API.
>
> Thoughts?
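As a concrete sketch of the resolution rule in the pseudo-code above, here is a
minimal, self-contained Scala version. The CatalogPlugin trait, the Map-based
catalog registry, and every name in it are stand-ins invented for this example,
not the actual Spark API:

    // Minimal stand-ins for the proposed types; these are not actual Spark classes.
    object IdentifierResolution {
      type Namespace = Seq[String]

      trait CatalogPlugin { def name: String }

      case class CatalogIdentifier(space: Namespace, name: String)
      case class UnresolvedIdentifier(parts: Seq[String])

      // Split an n-part identifier into (catalog, namespace, name). If the first
      // part names a configured catalog, it is consumed as the catalog; otherwise
      // the whole identifier resolves against the default catalog.
      def resolveIdentifier(
          catalogs: Map[String, CatalogPlugin],
          defaultCatalog: CatalogPlugin,
          ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
        require(ident.parts.nonEmpty, "identifier must have at least one part")
        catalogs.get(ident.parts.head) match {
          case Some(catalog) if ident.parts.size > 1 =>
            val rest = ident.parts.tail
            (catalog, CatalogIdentifier(rest.init, rest.last))
          case _ =>
            (defaultCatalog, CatalogIdentifier(ident.parts.init, ident.parts.last))
        }
      }
    }

For example, with a catalog named "prod" configured, the parts Seq("prod",
"db", "events") resolve to (prodCatalog, CatalogIdentifier(Seq("db"),
"events")); without such a catalog, the same parts resolve against the default
catalog as CatalogIdentifier(Seq("prod", "db"), "events").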
> On Sun, Jan 13, 2019 at 1:38 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> In the DSv2 sync-up, we tried to discuss the Table metadata proposal but
>> were side-tracked by its use of TableIdentifier. There were good points
>> about how Spark should identify tables, views, functions, etc., and I want
>> to start a discussion here.
>>
>> Identifiers are orthogonal to the TableCatalog proposal, which can be
>> updated to use whatever identifier class we choose. That proposal is
>> concerned with what information should be passed to define a table, and
>> how to pass that information.
>>
>> The main question for *this* discussion is: *how should Spark identify
>> tables, views, and functions when it supports multiple catalogs?*
>>
>> There are two main approaches:
>>
>>   1. Use a 3-part identifier, catalog.database.table
>>   2. Use an identifier with an arbitrary number of parts
>>
>> *Option 1: use 3-part identifiers*
>>
>> The argument for option #1 is that it is simple. If an external data
>> store has additional logical hierarchy layers, then that hierarchy would
>> be mapped to multiple catalogs in Spark. Spark can support SHOW TABLES and
>> SHOW DATABASES without much trouble. This is the approach used by Presto,
>> so there is some precedent for it.
>>
>> The drawback is that mapping a more complex hierarchy into Spark requires
>> more configuration. If an external DB has a 3-level hierarchy, say
>> schema.database.table, then option #1 requires users to configure a
>> catalog for each top-level structure, i.e., each schema. When a new
>> schema is added, it is not automatically accessible.
>>
>> Catalog implementations could provide a rough work-around by using
>> session options to change the plugin’s “current” schema. I think this is
>> an anti-pattern, so another strike against this option is that it
>> encourages bad practices.
>>
>> *Option 2: use n-part identifiers*
>>
>> The drawback of option #1 is the main argument for option #2: Spark
>> should allow users to easily interact with the native structure of an
>> external store. For option #2, a full identifier would be an
>> arbitrary-length list of identifiers. For the example above,
>> catalog.schema.database.table would be allowed. An identifier would be
>> something like this:
>>
>>     case class CatalogIdentifier(parts: Seq[String])
>>
>> The problem with option #2 is how to implement a listing and discovery
>> API for operations like SHOW TABLES. If the catalog API requires a
>> list(ident: CatalogIdentifier), what does it return? There is no
>> guarantee that the listed objects are tables and not nested namespaces.
>> How would Spark handle arbitrary nesting that differs across catalogs?
>>
>> Hopefully, I’ve captured the design question well enough for a productive
>> discussion. Thanks!
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
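To make the listing question for option #2 concrete, here is a small toy model
in Scala. It is not Spark code; the nested-map "store", the sales/2019/summary
names, and the helper methods are all invented for illustration. It shows why a
single untyped list() call is ambiguous, and how typed listNamespaces/listTables
methods in the style of the proposal quoted earlier avoid that ambiguity:

    object ListingAmbiguity {
      // Hypothetical external hierarchy (invented for this example): under
      // "sales", "2019" is a namespace containing the table "events", while
      // "summary" is a table at the same level.
      val children: Map[Seq[String], Set[String]] = Map(
        Seq("sales")         -> Set("2019", "summary"),
        Seq("sales", "2019") -> Set("events"))

      val tables: Set[Seq[String]] =
        Set(Seq("sales", "summary"), Seq("sales", "2019", "events"))

      // Option #2's single untyped listing: it returns child names with no type
      // information, so SHOW TABLES cannot tell tables from nested namespaces.
      def list(space: Seq[String]): Set[String] =
        children.getOrElse(space, Set.empty[String])

      // Typed methods in the style of the earlier proposal: each call answers
      // exactly one question, so there is no ambiguity left to resolve.
      def listNamespaces(space: Seq[String]): Set[Seq[String]] =
        children.getOrElse(space, Set.empty[String]).map(space :+ _).filter(children.contains)

      def listTables(space: Seq[String]): Set[Seq[String]] =
        children.getOrElse(space, Set.empty[String]).map(space :+ _).filter(tables.contains)

      def main(args: Array[String]): Unit = {
        println(list(Seq("sales")))           // Set(2019, summary): ambiguous
        println(listNamespaces(Seq("sales"))) // Set(List(sales, 2019))
        println(listTables(Seq("sales")))     // Set(List(sales, summary))
      }
    }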