+1. I like Ryan's last mail. Thank you for putting it so clearly (this should 
become a spec/SPIP!)

I agree with and understand the need for a 3-part identifier. However, I don't 
think we should assume that it must be, or can only be, 3 parts long. Once the 
catalog is identified (i.e. the first part), the catalog should be responsible 
for resolving the namespace, schema, etc. I also agree that a path is a good 
addition to support file-based variants. Should the separator be optional 
(perhaps carried in the namespace, space) to keep this extensible? It might not 
always be '.'.
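To illustrate what I mean (a rough sketch with invented names, not part of 
Ryan's proposal): once the first part selects a catalog, the remaining parts 
could be handed to that catalog as-is, and the catalog decides how to split 
them and what separator it uses.

// Invented for illustration only: the catalog owns interpretation of
// everything after its own name, including namespace depth and separator.
case class RawIdentifier(catalog: String, rest: Seq[String])

trait IdentifierResolvingCatalog {
  // the catalog decides how to split `rest` into (namespace, object name)
  def resolve(rest: Seq[String]): (Seq[String], String)
  // the catalog decides how names are rendered; not necessarily "."
  def separator: String
}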

Also, this whole scheme will need to play nicely with column identifiers as well.


________________________________
From: Ryan Blue <rb...@netflix.com.invalid>
Sent: Thursday, January 17, 2019 11:38 AM
To: Spark Dev List
Subject: Re: [DISCUSS] Identifiers with multi-catalog support

Any discussion on how Spark should manage identifiers when multiple catalogs 
are supported?

I know this is an area where a lot of people are interested in making progress, 
and it is a blocker for both multi-catalog support and CTAS in DSv2.

On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue <rb...@netflix.com> wrote:

I think that the solution to this problem is to mix the two approaches by 
supporting 3 identifier parts: catalog, namespace, and name, where namespace 
can be an n-part identifier:

type Namespace = Seq[String]
case class CatalogIdentifier(space: Namespace, name: String)


This allows catalogs to work with the hierarchy of the external store, but the 
catalog API only requires a few discovery methods to list namespaces and to 
list each type of object in a namespace.

def listNamespaces(): Seq[Namespace]
def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
def listTables(space: Namespace): Seq[CatalogIdentifier]
def listViews(space: Namespace): Seq[CatalogIdentifier]
def listFunctions(space: Namespace): Seq[CatalogIdentifier]


The methods to list tables, views, or functions would only return identifiers 
for the type queried, not namespaces or other kinds of objects.
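As a hedged sketch of the intended semantics (the InMemoryCatalog class below 
is invented purely for illustration, using the Namespace alias from above), 
listNamespaces(space, prefix) would return namespaces under space whose next 
level starts with prefix:

// Illustration only: an in-memory catalog showing the intended semantics of
// listNamespaces(space, prefix).
class InMemoryCatalog(namespaces: Seq[Namespace]) {
  def listNamespaces(space: Namespace, prefix: String): Seq[Namespace] =
    namespaces.filter { ns =>
      ns.startsWith(space) &&            // namespace is under `space`
        ns.length > space.length &&      // and is strictly deeper
        ns(space.length).startsWith(prefix) // next level matches the prefix
    }
}

// e.g. with namespaces Seq(Seq("a","b1"), Seq("a","b2"), Seq("a","c")),
// listNamespaces(Seq("a"), "b") returns Seq(Seq("a","b1"), Seq("a","b2"))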

The SQL parser would be updated so that identifiers are parsed to 
UnresolvedIdentifier(parts: Seq[String]), and resolution would work like this 
pseudo-code:

def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
  // try to interpret the first identifier part as a registered catalog name
  val maybeCatalog = sparkSession.catalog(ident.parts.head)
  ident.parts match {
    case Seq(_, space @ _*) if maybeCatalog.isDefined && space.nonEmpty =>
      // first part named a catalog; the rest is namespace + object name
      (maybeCatalog.get, CatalogIdentifier(space.init, space.last))
    case parts =>
      // no catalog matched; resolve all parts against the default catalog
      (sparkSession.defaultCatalog, CatalogIdentifier(parts.init, parts.last))
  }
}


I think this is a good approach because it allows Spark users to reference or 
discover any name in the hierarchy of an external store, it uses a few 
well-defined methods for discovery, and it makes the name hierarchy a user 
concern.

  *   SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of 
listNamespaces()
  *   SHOW NAMESPACES LIKE a.b% would return the result of 
listNamespaces(Seq("a"), "b") (see the sketch after this list)
  *   USE a.b would set the current namespace to Seq("a", "b")
  *   SHOW TABLES would return the result of listTables(currentNamespace)
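For example (a minimal sketch; splitting the pattern on '.' and treating the 
trailing '%' as the prefix are assumptions about how the parser would hand the 
pattern over, not part of the proposal):

// Hypothetical translation of SHOW NAMESPACES LIKE a.b% into the
// listNamespaces(space, prefix) call from the proposal above.
def translateShowNamespacesLike(pattern: String): (Namespace, String) = {
  val parts = pattern.split('.').toSeq        // Seq("a", "b%")
  (parts.init, parts.last.stripSuffix("%"))   // (Seq("a"), "b")
}

// translateShowNamespacesLike("a.b%") == (Seq("a"), "b")
// which maps to catalog.listNamespaces(Seq("a"), "b")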

Also, I think that we could generalize this a little more to support path-based 
tables by adding a path to CatalogIdentifier, either as a namespace or as a 
separate optional string. Then, the identifier passed to a catalog would work 
for either a path-based table or a catalog table, without needing a path-based 
catalog API.
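A minimal sketch of the second variant mentioned above (the optional-path field 
and its name are assumptions for illustration, not settled API):

// Illustration only: the same identifier type describing either a catalog
// table or a path-based table, depending on whether `path` is set.
case class CatalogIdentifier(
    space: Namespace,
    name: String,
    path: Option[String] = None)

// catalog table:    CatalogIdentifier(Seq("db"), "events")
// path-based table: CatalogIdentifier(Seq.empty, "events",
//                     Some("s3://bucket/warehouse/events"))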

Thoughts?

On Sun, Jan 13, 2019 at 1:38 PM Ryan Blue <rb...@netflix.com> wrote:

In the DSv2 sync up, we tried to discuss the Table metadata proposal but were 
side-tracked on its use of TableIdentifier. There were good points about how 
Spark should identify tables, views, functions, etc., and I want to start a 
discussion here.

Identifiers are orthogonal to the TableCatalog proposal, which can be updated 
to use whatever identifier class we choose. That proposal is concerned with 
what information should be passed to define a table, and how to pass that 
information.

The main question for this discussion is: how should Spark identify tables, 
views, and functions when it supports multiple catalogs?

There are two main approaches:

  1.  Use a 3-part identifier, catalog.database.table
  2.  Use an identifier with an arbitrary number of parts

Option 1: use 3-part identifiers

The argument for option #1 is that it is simple. If an external data store has 
additional logical hierarchy layers, then that hierarchy would be mapped to 
multiple catalogs in Spark. Spark can support show tables and show databases 
without much trouble. This is the approach used by Presto, so there is some 
precedent for it.

The drawback is that mapping a more complex hierarchy into Spark requires more 
configuration. If an external DB has a 3-level hierarchy — say, for example, 
schema.database.table — then option #1 requires users to configure a catalog 
for each top-level structure, each schema. When a new schema is added, it is 
not automatically accessible.
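For example (a hedged sketch: the spark.sql.catalog.<name> property convention 
and the ExternalDbCatalog class are assumptions for illustration, not settled 
API), mapping a schema.database.table store onto option #1 would need one 
catalog entry per schema, and adding a schema means adding another entry:

import org.apache.spark.sql.SparkSession

// Hypothetical configuration: one registered catalog per top-level schema
// of the external store.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.sales",   "com.example.ExternalDbCatalog")
  .config("spark.sql.catalog.sales.schema",   "sales")
  .config("spark.sql.catalog.finance", "com.example.ExternalDbCatalog")
  .config("spark.sql.catalog.finance.schema", "finance")
  .getOrCreate()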

Catalog implementations could use session options to provide a rough 
work-around by changing the plugin’s “current” schema. I think this is an 
anti-pattern, so another strike against this option is that it encourages bad 
practices.

Option 2: use n-part identifiers

The drawback of option #1 is the main argument for option #2: Spark should 
allow users to easily interact with the native structure of an external store. 
For option #2, a full identifier would be an arbitrary-length list of name 
parts. For the example above, catalog.schema.database.table would be allowed. 
An identifier would be something like this:

case class CatalogIdentifier(parts: Seq[String])


The problem with option #2 is how to implement a listing and discovery API, for 
operations like SHOW TABLES. If the catalog API requires a list(ident: 
CatalogIdentifier), what does it return? There is no guarantee that the listed 
objects are tables and not nested namespaces. How would Spark handle arbitrary 
nesting that differs across catalogs?
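To make the discovery problem concrete (the types below are invented purely to 
illustrate it, not a proposed API), a single listing call would have to return 
a mixed bag of objects that callers must inspect and possibly recurse into:

// Invented for illustration: with arbitrary nesting, list() cannot promise
// its results are tables; each entry must say what kind of object it is,
// and SHOW TABLES would have to filter (and maybe recurse) on that.
sealed trait ListedObject
case class ListedNamespace(parts: Seq[String]) extends ListedObject
case class ListedTable(ident: CatalogIdentifier) extends ListedObject
case class ListedFunction(ident: CatalogIdentifier) extends ListedObject

trait ListingCatalog {
  def list(ident: CatalogIdentifier): Seq[ListedObject]
}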

Hopefully, I’ve captured the design question well enough for a productive 
discussion. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix


