I am definitely in favor of first-class / consistent support for tables and
data sources.

One thing that is not clear to me from this proposal is exactly what the
interfaces are between:
 - Spark
 - A (The?) metastore
 - A data source

If we pass in the table identifier, is the data source then responsible for
talking directly to the metastore? Is that what we want? (I'm not sure.)

On Fri, Feb 2, 2018 at 10:39 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> There are two main ways to load tables in Spark: by name (db.table) and by
> a path. Unfortunately, the integration for DataSourceV2 has no support for
> identifying tables by name.
>
> I propose supporting the use of TableIdentifier, which is the standard
> way to pass around table names.
>
> The reason I think we should do this is to easily support more ways of
> working with DataSourceV2 tables. SQL statements and parts of the
> DataFrameReader and DataFrameWriter APIs that use table names create
> UnresolvedRelation instances that wrap an unresolved TableIdentifier.
>
> Once we support passing TableIdentifier to a DataSourceV2Relation, about
> all we need to enable these code paths is a resolution rule. For that rule,
> we could easily identify a default data source that handles named tables.
>
> This is what we’re doing in our Spark build, and we have DataSourceV2
> tables working great through SQL. (Part of this depends on the logical plan
> changes from my previous email to ensure inserts are properly resolved.)
>
> In the long term, I think we should update how we parse tables so that
> TableIdentifier can contain a source in addition to a database/context
> and a table name. That would allow us to integrate new sources fairly
> seamlessly, without needing a rather redundant SQL create statement like
> this:
>
> CREATE TABLE database.name USING source OPTIONS (table 'database.name')
>
> Also, I think we should pass TableIdentifier to DataSourceV2Relation,
> rather than going with Wenchen’s suggestion that we pass the table name as
> a string property, “table”. My rationale is that the new API shouldn’t leak
> its internal details to other parts of the planner.
>
> If we were to convert TableIdentifier to a “table” property wherever
> DataSourceV2Relation is created, we would create several places that need to be
> in sync with the same convention. On the other hand, passing
> TableIdentifier to DataSourceV2Relation and relying on the relation to
> correctly set the options passed to readers and writers minimizes the
> number of places that conversion needs to happen.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
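
To make the two options concrete, here is a rough sketch of the approach Ryan
describes, where the relation is the only place that turns an identifier into
reader/writer options. It does not depend on Spark: TableIdentifier below is a
stand-in that only mirrors the shape of Spark's class, and V2RelationSketch,
readerOptions, resolve, and the "table"/"database" option keys are hypothetical
names chosen for illustration, not the actual DataSourceV2 API.

```scala
object TableResolutionSketch {
  // Stand-in for Spark's TableIdentifier (same shape, not the real class).
  final case class TableIdentifier(table: String, database: Option[String] = None)

  // Hypothetical, simplified relation. It keeps the identifier and is the
  // single place where the identifier is converted into source options.
  final case class V2RelationSketch(
      source: String,
      ident: Option[TableIdentifier],
      userOptions: Map[String, String] = Map.empty) {

    // The "table" and "database" keys are an assumed convention.
    def readerOptions: Map[String, String] = {
      val identOptions = ident.toSeq.flatMap { id =>
        Seq("table" -> id.table) ++ id.database.map(db => "database" -> db)
      }
      userOptions ++ identOptions
    }
  }

  // Hypothetical resolution step: a table referenced by name is mapped to a
  // relation backed by a default source that handles named tables.
  def resolve(ident: TableIdentifier, defaultSource: String): V2RelationSketch =
    V2RelationSketch(defaultSource, Some(ident))
}
```

With this shape, a resolution rule only has to call something like
resolve(TableIdentifier("name", Some("database")), "source"); callers never
need to know which option keys the source expects.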
