[ https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553324#comment-16553324 ]

Ryan Blue commented on SPARK-24814:
-----------------------------------

I've been implementing more of the logical plans (AppendData, DeleteFrom, CTAS, 
and RTAS) on top of my PR that adds the proposed table catalog API. After 
thinking about this more, I don't think we need #3. I think we should always go 
from a catalog to a table implementation (data source v2), not from a data 
source to a catalog.
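
As a rough sketch of that direction (the names here are illustrative, loosely 
modeled on the table catalog API proposed in the PR, not the final interface):

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative identifier: resolution starts from a full name, so the
// catalog is known before any data source is involved.
case class Identifier(database: String, table: String)

// The table a catalog hands back carries its own v2 read/write
// implementation; Spark never goes from a source name back to a catalog.
trait Table {
  def name: String
  def schema: StructType
}

// Sketch of the catalog-to-table direction: Spark asks a catalog for a
// table and gets a data source v2 table implementation in return.
trait TableCatalog {
  def loadTable(ident: Identifier): Table
  def createTable(
      ident: Identifier,
      schema: StructType,
      properties: Map[String, String]): Table
}
{code}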

For example, consider the "parquet" data source. Once we have multiple table 
catalogs, which catalog should Parquet return? We could simply make it the 
"default" catalog, but that would prevent Spark from creating Parquet tables in 
other catalogs on some write paths. I don't think it makes sense for a user to 
issue a CTAS for a Parquet table without also specifying a catalog in the table 
name (via a name triple, {{catalog.db.table}}). TableIdentifier triples are 
supported through saveAsTable, insertInto, and all SQL statements, so it is 
easy to specify the catalog nearly everywhere. The one write path left out is 
{{df.write.save}}, but that could require a {{catalog}} option like the 
{{table}} and {{database}} options.
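
To make those write paths concrete, here is a sketch in spark-shell terms. The 
three-part names assume the multi-catalog resolution proposed in the SPIP, and 
the {{catalog}}/{{database}}/{{table}} options on {{save}} are the suggestion 
above, not an existing API; {{prod.db.events}} is an illustrative name:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalog-example").getOrCreate()
val df = spark.range(10).toDF("id")

// Name triple on saveAsTable: the catalog is explicit in the table name
// (assumes the proposed multi-catalog resolution).
df.write.format("parquet").saveAsTable("prod.db.events")

// Same in SQL: a CTAS names the target catalog directly.
spark.sql(
  "CREATE TABLE prod.db.events_copy USING parquet AS SELECT * FROM prod.db.events")

// df.write.save has no table name to carry the catalog, so the proposed
// catalog option fills the gap (hypothetical, mirroring table/database).
df.write
  .format("parquet")
  .option("catalog", "prod")
  .option("database", "db")
  .option("table", "events")
  .save()
{code}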

> Relationship between catalog and datasources
> --------------------------------------------
>
>                 Key: SPARK-24814
>                 URL: https://issues.apache.org/jira/browse/SPARK-24814
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> This is somewhat related, though not identical to, [~rdblue]'s SPIP on 
> datasources and catalogs.
> Here are the requirements (IMO) for fully implementing V2 datasources and 
> their relationships to catalogs:
>  # The global catalog should be configurable (the default can be HMS, but it 
> should be overridable).
>  # The default catalog (or an explicitly specified catalog in a query, once 
> multiple catalogs are supported) can determine the V2 datasource to use for 
> reading and writing the data.
>  # Conversely, a V2 datasource can determine which catalog to use for 
> resolution (e.g., if the user issues 
> {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would 
> decide which catalog to use for resolving “mytable”).


