My high level comment here is that as a naive person, I would expect a View to be a special form of Table that SupportsRead but doesn't SupportWrite. loadTable in the TableCatalog API should load both tables and views. This way you avoid multiple RPCs to a catalog or data source or metastore, and you avoid namespace/name conflits. Also you make yourself less susceptible to race conditions (which still inherently exist).
In addition, I'm not a SQL expert, but I thought that views are evaluated at runtime, therefore we shouldn't be persisting things like the schema for a view. What do people think of making Views a special form of Table? Best, Burak On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jzh...@apache.org> wrote: > Thanks Ryan. > > ViewCatalog API mimics TableCatalog API including how shared namespace is > handled: > > - The doc for createView > > <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> > states > "it will throw ViewAlreadyExistsException when a view or table already > exists for the identifier." > - The doc for loadView > > <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> > states > "If the catalog supports tables and contains a table for the identifier and > not a view, this must throw NoSuchViewException." > > Agree it is good to explicitly specify the order of resolution. I will add > a section in ViewCatalog javadoc to summarize the behavior for "shared > namespace". The loadView doc will also be updated to spell out the order of > resolution. > > On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid> > wrote: > >> I agree with Wenchen that we need to be clear about resolution and >> behavior. For example, I think that we would agree that CREATE VIEW >> catalog.schema.name should fail when there is a table named >> catalog.schema.name. We’ve already included this behavior in the >> documentation for the TableCatalog API >> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->, >> where create should fail if a view exists for the identifier. >> >> I think it was simply assumed that we would use the same approach — the >> API requires that table and view names share a namespace. But it would be >> good to specifically note either the order in which resolution will happen >> (views are resolved first) or note that it is not allowed and behavior is >> not guaranteed. I prefer the first option. >> >> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jzh...@apache.org> wrote: >> >>> Hi Wenchen, >>> >>> Thanks for the feedback! >>> >>> 1. Add a new View API. How to avoid name conflicts between table and >>>> view? When resolving relation, shall we lookup table catalog first or view >>>> catalog? >>> >>> >>> See clarification in SPIP section "Proposed Changes - Namespace": >>> >>> - The proposed new view substitution rule and the changes to >>> ResolveCatalogs should ensure the view catalog is looked up first for a >>> "dual" catalog. >>> - The implementation for a "dual" catalog plugin should ensure: >>> - Creating a view in view catalog when a table of the same name >>> exists should fail. >>> - Creating a table in table catalog when a view of the same name >>> exists should fail as well. >>> >>> Agree with you that a new View API is more flexible. A couple of notes: >>> >>> - We actually started a common view prototype using the single >>> catalog approach, but once we added more and more view metadata, storing >>> them in table properties became not manageable, especially for the >>> feature >>> like "versioning". Eventually we opted for a view backend of S3 JSON >>> files. >>> - We'd like to move away from Hive metastore >>> >>> For more details and discussion, see SPIP section "Background and >>> Motivation". >>> >>> Thanks, >>> John >>> >>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cloud0...@gmail.com> >>> wrote: >>> >>>> Hi John, >>>> >>>> Thanks for working on this! View support is very important to the >>>> catalog plugin API. >>>> >>>> After reading your doc, I have one high-level question: should view be >>>> a separated API or it's just a special type of table? >>>> >>>> AFAIK in most databases, tables and views share the same namespace. You >>>> can't create a view if a same-name table exists. In Hive, view is just a >>>> special type of table, so they are in the same namespace naturally. If we >>>> have both table catalog and view catalog, we need a mechanism to make sure >>>> there are no name conflicts. >>>> >>>> On the other hand, the view metadata is very simple that can be put in >>>> table properties. I'd like to see more thoughts to evaluate these 2 >>>> approaches: >>>> 1. *Add a new View API*. How to avoid name conflicts between table and >>>> view? When resolving relation, shall we lookup table catalog first or view >>>> catalog? >>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do >>>> want to store table and views separately? >>>> >>>> I think a new View API is more flexible. I'd vote for it if we can come >>>> up with a good mechanism to avoid name conflicts. >>>> >>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jzh...@apache.org> wrote: >>>> >>>>> Hi Spark devs, >>>>> >>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated >>>>> in the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this >>>>> feature can be considered for 3.2 or even 3.1. >>>>> >>>>> View catalog builds on top of the catalog plugin system introduced in >>>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and >>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and >>>>> TableCatalog. >>>>> >>>>> Our internal implementation has been in production for over 8 months. >>>>> Recently we extended it to support materialized views, for the read path >>>>> initially. >>>>> >>>>> The PR has conflicts that I will resolve them shortly. >>>>> >>>>> Thanks, >>>>> >>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jzh...@apache.org> wrote: >>>>> >>>>>> Hi everyone, >>>>>> >>>>>> In order to disassociate view metadata from Hive Metastore and >>>>>> support different storage backends, I am proposing a new view catalog API >>>>>> to load, create, alter, and drop views. >>>>>> >>>>>> Document: >>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing >>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357 >>>>>> WIP PR: https://github.com/apache/spark/pull/28147 >>>>>> >>>>>> As part of a project to support common views across query engines >>>>>> like Spark and Presto, my team used the view catalog API in Spark >>>>>> implementation. The project has been in production over three months. >>>>>> >>>>>> Thanks, >>>>>> John Zhuge >>>>>> >>>>> >>>>> >>>>> -- >>>>> John Zhuge >>>>> >>>> >>> >>> -- >>> John Zhuge >>> >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > > > -- > John Zhuge >