Re: SPIP: Catalog API for view metadata

Burak Yavuz Thu, 13 Aug 2020 17:09:56 -0700

My high-level comment here is that, as a naive person, I would expect a View
to be a special form of Table that SupportsRead but doesn't SupportsWrite.
loadTable in the TableCatalog API should load both tables and views. This
way you avoid multiple RPCs to a catalog, data source, or metastore, and
you avoid namespace/name conflicts. You also make yourself less susceptible
to race conditions (which still inherently exist).
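
A minimal sketch of this idea, in Python for brevity. It is only a toy
model: Table, View, SingleCatalog, create and load_table are hypothetical
names for illustration, not Spark's TableCatalog API, and the supports_read /
supports_write flags merely stand in for the SupportsRead / SupportsWrite
capabilities.

```python
class AlreadyExistsError(Exception):
    """Raised when an identifier is already taken by a table or a view."""


class NoSuchTableError(Exception):
    """Raised when nothing (table or view) exists for an identifier."""


class Table:
    """Anything the catalog can load; readable and writable by default."""
    supports_read = True
    supports_write = True

    def __init__(self, name):
        self.name = name


class View(Table):
    """A view modeled as a Table that can be read but not written."""
    supports_write = False

    def __init__(self, name, sql):
        super().__init__(name)
        self.sql = sql


class SingleCatalog:
    """Hypothetical catalog keeping tables and views in one namespace."""

    def __init__(self):
        self._entries = {}   # identifier -> Table or View
        self.lookups = 0     # counts simulated round trips to the backend

    def create(self, ident, entry):
        # One map for both kinds, so a name conflict is caught in one place.
        if ident in self._entries:
            raise AlreadyExistsError(ident)
        self._entries[ident] = entry

    def load_table(self, ident):
        # A single lookup resolves the identifier to a table or a view.
        self.lookups += 1
        try:
            return self._entries[ident]
        except KeyError:
            raise NoSuchTableError(ident) from None
```

With this shape, the caller does one load_table call and then checks the
returned object's capabilities, instead of asking a view catalog and a table
catalog in turn.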


In addition, I'm not a SQL expert, but I thought that views are evaluated
at runtime, therefore we shouldn't be persisting things like the schema for
a view.
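
To make that point concrete, here is a toy sketch (Python, hypothetical
names throughout, and assuming the simplest possible view, a SELECT * over
one table). The view stores only a reference to its source, and its schema
is derived each time it is resolved rather than persisted.

```python
class Catalog:
    """Hypothetical catalog where view schemas are never stored."""

    def __init__(self):
        self.table_schemas = {}  # table name -> list of column names
        self.views = {}          # view name -> table it selects * from

    def view_schema(self, view_name):
        # Derived on every call from the current schema of the source table,
        # so it cannot go stale when the table changes.
        source = self.views[view_name]
        return list(self.table_schemas[source])
```

If the source table later gains a column, the next call to view_schema
reflects it without any update to the view's stored metadata.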

What do people think of making Views a special form of Table?

Best,
Burak


On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jzh...@apache.org> wrote:

> Thanks Ryan.
>
> ViewCatalog API mimics TableCatalog API including how shared namespace is
> handled:
>
>    - The doc for createView
>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109>
>    states "it will throw ViewAlreadyExistsException when a view or table
>    already exists for the identifier."
>    - The doc for loadView
>    <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75>
>    states "If the catalog supports tables and contains a table for the
>    identifier and not a view, this must throw NoSuchViewException."
>
> Agree it is good to explicitly specify the order of resolution. I will add
> a section in ViewCatalog javadoc to summarize the behavior for "shared
> namespace". The loadView doc will also be updated to spell out the order of
> resolution.
>
> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I agree with Wenchen that we need to be clear about resolution and
>> behavior. For example, I think that we would agree that CREATE VIEW
>> catalog.schema.name should fail when there is a table named
>> catalog.schema.name. We’ve already included this behavior in the
>> documentation for the TableCatalog API
>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>> where create should fail if a view exists for the identifier.
>>
>> I think it was simply assumed that we would use the same approach — the
>> API requires that table and view names share a namespace. But it would be
>> good to specifically note either the order in which resolution will happen
>> (views are resolved first) or note that it is not allowed and behavior is
>> not guaranteed. I prefer the first option.
>>
>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> Hi Wenchen,
>>>
>>> Thanks for the feedback!
>>>
>>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>> view? When resolving relation, shall we lookup table catalog first or view
>>>> catalog?
>>>
>>>
>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>
>>>    - The proposed new view substitution rule and the changes to
>>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>>    "dual" catalog.
>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>       -  Creating a view in view catalog when a table of the same name
>>>       exists should fail.
>>>       -  Creating a table in table catalog when a view of the same name
>>>       exists should fail as well.
>>>
>>> Agree with you that a new View API is more flexible. A couple of notes:
>>>
>>>    - We actually started a common view prototype using the single
>>>    catalog approach, but once we added more and more view metadata, storing
>>>    it in table properties became unmanageable, especially for a feature
>>>    like "versioning". Eventually we opted for a view backend of S3 JSON
>>>    files.
>>>    - We'd like to move away from Hive metastore
>>>
>>> For more details and discussion, see SPIP section "Background and
>>> Motivation".
>>>
>>> Thanks,
>>> John
>>>
>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> Hi John,
>>>>
>>>> Thanks for working on this! View support is very important to the
>>>> catalog plugin API.
>>>>
>>>> After reading your doc, I have one high-level question: should a view be
>>>> a separate API, or is it just a special type of table?
>>>>
>>>> AFAIK in most databases, tables and views share the same namespace. You
>>>> can't create a view if a same-name table exists. In Hive, view is just a
>>>> special type of table, so they are in the same namespace naturally. If we
>>>> have both table catalog and view catalog, we need a mechanism to make sure
>>>> there are no name conflicts.
>>>>
>>>> On the other hand, view metadata is simple enough to be put in table
>>>> properties. I'd like to see more thoughts to evaluate these 2
>>>> approaches:
>>>> 1. *Add a new View API*. How to avoid name conflicts between table and
>>>> view? When resolving a relation, shall we look up the table catalog first
>>>> or the view catalog?
>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>>> want to store tables and views separately?
>>>>
>>>> I think a new View API is more flexible. I'd vote for it if we can come
>>>> up with a good mechanism to avoid name conflicts.
>>>>
>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated
>>>>> in the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this
>>>>> feature can be considered for 3.2 or even 3.1.
>>>>>
>>>>> View catalog builds on top of the catalog plugin system introduced in
>>>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>>> TableCatalog.
>>>>>
>>>>> Our internal implementation has been in production for over 8 months.
>>>>> Recently we extended it to support materialized views, for the read path
>>>>> initially.
>>>>>
>>>>> The PR has conflicts that I will resolve shortly.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>> support different storage backends, I am proposing a new view catalog API
>>>>>> to load, create, alter, and drop views.
>>>>>>
>>>>>> Document:
>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>
>>>>>> As part of a project to support common views across query engines
>>>>>> like Spark and Presto, my team used the view catalog API in the Spark
>>>>>> implementation. The project has been in production for over three months.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>
