Thanks Burak and Walaa for the feedback!

Here are my perspectives:

> We shouldn't be persisting things like the schema for a view


This is not related to which option we choose, because the existing code
persists the schema as well.
When resolving a view, the analyzer always parses the view SQL text; it
does not use the schema.
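To illustrate the point, here is a minimal sketch (hypothetical, simplified classes, not Spark's actual internals) of why the persisted schema is inert during resolution:

```java
import java.util.List;

// Hypothetical, simplified model of view metadata -- not Spark's classes.
// The persisted schema is carried along for DESCRIBE, but resolution
// always re-parses the stored SQL text and derives the plan from it.
class ViewMetadata {
    final String sqlText;            // defining query, re-parsed on every resolution
    final List<String> storedSchema; // persisted column names, read only by DESCRIBE

    ViewMetadata(String sqlText, List<String> storedSchema) {
        this.sqlText = sqlText;
        this.storedSchema = storedSchema;
    }

    // Resolution: parse the SQL text; the stored schema is never consulted.
    String resolve() {
        return "Parsed[" + sqlText + "]";
    }

    // DESCRIBE: the one consumer of the persisted schema.
    String describe() {
        return String.join(", ", storedSchema);
    }
}
```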

> AFAIK view schema is only used by DESCRIBE.


> Why not use TableCatalog.loadTable to load both tables and views?
> Also, views can be defined on top of either other views or base tables, so
> the less divergence in code paths between views and tables the better.


Existing Spark takes this approach, and there are quite a few checks like
"tableType == CatalogTableType.VIEW". View and table metadata surprisingly
have very little in common, so I'd like to group view-related code together,
separate from table processing.
Views are much closer to CTEs. The SPIP proposes a new rule, ViewSubstitution,
in the same "Substitution" batch as CTESubstitution.
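A rough sketch of that rule, using hypothetical, simplified plan nodes rather than Spark's actual analyzer classes: like CTESubstitution, it substitutes the view's parsed definition for the relation before table resolution ever runs.

```java
import java.util.Map;

// Minimal stand-ins for logical plan nodes, for illustration only.
interface Plan {}
record UnresolvedRelation(String name) implements Plan {}
record ParsedQuery(String sqlText) implements Plan {}

// Hypothetical ViewSubstitution rule: runs in the same batch as CTE
// substitution, replacing a relation with its view definition (if any)
// before the table catalog is consulted -- so views are resolved first.
class ViewSubstitution {
    private final Map<String, String> viewCatalog; // view name -> SQL text

    ViewSubstitution(Map<String, String> viewCatalog) {
        this.viewCatalog = viewCatalog;
    }

    Plan apply(Plan plan) {
        if (plan instanceof UnresolvedRelation r && viewCatalog.containsKey(r.name())) {
            return new ParsedQuery(viewCatalog.get(r.name()));
        }
        return plan; // not a known view; table resolution handles it later
    }
}
```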

> This way you avoid multiple RPCs to a catalog or data source or metastore,
> and you avoid namespace/name conflicts. Also you make yourself less
> susceptible to race conditions (which still inherently exist).
>

A valid concern. It can be mitigated by caching RPC calls in the catalog
implementation. The window for race conditions can also be narrowed
significantly, though not totally eliminated.
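As a sketch of the caching idea (a hypothetical helper, not part of any proposed API), a catalog implementation could memoize lookups so probing both the view and table namespaces costs at most one RPC per name:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical memoizing wrapper around an expensive catalog RPC.
// Invalidation on DDL narrows -- but cannot fully close -- the race window.
class CachingLookup<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> remoteLookup; // the underlying RPC

    CachingLookup(Function<String, T> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    T load(String name) {
        // First caller pays the RPC; subsequent lookups hit the cache.
        return cache.computeIfAbsent(name, remoteLookup);
    }

    void invalidate(String name) {
        cache.remove(name); // call on CREATE/DROP to keep the window small
    }
}
```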


On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Wenchen, agreed with what you said. I was referring to situations where
> the underlying table schema evolves (say by introducing a nested field in a
> Struct), and also what you mentioned in cases of SELECT *. The Hive
> metastore handling of those does not automatically update view schema (even
> though executing the view in Hive results in data that has the most recent
> schema when underlying tables evolve -- so newly added nested field data
> shows up in the view evaluation query result but not in the view schema).
>
> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> View should have a fixed schema like a table. It should either be
>> inferred from the query when creating the view, or be specified by the user
>> manually like CREATE VIEW v(a, b) AS SELECT.... Users can still alter
>> view schema manually.
>>
>> Basically a view is just a named SQL query, which mostly has fixed schema
>> unless you do something like SELECT *.
>>
>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> +1 to making views special forms of tables. Sometimes a table can be
>>> converted to a view to hide some of the implementation details while not
>>> impacting readers (provided that the write path is controlled). Also, views
>>> can be defined on top of either other views or base tables, so the less
>>> divergence in code paths between views and tables the better.
>>>
>>> For whether to materialize view schema or infer it, one of the issues we
>>> face with the HMS approach of materialization is that when the underlying
>>> table schema evolves, HMS will still keep the view schema unchanged. This
>>> causes a number of discrepancies that we address out-of-band (e.g., run
>>> separate pipeline to ensure view schema freshness, or just re-derive it at
>>> read time (example derivation algorithm for view Avro schema
>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>> )).
>>>
>>> Also regarding SupportsRead vs SupportWrite, some views can be
>>> updateable (example from MySQL
>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>>> also implementing that requires a few concepts that are more prominent in
>>> an RDBMS.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> My high level comment here is that as a naive person, I would expect a
>>>> View to be a special form of Table that SupportsRead but doesn't
>>>> SupportWrite. loadTable in the TableCatalog API should load both tables and
>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>>> less susceptible to race conditions (which still inherently exist).
>>>>
>>>> In addition, I'm not a SQL expert, but I thought that views are
>>>> evaluated at runtime, therefore we shouldn't be persisting things like the
>>>> schema for a view.
>>>>
>>>> What do people think of making Views a special form of Table?
>>>>
>>>> Best,
>>>> Burak
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Thanks Ryan.
>>>>>
>>>>> ViewCatalog API mimics TableCatalog API including how shared namespace
>>>>> is handled:
>>>>>
>>>>>    - The doc for createView
>>>>>    
>>>>> <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109>
>>>>>  states
>>>>>    "it will throw ViewAlreadyExistsException when a view or table already
>>>>>    exists for the identifier."
>>>>>    - The doc for loadView
>>>>>    
>>>>> <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75>
>>>>>  states
>>>>>    "If the catalog supports tables and contains a table for the 
>>>>> identifier and
>>>>>    not a view, this must throw NoSuchViewException."
>>>>>
>>>>> Agree it is good to explicitly specify the order of resolution. I will
>>>>> add a section in ViewCatalog javadoc to summarize the behavior for "shared
>>>>> namespace". The loadView doc will also be updated to spell out the order 
>>>>> of
>>>>> resolution.
>>>>>
>>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I agree with Wenchen that we need to be clear about resolution and
>>>>>> behavior. For example, I think that we would agree that CREATE VIEW
>>>>>> catalog.schema.name should fail when there is a table named
>>>>>> catalog.schema.name. We’ve already included this behavior in the
>>>>>> documentation for the TableCatalog API
>>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>>>> where create should fail if a view exists for the identifier.
>>>>>>
>>>>>> I think it was simply assumed that we would use the same approach —
>>>>>> the API requires that table and view names share a namespace. But it 
>>>>>> would
>>>>>> be good to specifically note either the order in which resolution will
>>>>>> happen (views are resolved first) or note that it is not allowed and
>>>>>> behavior is not guaranteed. I prefer the first option.
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Wenchen,
>>>>>>>
>>>>>>> Thanks for the feedback!
>>>>>>>
>>>>>>> 1. Add a new View API. How to avoid name conflicts between table and
>>>>>>>> view? When resolving relation, shall we lookup table catalog first or 
>>>>>>>> view
>>>>>>>> catalog?
>>>>>>>
>>>>>>>
>>>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>>>
>>>>>>>    - The proposed new view substitution rule and the changes to
>>>>>>>    ResolveCatalogs should ensure the view catalog is looked up first 
>>>>>>> for a
>>>>>>>    "dual" catalog.
>>>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>>>       name exists should fail.
>>>>>>>       -  Creating a table in table catalog when a view of the same
>>>>>>>       name exists should fail as well.
>>>>>>>
>>>>>>> Agree with you that a new View API is more flexible. A couple of
>>>>>>> notes:
>>>>>>>
>>>>>>>    - We actually started a common view prototype using the single
>>>>>>>    catalog approach, but once we added more and more view metadata, 
>>>>>>> storing
>>>>>>>    them in table properties became not manageable, especially for the 
>>>>>>> feature
>>>>>>>    like "versioning". Eventually we opted for a view backend of S3 JSON 
>>>>>>> files.
>>>>>>>    - We'd like to move away from Hive metastore
>>>>>>>
>>>>>>> For more details and discussion, see SPIP section "Background and
>>>>>>> Motivation".
>>>>>>>
>>>>>>> Thanks,
>>>>>>> John
>>>>>>>
>>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> Thanks for working on this! View support is very important to the
>>>>>>>> catalog plugin API.
>>>>>>>>
>>>>>>>> After reading your doc, I have one high-level question: should view
>>>>>>>> be a separate API, or is it just a special type of table?
>>>>>>>>
>>>>>>>> AFAIK in most databases, tables and views share the same namespace.
>>>>>>>> You can't create a view if a same-name table exists. In Hive, a view is just
>>>>>>>> just
>>>>>>>> a special type of table, so they are in the same namespace naturally. 
>>>>>>>> If we
>>>>>>>> have both table catalog and view catalog, we need a mechanism to make 
>>>>>>>> sure
>>>>>>>> there are no name conflicts.
>>>>>>>>
>>>>>>>> On the other hand, the view metadata is simple enough that it can be
>>>>>>>> put in table properties. I'd like to see more thoughts to evaluate
>>>>>>>> these 2 approaches:
>>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between table
>>>>>>>> and view? When resolving relation, shall we lookup table catalog first 
>>>>>>>> or
>>>>>>>> view catalog?
>>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we
>>>>>>>> do want to store table and views separately?
>>>>>>>>
>>>>>>>> I think a new View API is more flexible. I'd vote for it if we can
>>>>>>>> come up with a good mechanism to avoid name conflicts.
>>>>>>>>
>>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jzh...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Spark devs,
>>>>>>>>>
>>>>>>>>> I'd like to bring more attention to this SPIP. As
>>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature Expectation 
>>>>>>>>> (Dec.
>>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1.
>>>>>>>>>
>>>>>>>>> View catalog builds on top of the catalog plugin system introduced
>>>>>>>>> in DataSourceV2. It adds the “ViewCatalog” API to load, create, 
>>>>>>>>> alter, and
>>>>>>>>> drop views. A catalog plugin can naturally implement both ViewCatalog 
>>>>>>>>> and
>>>>>>>>> TableCatalog.
>>>>>>>>>
>>>>>>>>> Our internal implementation has been in production for over 8
>>>>>>>>> months. Recently we extended it to support materialized views, for 
>>>>>>>>> the read
>>>>>>>>> path initially.
>>>>>>>>>
>>>>>>>>> The PR has conflicts, which I will resolve shortly.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jzh...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>>>> support different storage backends, I am proposing a new view 
>>>>>>>>>> catalog API
>>>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>>>
>>>>>>>>>> Document:
>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>>>
>>>>>>>>>> As part of a project to support common views across query engines
>>>>>>>>>> like Spark and Presto, my team used the view catalog API in Spark
>>>>>>>>>> implementation. The project has been in production for over three months.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> John Zhuge
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>

-- 
John Zhuge
