Re: SPIP: Catalog API for view metadata

John Zhuge Tue, 18 Aug 2020 08:37:55 -0700

Thanks Wenchen. Will do.

On Tue, Aug 18, 2020 at 6:38 AM Wenchen Fan <cloud0...@gmail.com> wrote:


> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the parsed plan from
> view, based on the stored schema, to make sure the view schema doesn't
> change.
>
> Can you update your doc to incorporate the cache idea? Let's make sure we
> don't have perf issues if we go with the new View API.
>
> On Tue, Aug 18, 2020 at 4:25 PM John Zhuge <jzh...@apache.org> wrote:
>
>> Thanks Burak and Walaa for the feedback!
>>
>> Here are my perspectives:
>>
>> We shouldn't be persisting things like the schema for a view
>>
>>
>> This is not related to which option to choose because existing code
>> persists schema as well.
>> When resolving the view, the analyzer always parses the view sql text, it
>> does not use the schema.
>>
>>> AFAIK view schema is only used by DESCRIBE.
>>
>>
>>> Why not use TableCatalog.loadTable to load both tables and views
>>>
>> Also, views can be defined on top of either other views or base tables,
>>> so the less divergence in code paths between views and tables the better.
>>
>>
>> Existing Spark takes this approach and there are quite a few checks like
>> "tableType == CatalogTableType.VIEW".
>> View and table metadata surprisingly have very little in common, thus I'd
>> like to group view related code together, separate from table processing.
>> Views are much closer to CTEs. SPIP proposed a new rule ViewSubstitution
>> in the same "Substitution" batch as CTESubstitution.
>>
>> This way you avoid multiple RPCs to a catalog or data source or
>>> metastore, and you avoid namespace/name conflits. Also you make yourself
>>> less susceptible to race conditions (which still inherently exist).
>>>
>>
>> Valid concern. Can be mitigated by caching RPC calls in the catalog
>> implementation. The window for race condition can also be narrowed
>> significantly but not totally eliminated.
>>
>>
>> On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Wenchen, agreed with what you said. I was referring to situations where
>>> the underlying table schema evolves (say by introducing a nested field in a
>>> Struct), and also what you mentioned in cases of SELECT *. The Hive
>>> metastore handling of those does not automatically update view schema (even
>>> though executing the view in Hive results in data that has the most recent
>>> schema when underlying tables evolve -- so newly added nested field data
>>> shows up in the view evaluation query result but not in the view schema).
>>>
>>> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> View should have a fixed schema like a table. It should either be
>>>> inferred from the query when creating the view, or be specified by the user
>>>> manually like CREATE VIEW v(a, b) AS SELECT.... Users can still alter
>>>> view schema manually.
>>>>
>>>> Basically a view is just a named SQL query, which mostly has fixed
>>>> schema unless you do something like SELECT *.
>>>>
>>>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> +1 to making views as special forms of tables. Sometimes a table can
>>>>> be converted to a view to hide some of the implementation details while 
>>>>> not
>>>>> impacting readers (provided that the write path is controlled). Also, 
>>>>> views
>>>>> can be defined on top of either other views or base tables, so the less
>>>>> divergence in code paths between views and tables the better.
>>>>>
>>>>> For whether to materialize view schema or infer it, one of the issues
>>>>> we face with the HMS approach of materialization is that when the
>>>>> underlying table schema evolves, HMS will still keep the view schema
>>>>> unchanged. This causes a number of discrepancies that we address
>>>>> out-of-band (e.g., run separate pipeline to ensure view schema freshness,
>>>>> or just re-derive it at read time (example derivation algorithm for
>>>>> view Avro schema
>>>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java>
>>>>> )).
>>>>>
>>>>> Also regarding SupportsRead vs SupportWrite, some views can be
>>>>> updateable (example from MySQL
>>>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but
>>>>> also implementing that requires a few concepts that are more prominent in
>>>>> an RDBMS.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <brk...@gmail.com> wrote:
>>>>>
>>>>>> My high level comment here is that as a naive person, I would expect
>>>>>> a View to be a special form of Table that SupportsRead but doesn't
>>>>>> SupportWrite. loadTable in the TableCatalog API should load both tables 
>>>>>> and
>>>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>>>> metastore, and you avoid namespace/name conflits. Also you make yourself
>>>>>> less susceptible to race conditions (which still inherently exist).
>>>>>>
>>>>>> In addition, I'm not a SQL expert, but I thought that views are
>>>>>> evaluated at runtime, therefore we shouldn't be persisting things like 
>>>>>> the
>>>>>> schema for a view.
>>>>>>
>>>>>> What do people think of making Views a special form of Table?
>>>>>>
>>>>>> Best,
>>>>>> Burak
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>>
>>>>>>> Thanks Ryan.
>>>>>>>
>>>>>>> ViewCatalog API mimics TableCatalog API including how shared
>>>>>>> namespace is handled:
>>>>>>>
>>>>>>>    - The doc for createView
>>>>>>>    
>>>>>>> <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109>
>>>>>>>  states
>>>>>>>    "it will throw ViewAlreadyExistsException when a view or table 
>>>>>>> already
>>>>>>>    exists for the identifier."
>>>>>>>    - The doc for loadView
>>>>>>>    
>>>>>>> <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75>
>>>>>>>  states
>>>>>>>    "If the catalog supports tables and contains a table for the 
>>>>>>> identifier and
>>>>>>>    not a view, this must throw NoSuchViewException."
>>>>>>>
>>>>>>> Agree it is good to explicitly specify the order of resolution. I
>>>>>>> will add a section in ViewCatalog javadoc to summarize the behavior for
>>>>>>> "shared namespace". The loadView doc will also be updated to spell out 
>>>>>>> the
>>>>>>> order of resolution.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I agree with Wenchen that we need to be clear about resolution and
>>>>>>>> behavior. For example, I think that we would agree that CREATE
>>>>>>>> VIEW catalog.schema.name should fail when there is a table named
>>>>>>>> catalog.schema.name. We’ve already included this behavior in the
>>>>>>>> documentation for the TableCatalog API
>>>>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
>>>>>>>> where create should fail if a view exists for the identifier.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think it was simply assumed that we would use the same approach —
>>>>>>>> the API requires that table and view names share a namespace. But it 
>>>>>>>> would
>>>>>>>> be good to specifically note either the order in which resolution will
>>>>>>>> happen (views are resolved first) or note that it is not allowed and
>>>>>>>> behavior is not guaranteed. I prefer the first option.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jzh...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Wenchen,
>>>>>>>>>
>>>>>>>>> Thanks for the feedback!
>>>>>>>>>
>>>>>>>>> 1. Add a new View API. How to avoid name conflicts between table
>>>>>>>>>> and view? When resolving relation, shall we lookup table catalog 
>>>>>>>>>> first or
>>>>>>>>>> view catalog?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>>>>>>>>
>>>>>>>>>    - The proposed new view substitution rule and the changes to
>>>>>>>>>    ResolveCatalogs should ensure the view catalog is looked up first 
>>>>>>>>> for a
>>>>>>>>>    "dual" catalog.
>>>>>>>>>    - The implementation for a "dual" catalog plugin should ensure:
>>>>>>>>>       -  Creating a view in view catalog when a table of the same
>>>>>>>>>       name exists should fail.
>>>>>>>>>       -  Creating a table in table catalog when a view of the
>>>>>>>>>       same name exists should fail as well.
>>>>>>>>>
>>>>>>>>> Agree with you that a new View API is more flexible. A couple of
>>>>>>>>> notes:
>>>>>>>>>
>>>>>>>>>    - We actually started a common view prototype using the single
>>>>>>>>>    catalog approach, but once we added more and more view metadata, 
>>>>>>>>> storing
>>>>>>>>>    them in table properties became not manageable, especially for the 
>>>>>>>>> feature
>>>>>>>>>    like "versioning". Eventually we opted for a view backend of S3 
>>>>>>>>> JSON files.
>>>>>>>>>    - We'd like to move away from Hive metastore
>>>>>>>>>
>>>>>>>>> For more details and discussion, see SPIP section "Background and
>>>>>>>>> Motivation".
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> Thanks for working on this! View support is very important to the
>>>>>>>>>> catalog plugin API.
>>>>>>>>>>
>>>>>>>>>> After reading your doc, I have one high-level question: should
>>>>>>>>>> view be a separated API or it's just a special type of table?
>>>>>>>>>>
>>>>>>>>>> AFAIK in most databases, tables and views share the same
>>>>>>>>>> namespace. You can't create a view if a same-name table exists. In 
>>>>>>>>>> Hive,
>>>>>>>>>> view is just a special type of table, so they are in the same 
>>>>>>>>>> namespace
>>>>>>>>>> naturally. If we have both table catalog and view catalog, we need a
>>>>>>>>>> mechanism to make sure there are no name conflicts.
>>>>>>>>>>
>>>>>>>>>> On the other hand, the view metadata is very simple that can be
>>>>>>>>>> put in table properties. I'd like to see more thoughts to evaluate 
>>>>>>>>>> these 2
>>>>>>>>>> approaches:
>>>>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between
>>>>>>>>>> table and view? When resolving relation, shall we lookup table 
>>>>>>>>>> catalog
>>>>>>>>>> first or view catalog?
>>>>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if
>>>>>>>>>> we do want to store table and views separately?
>>>>>>>>>>
>>>>>>>>>> I think a new View API is more flexible. I'd vote for it if we
>>>>>>>>>> can come up with a good mechanism to avoid name conflicts.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jzh...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to bring more attention to this SPIP. As
>>>>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature 
>>>>>>>>>>> Expectation (Dec.
>>>>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1.
>>>>>>>>>>>
>>>>>>>>>>> View catalog builds on top of the catalog plugin system
>>>>>>>>>>> introduced in DataSourceV2. It adds the “ViewCatalog” API to load, 
>>>>>>>>>>> create,
>>>>>>>>>>> alter, and drop views. A catalog plugin can naturally implement both
>>>>>>>>>>> ViewCatalog and TableCatalog.
>>>>>>>>>>>
>>>>>>>>>>> Our internal implementation has been in production for over 8
>>>>>>>>>>> months. Recently we extended it to support materialized views, for 
>>>>>>>>>>> the read
>>>>>>>>>>> path initially.
>>>>>>>>>>>
>>>>>>>>>>> The PR has conflicts that I will resolve them shortly.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jzh...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> In order to disassociate view metadata from Hive Metastore and
>>>>>>>>>>>> support different storage backends, I am proposing a new view 
>>>>>>>>>>>> catalog API
>>>>>>>>>>>> to load, create, alter, and drop views.
>>>>>>>>>>>>
>>>>>>>>>>>> Document:
>>>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>>>>>>>>
>>>>>>>>>>>> As part of a project to support common views across query
>>>>>>>>>>>> engines like Spark and Presto, my team used the view catalog API 
>>>>>>>>>>>> in Spark
>>>>>>>>>>>> implementation. The project has been in production over three 
>>>>>>>>>>>> months.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> John Zhuge
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> John Zhuge
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> John Zhuge
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>>
>>
>
> --
John Zhuge

Re: SPIP: Catalog API for view metadata

Reply via email to