Any updates here? I agree that a new View API is the better approach, but we need a solution that avoids a performance regression. We should elaborate on the cache idea.
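To make that concrete, here is a minimal sketch of how a catalog could memoize lookups so that resolving an identifier first as a view and then falling back to a table costs a single metastore RPC. Everything here (RawMetadata, fetchFromMetastore, CachingLookup) is illustrative, not an existing Spark API:

    import java.util.concurrent.ConcurrentHashMap

    // Illustrative stand-in for whatever the metastore RPC returns.
    case class RawMetadata(isView: Boolean, properties: Map[String, String])

    class CachingLookup(fetchFromMetastore: Seq[String] => Option[RawMetadata]) {
      // One cached RPC result per identifier, shared by view and table lookups.
      // Negative results (None) are cached too, so misses are also cheap.
      private val cache = new ConcurrentHashMap[Seq[String], Option[RawMetadata]]()

      private def lookup(ident: Seq[String]): Option[RawMetadata] =
        cache.computeIfAbsent(ident, k => fetchFromMetastore(k))

      def loadView(ident: Seq[String]): Option[RawMetadata] =
        lookup(ident).filter(_.isView)

      def loadTable(ident: Seq[String]): Option[RawMetadata] =
        lookup(ident).filterNot(_.isView)

      // DDL paths must invalidate so stale entries don't linger.
      def invalidate(ident: Seq[String]): Unit = cache.remove(ident)
    }

With a shared cache like this, the view-then-table fallback during resolution hits the metastore once instead of twice, which is the RPC concern Ryan addresses below.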
On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:

> I think it is a good idea to keep tables and views separate.
>
> The two main arguments I've heard for combining lookup into a single
> function are the ones brought up in this thread. First, an identifier in a
> catalog must be either a view or a table and should not collide. Second, a
> single lookup is more likely to require a single RPC. I think the RPC
> concern is well addressed by caching, which we already do in the Spark
> catalog, so I'll primarily focus on the first.
>
> Table/view name collision is unlikely to be a problem. Metastores that
> support both today store them in a single namespace, so this is not a
> concern even for a naive implementation that talks to the Hive MetaStore.
> I know that a new metastore catalog could choose to implement both
> ViewCatalog and TableCatalog and store the two sets separately, but that
> would be a very strange choice: if the metastore itself has different
> namespaces for tables and views, then it makes much more sense to expose
> them through separate catalogs, because Spark will always prefer one over
> the other.
>
> By a similar line of reasoning, catalogs that expose both views and tables
> are much rarer than catalogs that expose only one. For example, the v2
> catalogs for JDBC and Cassandra expose data through the Table interface,
> and implementing ViewCatalog would make little sense for them. Exposing
> new data sources to Spark requires TableCatalog, not ViewCatalog. View
> catalogs are likely to be the same. Say I have a way to convert Pig
> statements or some other representation into a SQL view. It would make
> little sense to combine that with some other TableCatalog.
>
> I also don't think there is a benefit from an API perspective that would
> justify combining the Table and View interfaces. The two share only schema
> and properties, and they are handled very differently internally: a View's
> SQL query is parsed and substituted into the plan, while a Table is
> wrapped in a relation that eventually becomes a Scan node via
> SupportsRead. A view's SQL also needs additional context to be resolved
> correctly: the current catalog and namespace from the time the view was
> created.
>
> Query planning is distinct between tables and views, so Spark doesn't
> benefit from combining them. I think it has actually caused problems that
> both were resolved by the same method in v1: the resolution rule grew
> extremely complicated trying to look up a reference just once, because it
> had to parse a view plan and resolve relations within it using the view's
> context (current database). In contrast, John's new view substitution
> rules are cleaner and can stay within the substitution batch.
>
> People implementing views would also not benefit from combining the two
> interfaces:
>
> - There is little overlap between View and Table, only schema and
>   properties
> - Most catalogs won't implement both interfaces, so returning a
>   ViewOrTable is more difficult for implementations
> - TableCatalog assumes that ViewCatalog will be added separately, as John
>   proposes, so we would have to break or replace that API
>
> I understand the initial appeal of combining TableCatalog and ViewCatalog,
> since it is done that way in the existing interfaces. But I think Hive
> chose that design mostly because the two were already stored together, not
> because it made sense for users of the API or for any other implementer of
> the API.
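To make the separate-interface point above concrete, here is a rough sketch of what a standalone ViewCatalog could look like. The method names and signatures are illustrative only, not the SPIP's final API:

    import org.apache.spark.sql.connector.catalog.Identifier
    import org.apache.spark.sql.types.StructType

    // Illustrative sketch only; a real plugin would also extend CatalogPlugin.
    trait View {
      def name: String
      def sql: String                     // the view's SQL text
      def currentCatalog: String          // resolution context captured at CREATE VIEW time
      def currentNamespace: Array[String]
      def schema: StructType              // stored schema, used to validate the view output
      def properties: Map[String, String]
    }

    trait ViewCatalog {
      // None if the identifier does not resolve to a view in this catalog.
      def loadView(ident: Identifier): Option[View]

      def createView(
          ident: Identifier,
          sql: String,
          currentCatalog: String,
          currentNamespace: Array[String],
          schema: StructType,
          properties: Map[String, String]): View

      def dropView(ident: Identifier): Boolean
    }

Note that the overlap with TableCatalog is limited to schema and properties, which supports keeping the two interfaces separate.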
>
> rb
>
> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>
>>> > AFAIK view schema is only used by DESCRIBE.
>>>
>>> Correction: Spark adds a new Project at the top of the parsed plan from
>>> the view, based on the stored schema, to make sure the view schema
>>> doesn't change.
>>
>> Thanks Wenchen! I thought I forgot something :) Yes, it is the validation
>> done in *checkAnalysis*:
>>
>>     // If the view output doesn't have the same number of columns as
>>     // either the child output or the query column names, throw an
>>     // AnalysisException.
>>     // If the view's child output can't be up-cast to the view output,
>>     // throw an AnalysisException, too.
>>
>> The view output comes from the schema:
>>
>>     val child = View(
>>       desc = metadata,
>>       output = metadata.schema.toAttributes,
>>       child = parser.parsePlan(viewText))
>>
>> So the stored schema is nice to have, for validation (here) and as a
>> cache (in DESCRIBE), but it is not "required", nor does it need to be
>> "frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will
>> add a new paragraph accordingly.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
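To illustrate the schema validation John and Wenchen describe: a simplified sketch of projecting the parsed view plan back to the stored schema. The real logic lives in Spark's view resolution and checkAnalysis, and also verifies column counts and that the casts are up-casts; this shows only the core idea:

    import org.apache.spark.sql.catalyst.expressions.{Alias, Cast}
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
    import org.apache.spark.sql.types.StructType

    // Wrap the parsed view plan in a Project that restores the stored schema,
    // so later changes to the underlying tables cannot silently change the
    // view's output names or types. Simplified: no arity or up-cast checks.
    def applyStoredSchema(parsed: LogicalPlan, stored: StructType): LogicalPlan = {
      val projectList = parsed.output.zip(stored.fields).map { case (attr, field) =>
        Alias(Cast(attr, field.dataType), field.name)()
      }
      Project(projectList, parsed)
    }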