Any updates here? I agree that a new View API is the better approach, but we need a solution that avoids a performance regression. We should elaborate on the cache idea.
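To make that concrete, here is a minimal sketch of how a catalog could memoize lookups so that resolving an identifier first as a view and then falling back to a table costs a single metastore RPC. Everything here (RawMetadata, fetchFromMetastore, CachingLookup) is illustrative, not an existing Spark API:

    import java.util.concurrent.ConcurrentHashMap

    // Illustrative stand-in for whatever the metastore RPC returns.
    case class RawMetadata(isView: Boolean, properties: Map[String, String])

    class CachingLookup(fetchFromMetastore: Seq[String] => Option[RawMetadata]) {
      // One cached RPC result per identifier, shared by view and table lookups.
      // Negative results (None) are cached too, so misses are also cheap.
      private val cache = new ConcurrentHashMap[Seq[String], Option[RawMetadata]]()

      private def lookup(ident: Seq[String]): Option[RawMetadata] =
        cache.computeIfAbsent(ident, k => fetchFromMetastore(k))

      def loadView(ident: Seq[String]): Option[RawMetadata] =
        lookup(ident).filter(_.isView)

      def loadTable(ident: Seq[String]): Option[RawMetadata] =
        lookup(ident).filterNot(_.isView)

      // DDL paths must invalidate so stale entries don't linger.
      def invalidate(ident: Seq[String]): Unit = cache.remove(ident)
    }

With a shared cache like this, the view-then-table fallback during resolution hits the metastore once instead of twice, which is the RPC concern Ryan addresses below.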
On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:

> I think it is a good idea to keep tables and views separate.
>
> The two main arguments I've heard for combining lookup into a single
> function are the ones brought up in this thread. First, an identifier in a
> catalog must be either a view or a table and should not collide. Second, a
> single lookup is more likely to require a single RPC. I think the RPC
> concern is well addressed by caching, which we already do in the Spark
> catalog, so I'll primarily focus on the first.
>
> Table/view name collision is unlikely to be a problem. Metastores that
> support both today store them in a single namespace, so this is not a
> concern even for a naive implementation that talks to the Hive MetaStore.
> I know that a new metastore catalog could choose to implement both
> ViewCatalog and TableCatalog and store the two sets separately, but that
> would be a very strange choice: if the metastore itself has different
> namespaces for tables and views, then it makes much more sense to expose
> them through separate catalogs, because Spark will always prefer one over
> the other.
>
> By a similar line of reasoning, catalogs that expose both views and tables
> are much rarer than catalogs that expose only one. For example, the v2
> catalogs for JDBC and Cassandra expose data through the Table interface,
> and implementing ViewCatalog would make little sense for them. Exposing
> new data sources to Spark requires TableCatalog, not ViewCatalog. View
> catalogs are likely to be the same. Say I have a way to convert Pig
> statements or some other representation into a SQL view. It would make
> little sense to combine that with some other TableCatalog.
>
> I also don't think there is a benefit from an API perspective that would
> justify combining the Table and View interfaces. The two share only schema
> and properties, and they are handled very differently internally: a View's
> SQL query is parsed and substituted into the plan, while a Table is
> wrapped in a relation that eventually becomes a Scan node via
> SupportsRead. A view's SQL also needs additional context to be resolved
> correctly: the current catalog and namespace from the time the view was
> created.
>
> Query planning is distinct between tables and views, so Spark doesn't
> benefit from combining them. I think it has actually caused problems that
> both were resolved by the same method in v1: the resolution rule grew
> extremely complicated trying to look up a reference just once, because it
> had to parse a view plan and resolve relations within it using the view's
> context (current database). In contrast, John's new view substitution
> rules are cleaner and can stay within the substitution batch.
>
> People implementing views would also not benefit from combining the two
> interfaces:
>
> - There is little overlap between View and Table, only schema and
>   properties
> - Most catalogs won't implement both interfaces, so returning a
>   ViewOrTable is more difficult for implementations
> - TableCatalog assumes that ViewCatalog will be added separately, as John
>   proposes, so we would have to break or replace that API
>
> I understand the initial appeal of combining TableCatalog and ViewCatalog,
> since it is done that way in the existing interfaces. But I think Hive
> chose that design mostly because the two were already stored together, not
> because it made sense for users of the API or for any other implementer of
> the API.
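To make the separate-interface point above concrete, here is a rough sketch of what a standalone ViewCatalog could look like. The method names and signatures are illustrative only, not the SPIP's final API:

    import org.apache.spark.sql.connector.catalog.Identifier
    import org.apache.spark.sql.types.StructType

    // Illustrative sketch only; a real plugin would also extend CatalogPlugin.
    trait View {
      def name: String
      def sql: String                     // the view's SQL text
      def currentCatalog: String          // resolution context captured at CREATE VIEW time
      def currentNamespace: Array[String]
      def schema: StructType              // stored schema, used to validate the view output
      def properties: Map[String, String]
    }

    trait ViewCatalog {
      // None if the identifier does not resolve to a view in this catalog.
      def loadView(ident: Identifier): Option[View]

      def createView(
          ident: Identifier,
          sql: String,
          currentCatalog: String,
          currentNamespace: Array[String],
          schema: StructType,
          properties: Map[String, String]): View

      def dropView(ident: Identifier): Boolean
    }

Note that the overlap with TableCatalog is limited to schema and properties, which supports keeping the two interfaces separate.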
>
> rb
>
> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>
>>> > AFAIK view schema is only used by DESCRIBE.
>>>
>>> Correction: Spark adds a new Project at the top of the parsed plan from
>>> the view, based on the stored schema, to make sure the view schema
>>> doesn't change.
>>
>> Thanks Wenchen! I thought I forgot something :) Yes, it is the validation
>> done in *checkAnalysis*:
>>
>>     // If the view output doesn't have the same number of columns as
>>     // either the child output or the query column names, throw an
>>     // AnalysisException.
>>     // If the view's child output can't be up-cast to the view output,
>>     // throw an AnalysisException, too.
>>
>> The view output comes from the schema:
>>
>>     val child = View(
>>       desc = metadata,
>>       output = metadata.schema.toAttributes,
>>       child = parser.parsePlan(viewText))
>>
>> So the stored schema is nice to have, for validation (here) and as a
>> cache (in DESCRIBE), but it is not "required", nor does it need to be
>> "frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will
>> add a new paragraph accordingly.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
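To illustrate the schema validation John and Wenchen describe: a simplified sketch of projecting the parsed view plan back to the stored schema. The real logic lives in Spark's view resolution and checkAnalysis, and also verifies column counts and that the casts are up-casts; this shows only the core idea:

    import org.apache.spark.sql.catalyst.expressions.{Alias, Cast}
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
    import org.apache.spark.sql.types.StructType

    // Wrap the parsed view plan in a Project that restores the stored schema,
    // so later changes to the underlying tables cannot silently change the
    // view's output names or types. Simplified: no arity or up-cast checks.
    def applyStoredSchema(parsed: LogicalPlan, stored: StructType): LogicalPlan = {
      val projectList = parsed.output.zip(stored.fields).map { case (attr, field) =>
        Alias(Cast(attr, field.dataType), field.name)()
      }
      Project(projectList, parsed)
    }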