I think it is a good idea to keep tables and views separate. The main two arguments I’ve heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I’ll primarily focus on the first.
Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern for even a naive implementation that talks to the Hive MetaStore. I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs because Spark will always prefer one over the other. In a similar line of reasoning, catalogs that expose both views and tables are much more rare than catalogs that only expose one. For example, v2 catalogs for JDBC and Cassandra expose data through the Table interface and implementing ViewCatalog would make little sense. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same. Say I have a way to convert Pig statements or some other representation into a SQL view. It would make little sense to combine that with some other TableCatalog. I also don’t think there is benefit from an API perspective to justify combining the Table and View interfaces. The two share only schema and properties, and are handled very differently internally — a View’s SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view’s SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created. Query planning is distinct between tables and views, so Spark doesn’t benefit from combining them. I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once because it had to parse a view plan and resolve relations within it using the view’s context (current database). In contrast, John’s new view substitution rules are cleaner and can stay within the substitution batch. People implementing views would also not benefit from combining the two interfaces: - There is little overlap between View and Table, only schema and properties - Most catalogs won’t implement both interfaces, so returning a ViewOrTable is more difficult for implementations - TableCatalog assumes that ViewCatalog will be added separately like John proposes, so we would have to break or replace that API I understand the initial appeal of combining TableCatalog and ViewCatalog since it is done that way in the existing interfaces. But I think that Hive chose to do that mostly on the fact that the two were already stored together, and not because it made sense for users of the API, or any other implementer of the API. rb On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote: > > > >> > AFAIK view schema is only used by DESCRIBE. >> >> Correction: Spark adds a new Project at the top of the parsed plan from >> view, based on the stored schema, to make sure the view schema doesn't >> change. >> > > Thanks Wenchen! I thought I forgot something :) Yes it is the validation > done in *checkAnalysis*: > > // If the view output doesn't have the same number of columns > neither with the child > // output, nor with the query column names, throw an > AnalysisException. > // If the view's child output can't up cast to the view output, > // throw an AnalysisException, too. > > The view output comes from the schema: > > val child = View( > desc = metadata, > output = metadata.schema.toAttributes, > child = parser.parsePlan(viewText)) > > So it is a validation (here) or cache (in DESCRIBE) nice to have but not > "required" or "should be frozen". Thanks Ryan and Burak for pointing that > out in SPIP. I will add a new paragraph accordingly. > -- Ryan Blue Software Engineer Netflix