Thanks Wenchen. Will do. On Tue, Aug 18, 2020 at 6:38 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> > AFAIK view schema is only used by DESCRIBE. > > Correction: Spark adds a new Project at the top of the parsed plan from > view, based on the stored schema, to make sure the view schema doesn't > change. > > Can you update your doc to incorporate the cache idea? Let's make sure we > don't have perf issues if we go with the new View API. > > On Tue, Aug 18, 2020 at 4:25 PM John Zhuge <jzh...@apache.org> wrote: > >> Thanks Burak and Walaa for the feedback! >> >> Here are my perspectives: >> >> We shouldn't be persisting things like the schema for a view >> >> >> This is not related to which option to choose because existing code >> persists schema as well. >> When resolving the view, the analyzer always parses the view sql text, it >> does not use the schema. >> >>> AFAIK view schema is only used by DESCRIBE. >> >> >>> Why not use TableCatalog.loadTable to load both tables and views >>> >> Also, views can be defined on top of either other views or base tables, >>> so the less divergence in code paths between views and tables the better. >> >> >> Existing Spark takes this approach and there are quite a few checks like >> "tableType == CatalogTableType.VIEW". >> View and table metadata surprisingly have very little in common, thus I'd >> like to group view related code together, separate from table processing. >> Views are much closer to CTEs. SPIP proposed a new rule ViewSubstitution >> in the same "Substitution" batch as CTESubstitution. >> >> This way you avoid multiple RPCs to a catalog or data source or >>> metastore, and you avoid namespace/name conflits. Also you make yourself >>> less susceptible to race conditions (which still inherently exist). >>> >> >> Valid concern. Can be mitigated by caching RPC calls in the catalog >> implementation. The window for race condition can also be narrowed >> significantly but not totally eliminated. >> >> >> On Fri, Aug 14, 2020 at 2:43 AM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> Wenchen, agreed with what you said. I was referring to situations where >>> the underlying table schema evolves (say by introducing a nested field in a >>> Struct), and also what you mentioned in cases of SELECT *. The Hive >>> metastore handling of those does not automatically update view schema (even >>> though executing the view in Hive results in data that has the most recent >>> schema when underlying tables evolve -- so newly added nested field data >>> shows up in the view evaluation query result but not in the view schema). >>> >>> On Fri, Aug 14, 2020 at 2:36 AM Wenchen Fan <cloud0...@gmail.com> wrote: >>> >>>> View should have a fixed schema like a table. It should either be >>>> inferred from the query when creating the view, or be specified by the user >>>> manually like CREATE VIEW v(a, b) AS SELECT.... Users can still alter >>>> view schema manually. >>>> >>>> Basically a view is just a named SQL query, which mostly has fixed >>>> schema unless you do something like SELECT *. >>>> >>>> On Fri, Aug 14, 2020 at 8:39 AM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> +1 to making views as special forms of tables. Sometimes a table can >>>>> be converted to a view to hide some of the implementation details while >>>>> not >>>>> impacting readers (provided that the write path is controlled). Also, >>>>> views >>>>> can be defined on top of either other views or base tables, so the less >>>>> divergence in code paths between views and tables the better. >>>>> >>>>> For whether to materialize view schema or infer it, one of the issues >>>>> we face with the HMS approach of materialization is that when the >>>>> underlying table schema evolves, HMS will still keep the view schema >>>>> unchanged. This causes a number of discrepancies that we address >>>>> out-of-band (e.g., run separate pipeline to ensure view schema freshness, >>>>> or just re-derive it at read time (example derivation algorithm for >>>>> view Avro schema >>>>> <https://github.com/linkedin/coral/blob/master/coral-schema/src/main/java/com/linkedin/coral/schema/avro/ViewToAvroSchemaConverter.java> >>>>> )). >>>>> >>>>> Also regarding SupportsRead vs SupportWrite, some views can be >>>>> updateable (example from MySQL >>>>> https://dev.mysql.com/doc/refman/8.0/en/view-updatability.html), but >>>>> also implementing that requires a few concepts that are more prominent in >>>>> an RDBMS. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Thu, Aug 13, 2020 at 5:09 PM Burak Yavuz <brk...@gmail.com> wrote: >>>>> >>>>>> My high level comment here is that as a naive person, I would expect >>>>>> a View to be a special form of Table that SupportsRead but doesn't >>>>>> SupportWrite. loadTable in the TableCatalog API should load both tables >>>>>> and >>>>>> views. This way you avoid multiple RPCs to a catalog or data source or >>>>>> metastore, and you avoid namespace/name conflits. Also you make yourself >>>>>> less susceptible to race conditions (which still inherently exist). >>>>>> >>>>>> In addition, I'm not a SQL expert, but I thought that views are >>>>>> evaluated at runtime, therefore we shouldn't be persisting things like >>>>>> the >>>>>> schema for a view. >>>>>> >>>>>> What do people think of making Views a special form of Table? >>>>>> >>>>>> Best, >>>>>> Burak >>>>>> >>>>>> >>>>>> On Thu, Aug 13, 2020 at 2:40 PM John Zhuge <jzh...@apache.org> wrote: >>>>>> >>>>>>> Thanks Ryan. >>>>>>> >>>>>>> ViewCatalog API mimics TableCatalog API including how shared >>>>>>> namespace is handled: >>>>>>> >>>>>>> - The doc for createView >>>>>>> >>>>>>> <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109> >>>>>>> states >>>>>>> "it will throw ViewAlreadyExistsException when a view or table >>>>>>> already >>>>>>> exists for the identifier." >>>>>>> - The doc for loadView >>>>>>> >>>>>>> <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75> >>>>>>> states >>>>>>> "If the catalog supports tables and contains a table for the >>>>>>> identifier and >>>>>>> not a view, this must throw NoSuchViewException." >>>>>>> >>>>>>> Agree it is good to explicitly specify the order of resolution. I >>>>>>> will add a section in ViewCatalog javadoc to summarize the behavior for >>>>>>> "shared namespace". The loadView doc will also be updated to spell out >>>>>>> the >>>>>>> order of resolution. >>>>>>> >>>>>>> On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid> >>>>>>> wrote: >>>>>>> >>>>>>>> I agree with Wenchen that we need to be clear about resolution and >>>>>>>> behavior. For example, I think that we would agree that CREATE >>>>>>>> VIEW catalog.schema.name should fail when there is a table named >>>>>>>> catalog.schema.name. We’ve already included this behavior in the >>>>>>>> documentation for the TableCatalog API >>>>>>>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->, >>>>>>>> where create should fail if a view exists for the identifier. >>>>>>>> >>>>>>>> >>>>>>>> I think it was simply assumed that we would use the same approach — >>>>>>>> the API requires that table and view names share a namespace. But it >>>>>>>> would >>>>>>>> be good to specifically note either the order in which resolution will >>>>>>>> happen (views are resolved first) or note that it is not allowed and >>>>>>>> behavior is not guaranteed. I prefer the first option. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jzh...@apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Wenchen, >>>>>>>>> >>>>>>>>> Thanks for the feedback! >>>>>>>>> >>>>>>>>> 1. Add a new View API. How to avoid name conflicts between table >>>>>>>>>> and view? When resolving relation, shall we lookup table catalog >>>>>>>>>> first or >>>>>>>>>> view catalog? >>>>>>>>> >>>>>>>>> >>>>>>>>> See clarification in SPIP section "Proposed Changes - Namespace": >>>>>>>>> >>>>>>>>> - The proposed new view substitution rule and the changes to >>>>>>>>> ResolveCatalogs should ensure the view catalog is looked up first >>>>>>>>> for a >>>>>>>>> "dual" catalog. >>>>>>>>> - The implementation for a "dual" catalog plugin should ensure: >>>>>>>>> - Creating a view in view catalog when a table of the same >>>>>>>>> name exists should fail. >>>>>>>>> - Creating a table in table catalog when a view of the >>>>>>>>> same name exists should fail as well. >>>>>>>>> >>>>>>>>> Agree with you that a new View API is more flexible. A couple of >>>>>>>>> notes: >>>>>>>>> >>>>>>>>> - We actually started a common view prototype using the single >>>>>>>>> catalog approach, but once we added more and more view metadata, >>>>>>>>> storing >>>>>>>>> them in table properties became not manageable, especially for the >>>>>>>>> feature >>>>>>>>> like "versioning". Eventually we opted for a view backend of S3 >>>>>>>>> JSON files. >>>>>>>>> - We'd like to move away from Hive metastore >>>>>>>>> >>>>>>>>> For more details and discussion, see SPIP section "Background and >>>>>>>>> Motivation". >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> John >>>>>>>>> >>>>>>>>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cloud0...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi John, >>>>>>>>>> >>>>>>>>>> Thanks for working on this! View support is very important to the >>>>>>>>>> catalog plugin API. >>>>>>>>>> >>>>>>>>>> After reading your doc, I have one high-level question: should >>>>>>>>>> view be a separated API or it's just a special type of table? >>>>>>>>>> >>>>>>>>>> AFAIK in most databases, tables and views share the same >>>>>>>>>> namespace. You can't create a view if a same-name table exists. In >>>>>>>>>> Hive, >>>>>>>>>> view is just a special type of table, so they are in the same >>>>>>>>>> namespace >>>>>>>>>> naturally. If we have both table catalog and view catalog, we need a >>>>>>>>>> mechanism to make sure there are no name conflicts. >>>>>>>>>> >>>>>>>>>> On the other hand, the view metadata is very simple that can be >>>>>>>>>> put in table properties. I'd like to see more thoughts to evaluate >>>>>>>>>> these 2 >>>>>>>>>> approaches: >>>>>>>>>> 1. *Add a new View API*. How to avoid name conflicts between >>>>>>>>>> table and view? When resolving relation, shall we lookup table >>>>>>>>>> catalog >>>>>>>>>> first or view catalog? >>>>>>>>>> 2. *Reuse the Table API*. How to indicate it's a view? What if >>>>>>>>>> we do want to store table and views separately? >>>>>>>>>> >>>>>>>>>> I think a new View API is more flexible. I'd vote for it if we >>>>>>>>>> can come up with a good mechanism to avoid name conflicts. >>>>>>>>>> >>>>>>>>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jzh...@apache.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Spark devs, >>>>>>>>>>> >>>>>>>>>>> I'd like to bring more attention to this SPIP. As >>>>>>>>>>> Dongjoon indicated in the email "Apache Spark 3.1 Feature >>>>>>>>>>> Expectation (Dec. >>>>>>>>>>> 2020)", this feature can be considered for 3.2 or even 3.1. >>>>>>>>>>> >>>>>>>>>>> View catalog builds on top of the catalog plugin system >>>>>>>>>>> introduced in DataSourceV2. It adds the “ViewCatalog” API to load, >>>>>>>>>>> create, >>>>>>>>>>> alter, and drop views. A catalog plugin can naturally implement both >>>>>>>>>>> ViewCatalog and TableCatalog. >>>>>>>>>>> >>>>>>>>>>> Our internal implementation has been in production for over 8 >>>>>>>>>>> months. Recently we extended it to support materialized views, for >>>>>>>>>>> the read >>>>>>>>>>> path initially. >>>>>>>>>>> >>>>>>>>>>> The PR has conflicts that I will resolve them shortly. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jzh...@apache.org> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi everyone, >>>>>>>>>>>> >>>>>>>>>>>> In order to disassociate view metadata from Hive Metastore and >>>>>>>>>>>> support different storage backends, I am proposing a new view >>>>>>>>>>>> catalog API >>>>>>>>>>>> to load, create, alter, and drop views. >>>>>>>>>>>> >>>>>>>>>>>> Document: >>>>>>>>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing >>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357 >>>>>>>>>>>> WIP PR: https://github.com/apache/spark/pull/28147 >>>>>>>>>>>> >>>>>>>>>>>> As part of a project to support common views across query >>>>>>>>>>>> engines like Spark and Presto, my team used the view catalog API >>>>>>>>>>>> in Spark >>>>>>>>>>>> implementation. The project has been in production over three >>>>>>>>>>>> months. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> John Zhuge >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> John Zhuge >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> John Zhuge >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ryan Blue >>>>>>>> Software Engineer >>>>>>>> Netflix >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> John Zhuge >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> -- >> John Zhuge >> >> >> > > -- John Zhuge