Thanks Jan! +1 for everyone to take a look before the discussion, and see if there are any missing options or major arguments.
I have also added images for all the options, which may be easier to parse than the big sheet. I am also including them here for people who do not have time to read through it:

*Option 1: Add storage table identifier in view metadata content*
[image: MV option 1.png]

*Option 2: Add storage table metadata file pointer in view object*
[image: MV option 2.png]

*Option 3: Add storage table metadata file pointer in view metadata content*
[image: MV option 3.png]

*Option 4: Embed table metadata in view metadata content*
[image: MV option 4.png]

*Option 5: New MV spec, MV object has table and view metadata file pointers*
[image: MV option 5.png]

*Option 6: New MV spec, MV metadata content embeds table and view metadata*
[image: MV option 6.png]

*Option 7: New MV spec, completely new MV metadata content*
[image: MV option 7.png]

-Jack

On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <jank...@mailbox.org.invalid> wrote:

> I think it's great to have a face-to-face discussion about this.
> Additionally, I would propose using Jack's document
> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
> as a common ground for the discussion, and that everyone has a quick look
> before the next community sync. If you think the document is still missing
> some arguments, please make suggestions to add them. This way we spend
> less time getting everyone up to speed and share a more common
> terminology.
>
> Looking forward to the discussion, best wishes
>
> Jan
>
> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>
> The calendar on the site is currently broken:
> https://iceberg.apache.org/community/#iceberg-community-events. It might
> help to fix it or share the meeting link here.
>
> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Sounds good, let's discuss this in person!
>>
>> I am a bit worried that we have quite a few critical topics going on
>> right now on devlist, and this will take up a lot of time to discuss.
>> If it ends up going for too long, I propose that we have a dedicated
>> meeting, and I am more than happy to organize it.
>>
>> Best,
>> Jack Ye
>>
>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Hey everyone,
>>>
>>> I think this thread has hit a point of diminishing returns and that we
>>> still don't have a common understanding of what the options under
>>> consideration actually are.
>>>
>>> Since we were already planning on discussing this at the next community
>>> sync, I suggest we pick this up there and use that time to align on what
>>> exactly we're considering. We can then start a new thread to lay out the
>>> designs under consideration in more detail and then have a discussion
>>> about trade-offs.
>>>
>>> Does that sound reasonable?
>>>
>>> Ryan
>>>
>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa
>>> <wa.moust...@gmail.com> wrote:
>>>
>>>> I am finding it hard to interpret the options concretely. I would also
>>>> suggest breaking the expectation/outcome into milestones. Maybe it
>>>> becomes easier if we agree to distinguish between an approach that is
>>>> feasible in the near term and another in the long term, especially if
>>>> the latter requires significant engine-side changes.
>>>>
>>>> Further, maybe it helps if we start with an option that fully reuses
>>>> the existing spec, and see how we view it in comparison with the options
>>>> discussed previously. I am sharing one below. It reuses the current spec
>>>> of Iceberg views and tables by leveraging table properties to capture
>>>> materialized view metadata. What is common (and not common) between this
>>>> and the desired representations?
>>>>
>>>> The new properties are:
>>>>
>>>> Properties on a View:
>>>>
>>>> 1. *iceberg.materialized.view*:
>>>>    - *Type*: View property
>>>>    - *Purpose*: This property is used to mark whether a view is a
>>>>      materialized view. If set to true, the view is treated as a
>>>>      materialized view.
>>>>      This helps in differentiating between virtual and
>>>>      materialized views within the catalog and dictates specific
>>>>      handling and validation logic for materialized views.
>>>> 2. *iceberg.materialized.view.storage.location*:
>>>>    - *Type*: View property
>>>>    - *Purpose*: Specifies the location of the storage table
>>>>      associated with the materialized view. This property is used for
>>>>      linking a materialized view with its corresponding storage table,
>>>>      enabling data management and query execution based on the stored
>>>>      data freshness.
>>>>
>>>> Properties on a Table:
>>>>
>>>> 1. *base.snapshot.[UUID]*:
>>>>    - *Type*: Table property
>>>>    - *Purpose*: These properties store the snapshot IDs of the base
>>>>      tables at the time the materialized view's data was last updated.
>>>>      Each property is prefixed with base.snapshot. followed by the UUID
>>>>      of the base table. They are used to track whether the materialized
>>>>      view's data is up to date with the base tables by comparing these
>>>>      snapshot IDs with the current snapshot IDs of the base tables. If
>>>>      all the base tables' current snapshot IDs match the ones stored in
>>>>      these properties, the materialized view's data is considered fresh.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> > All of these approaches are aligned in one, specific way: the
>>>>> > storage table is an iceberg table.
>>>>>
>>>>> I do not think that is true. I think people are aligned that we would
>>>>> like to re-use the Iceberg table metadata defined in the Iceberg table
>>>>> spec to express the data in MV, but I don't think it goes so far as to
>>>>> say it must be an Iceberg table. Once you have that mindset, then of
>>>>> course option 1 (separate table and view) is the only option.
>>>>>
>>>>> > I don't think that is necessary and it significantly increases the
>>>>> > complexity.
>>>>> >>>>> And can you quantify what you mean by "significantly increases the >>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff >>>>> with >>>>> complexity. We probably all agree that using option 7 (a completely new >>>>> metadata type) is a lot of work from scratch, that is why it is not >>>>> favored. However, my understanding is that as long as we re-use the view >>>>> and table metadata, then the majority of the existing logic can be reused. >>>>> I think what we have gone through in Slack to draft the rough Java API >>>>> shape helps here, because people can estimate the amount of effort >>>>> required >>>>> to implement it. And I don't think they are **significantly** more complex >>>>> to implement. Could you elaborate more about the complexity that you >>>>> imagine? >>>>> >>>>> -Jack >>>>> >>>>> >>>>> >>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com> >>>>> wrote: >>>>> >>>>>> I feel I've been most vocal about pushing back against options 2+ (or >>>>>> Ryan's categories of combined table/view, or new metadata type), so I'll >>>>>> try to expand on my reasoning. >>>>>> >>>>>> I understand the appeal of creating a design where we encapsulate the >>>>>> view/storage from both a structural and performance standpoint, but I >>>>>> don't >>>>>> think that is necessary and it significantly increases the complexity. >>>>>> >>>>>> All of these approaches are aligned in one, specific way: the storage >>>>>> table is an iceberg table. >>>>>> >>>>>> Because of this, all the behaviors and requirements still apply to >>>>>> these tables. They need to be maintained (snapshot cleanup, orphan >>>>>> files), >>>>>> in cases need to be optimized (compaction, manifest rewrites), they need >>>>>> to >>>>>> be able to be inspected (this will be even more important with MV since >>>>>> staleness can produce different results and questions will arise about >>>>>> what >>>>>> state the storage table was in). 
There may be cases where the tables >>>>>> need >>>>>> to be managed directly. >>>>>> >>>>>> Anywhere we deviate from the existing constructs/commit/access for >>>>>> tables, we will ultimately have to then unwrap to re-expose the >>>>>> underlying >>>>>> Iceberg behavior. This creates unnecessary complexity in the library/API >>>>>> layer, which are not the primary interface users will have with >>>>>> materialized views where an engine is almost entirely necessary to >>>>>> interact >>>>>> with the dataset. >>>>>> >>>>>> As to the performance concerns around option 1, I think we're >>>>>> overstating the downsides. It really comes down to how many metadata >>>>>> loads >>>>>> are necessary and evaluating freshness would likely be the real >>>>>> bottleneck >>>>>> as it involves potentially loading many tables. All of the options are >>>>>> on >>>>>> the same order of performance for the metadata and table loads. >>>>>> >>>>>> As to the visibility of tables and whether they're registered in the >>>>>> catalog, I think registering in the catalog is the right approach so that >>>>>> the tables are still addressable for maintenance/etc. The visibility of >>>>>> the storage table is a catalog implementation decision and shouldn't be a >>>>>> requirement of the MV spec (I can see cases for both and it isn't >>>>>> necessary >>>>>> to dictate a behavior). >>>>>> >>>>>> I'm still strongly in favor of Option 1 (separate table and view) for >>>>>> these reasons. >>>>>> >>>>>> -Dan >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>> >>>>>>> > Jack, it sounds like you’re the proponent of a combined table and >>>>>>> view (rather than a new metadata spec for a materialized view). What is >>>>>>> the >>>>>>> main motivation? It seems like you’re convinced of that approach, but I >>>>>>> don’t understand the advantage it brings. 
>>>>>>> >>>>>>> Sorry I have to make a Google Sheet to capture all the options we >>>>>>> have discussed so far, I wanted to use the existing Google Doc, but it >>>>>>> has >>>>>>> really bad table/sheet support... >>>>>>> >>>>>>> >>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>>>>> >>>>>>> I have listed all the options, with how they are implemented and >>>>>>> some important considerations we have discussed so far. Note that: >>>>>>> 1. This sheet currently excludes the lineage information, which we >>>>>>> can discuss more later after the current topic is resolved. >>>>>>> 2. I removed the considerations for REST integration since from the >>>>>>> other thread we have clarified that they should be considered completely >>>>>>> separately. >>>>>>> >>>>>>> *Why I come as a proponent of having a new MV object with table and >>>>>>> view metadata file pointer* >>>>>>> >>>>>>> In my sheet, there are 3 options that do not have major problems: >>>>>>> Option 2: Add storage table metadata file pointer in view object >>>>>>> Option 5: New MV object with table and view metadata file pointer >>>>>>> Option 6: New MV spec with table and view metadata >>>>>>> >>>>>>> I originally excluded option 2 because I think it does not align >>>>>>> with the REST spec, but after the other discussion thread about >>>>>>> "Inconsistency >>>>>>> between REST spec and table/view spec", I think my original concern no >>>>>>> longer holds true so now I put it back. And based on my personal >>>>>>> preference that MV is an independent object that should be separated >>>>>>> from >>>>>>> view and table, plus the fact that option 5 is probably less work than >>>>>>> option 6 for implementation, that is how I come as a proponent of >>>>>>> option 5 >>>>>>> at this moment. >>>>>>> >>>>>>> >>>>>>> *Regarding Ryan's evaluation framework * >>>>>>> >>>>>>> I think we need to reconcile this sheet with Ryan's evaluation >>>>>>> framework. 
>>>>>>> That framework categorization puts options 2, 3, 4, 5, and 6 all
>>>>>>> under the same category of "A combination of a view and a table"
>>>>>>> and concludes that they don't have any advantage for the same set of
>>>>>>> reasons. But those reasons are not really convincing to me, so let's
>>>>>>> talk about them in more detail.
>>>>>>>
>>>>>>> (1) You said "I don’t see a reason why a combined view and table is
>>>>>>> advantageous" as "this would cause unnecessary dependence between the
>>>>>>> view and table in catalogs." What dependency exactly do you mean here?
>>>>>>> And why is that unnecessary, given there has to be some sort of
>>>>>>> dependency anyway unless we go with option 5 or 6?
>>>>>>>
>>>>>>> (2) You said "I guess there’s an argument that you could load both
>>>>>>> table and view metadata locations at the same time. That hardly seems
>>>>>>> worth the trouble". I disagree with that. Catalog interaction
>>>>>>> performance is critical to at least everyone working in EMR and
>>>>>>> Athena, and MV itself as an acceleration approach needs to be as fast
>>>>>>> as possible.
>>>>>>>
>>>>>>> I have put 3 key operations in the doc that I think matter for MV
>>>>>>> during interactions with the engine:
>>>>>>> 1. refresh the storage table
>>>>>>> 2. get the storage table of the MV
>>>>>>> 3. if stale, get the view SQL
>>>>>>>
>>>>>>> And option 1 clearly falls short with 4 sequential steps required to
>>>>>>> load a storage table. You mentioned "recent issues with adding views
>>>>>>> to the JDBC catalog" in this topic, could you explain a bit more?
>>>>>>>
>>>>>>> (3) You said "I also think that once we decide on structure, we can
>>>>>>> make it possible for REST catalog implementations to do smart things,
>>>>>>> in a way that doesn’t put additional requirements on the underlying
>>>>>>> catalog store." If REST is fully compatible with the Iceberg spec then
>>>>>>> I have no problem with this statement.
>>>>>>> However, as we discussed in the other thread, it is not the case. In
>>>>>>> the current state, I think the sequence of actions should be to evolve
>>>>>>> the Iceberg table/view spec (or add a MV spec) first, and then think
>>>>>>> about how REST can incorporate it or do smart things that are not
>>>>>>> Iceberg spec compliant. Do you agree with that?
>>>>>>>
>>>>>>> (4) You said the table identifier pointer "is a problem we need to
>>>>>>> solve generally because a materialized table needs to be able to track
>>>>>>> the upstream state of tables that were used". I don't think that is a
>>>>>>> reason to choose to use a table identifier pointer for a storage
>>>>>>> table. The issue is not about using a table identifier pointer. It is
>>>>>>> about exposing the storage table as a separate entity in the catalog,
>>>>>>> which is what people do not like and is already discussed at length in
>>>>>>> Jan's question 3 (also linked in the sheet). I agree with that
>>>>>>> statement, because without a REST implementation that can magically
>>>>>>> hide the storage table, this model adds additional burden regarding
>>>>>>> compliance and data governance for any other non-REST catalog
>>>>>>> implementations that are compliant with the Iceberg spec. Many
>>>>>>> mechanisms would need to be built in a catalog to hide, protect,
>>>>>>> maintain, and recycle the storage table, all of which can be avoided
>>>>>>> by using other approaches. I think we should reach a consensus about
>>>>>>> that and discuss further if you do not agree.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>>>>>>>> Where your categories correspond to the following designs: >>>>>>>> >>>>>>>> - Separate table and view => Design 1 >>>>>>>> - Combination of view and table => Design 2 >>>>>>>> - A new metadata type => Design 4 >>>>>>>> >>>>>>>> Jan >>>>>>>> On 01.03.24 00:03, Ryan Blue wrote: >>>>>>>> >>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so >>>>>>>> I’ll be more specific: >>>>>>>> >>>>>>>> - *Separate table and view*: this option is to have the objects >>>>>>>> that we have today, with extra metadata. Commit processes are >>>>>>>> separate: >>>>>>>> committing to the table doesn’t alter the view and committing to >>>>>>>> the view >>>>>>>> doesn’t change the table. However, changing the view can make it so >>>>>>>> the >>>>>>>> table is no longer useful as a materialization. >>>>>>>> - *A combination of a view and a table*: in this option, the >>>>>>>> table metadata and view metadata are the same as the first option. >>>>>>>> The >>>>>>>> difference is that the commit process combines them, either by >>>>>>>> embedding a >>>>>>>> table metadata location in view metadata or by tracking both in the >>>>>>>> same >>>>>>>> catalog reference. >>>>>>>> - *A new metadata type*: this option is where we define a new >>>>>>>> metadata object that has view attributes, like SQL representations, >>>>>>>> along >>>>>>>> with table attributes, like partition specs and snapshots. >>>>>>>> >>>>>>>> Hopefully this is clear because I think much of the confusion is >>>>>>>> caused by different definitions. >>>>>>>> >>>>>>>> The LoadTableResponse having optional metadata-location field >>>>>>>> implies that the object in the catalog no longer needs to hold a >>>>>>>> metadata >>>>>>>> file pointer >>>>>>>> >>>>>>>> The REST protocol has not removed the requirement for a metadata >>>>>>>> file, so I’m going to keep focused on the MV design options. 
>>>>>>>> >>>>>>>> When we say a MV can be a “new metadata type”, it does not mean it >>>>>>>> needs to define a completely brand new structure of the metadata >>>>>>>> content >>>>>>>> >>>>>>>> I’m making a distinction between separate metadata files for the >>>>>>>> table and the view and a combined metadata object, as above. >>>>>>>> >>>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which >>>>>>>> has 1 table metadata file pointer, and 1 view metadata file pointer >>>>>>>> >>>>>>>> This is the option I am referring to as a “combination of a view >>>>>>>> and a table”. >>>>>>>> >>>>>>>> So to review my initial email, I don’t see a reason why a combined >>>>>>>> view and table is advantageous, either implemented by having a catalog >>>>>>>> reference with two metadata locations or embedding a table metadata >>>>>>>> location in view metadata. This would cause unnecessary dependence >>>>>>>> between >>>>>>>> the view and table in catalogs. I guess there’s an argument that you >>>>>>>> could >>>>>>>> load both table and view metadata locations at the same time. That >>>>>>>> hardly >>>>>>>> seems worth the trouble given the recent issues with adding views to >>>>>>>> the >>>>>>>> JDBC catalog. >>>>>>>> >>>>>>>> I also think that once we decide on structure, we can make it >>>>>>>> possible for REST catalog implementations to do smart things, in a way >>>>>>>> that >>>>>>>> doesn’t put additional requirements on the underlying catalog store. >>>>>>>> For >>>>>>>> instance, we could specify how to send additional objects in a >>>>>>>> LoadViewResult, in case the catalog wants to pre-fetch table metadata. >>>>>>>> I >>>>>>>> think these optimizations are a later addition, after we define the >>>>>>>> relationship between views and tables. >>>>>>>> >>>>>>>> Jack, it sounds like you’re the proponent of a combined table and >>>>>>>> view (rather than a new metadata spec for a materialized view). What >>>>>>>> is the >>>>>>>> main motivation? 
It seems like you’re convinced of that approach, but I >>>>>>>> don’t understand the advantage it brings. >>>>>>>> >>>>>>>> Ryan >>>>>>>> >>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi >>>>>>>>> >>>>>>>>> Yes I mostly agree with the assessment. To clarify a few minor >>>>>>>>> points. >>>>>>>>> >>>>>>>>> is a materialized view a view and a separate table, a combination >>>>>>>>>> of the two (i.e. commits are combined), or a new metadata type? >>>>>>>>> >>>>>>>>> >>>>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal >>>>>>>>> of a new Catalog MV object that has two references (ViewMetadata + >>>>>>>>> TableMetadata). >>>>>>>>> >>>>>>>>> The arguments that I see for a combined materialized view object >>>>>>>>>> are: >>>>>>>>>> >>>>>>>>>> - Regular views are separate, rather than being tables with >>>>>>>>>> SQL and no data so it would be inconsistent (“Iceberg view is >>>>>>>>>> just a table >>>>>>>>>> with no data but with representations defined. But we did not do >>>>>>>>>> that.”) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>>> materialized views >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>>> isn’t required by the separate view and table option >>>>>>>>>> >>>>>>>>>> For completeness, there seem to be a few additional ones >>>>>>>>> (mentioned in the Slack and above messages). >>>>>>>>> >>>>>>>>> - Lack of spec change (to ViewMetadata). 
>>>>>>>>>   But as Jack says, it is a spec change (i.e., to catalogs)
>>>>>>>>> - A single call to get the View's StorageTable (versus two calls)
>>>>>>>>> - A more natural API, no opportunity for user to call
>>>>>>>>>   Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>
>>>>>>>>> *Thoughts:* I think the long discussion sessions we had on Slack
>>>>>>>>> were fruitful for me, as seeing the API clarified some things.
>>>>>>>>>
>>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>>> (TableMetadata + ViewMetadata). But seeing most of the MV operations
>>>>>>>>> end up being ViewCatalog or Catalog operations, I am starting to
>>>>>>>>> think API-wise that it may not align with the new metadata type
>>>>>>>>> (unless we define MVCatalog and /MV REST endpoints, which then are
>>>>>>>>> boilerplate wrappers).
>>>>>>>>>
>>>>>>>>> Initially, one question I had for the option 'a view and a separate
>>>>>>>>> table' was how to make this table reference (metadata.json or
>>>>>>>>> catalog reference). In the previous option, we had a precedent of
>>>>>>>>> Catalog references to Metadata, but not pointers between Metadatas.
>>>>>>>>> I initially saw the proposed Catalog's TableIdentifier pointer as
>>>>>>>>> 'polluting' catalog concerns in ViewMetadata. (I saw Catalog and
>>>>>>>>> ViewCatalog as a layer above TableMetadata and ViewMetadata). But I
>>>>>>>>> think Dan in the Slack made a fair point that ViewMetadata already
>>>>>>>>> is tightly bound with a Catalog. In this case, I think this approach
>>>>>>>>> does have its merits as well in aligning Catalog APIs with the
>>>>>>>>> metadata.
>>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Szehon >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> I would like to provide my perspective on the question of what a >>>>>>>>>> materialized view is and elaborate on Jack's recent proposal to view >>>>>>>>>> a >>>>>>>>>> materialized view as a catalog concept. >>>>>>>>>> >>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in >>>>>>>>>> the catalog has a *unique identifier*, and the catalog provides >>>>>>>>>> methods to create, load, and update these entities. An important >>>>>>>>>> thing to >>>>>>>>>> note is that the catalog methods exhibit two different behaviors: >>>>>>>>>> the *create >>>>>>>>>> and load methods deal with the entire entity*, while the >>>>>>>>>> *update(commit) >>>>>>>>>> method only deals with partial changes* to the entities. >>>>>>>>>> >>>>>>>>>> In the context of our current discussion, materialized view (MV) >>>>>>>>>> metadata is a union of view and table metadata. The fact that the >>>>>>>>>> update >>>>>>>>>> method deals only with partial changes, enables us to *reuse the >>>>>>>>>> existing methods for updating tables and views*. For updates we >>>>>>>>>> don't have to define what constitutes an entire materialized view. >>>>>>>>>> Changes >>>>>>>>>> to a materialized view targeting the properties related to the view >>>>>>>>>> metadata could use the update(commit) view method. Similarly, changes >>>>>>>>>> targeting the properties related to the table metadata could use the >>>>>>>>>> update(commit) table method. This is great news because we don't >>>>>>>>>> have to >>>>>>>>>> redefine view and table commits (requirements, updates). 
>>>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>>>> update the storage table for Options 1 and 3:
>>>>>>>>>>
>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>
>>>>>>>>>> The open question is *whether the create and load methods should
>>>>>>>>>> treat the properties that constitute the MV metadata as two entities
>>>>>>>>>> (View + Table) or one entity (new MV object)*. This is all part of
>>>>>>>>>> Jack's proposal, where Option 1 proposes a new MV object, and
>>>>>>>>>> Option 3 proposes two separate entities. The advantage of Option 1
>>>>>>>>>> is that it doesn't require two operations to load the metadata. On
>>>>>>>>>> the other hand, the advantage of Option 3 is that no new operations
>>>>>>>>>> or catalogs have to be defined.
>>>>>>>>>>
>>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see a
>>>>>>>>>> path where we could first introduce Option 3 and still have the
>>>>>>>>>> possibility to transition to Option 1 if needed. The great thing
>>>>>>>>>> about Option 3 is that it only requires minor changes to the current
>>>>>>>>>> spec and is mostly implementation detail.
>>>>>>>>>>
>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3 that
>>>>>>>>>> only introduce changes to the spec that are not specific to
>>>>>>>>>> materialized views. The idea is to introduce boolean properties to
>>>>>>>>>> be set on the creation of the view and the storage table that
>>>>>>>>>> indicate that they belong to a materialized view. The view property
>>>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>>>> view.
>>>>>>>>>> And the table property "storage_table" is set to "true" for a
>>>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>>>> properties indicates a regular view or table.
>>>>>>>>>>
>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>
>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>
>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>
>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>
>>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>>> called "AssertProperty" which could make sure to only perform
>>>>>>>>>> updates that are in line with materialized views. The additional
>>>>>>>>>> requirement can be seen as a general extension which does not need
>>>>>>>>>> to be changed if we decide to go with Option 1 in the future.
>>>>>>>>>>
>>>>>>>>>> Let me know what you think.
>>>>>>>>>>
>>>>>>>>>> Best wishes,
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>>
>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>> metadata definitions and minimizing spec changes are very important.
>>>>>>>>>> This also minimizes spec drift (between materialized views and views
>>>>>>>>>> spec, and between materialized views and tables spec), and
>>>>>>>>>> simplifies the implementation.
>>>>>>>>>> >>>>>>>>>> In an effort to take the discussion forward with concrete design >>>>>>>>>> options based on an end-to-end implementation, I have prototyped the >>>>>>>>>> implementation (and added Spark support) in this PR >>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us >>>>>>>>>> reach convergence faster. More details about some of the design >>>>>>>>>> options are >>>>>>>>>> discussed in the description of the PR. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Walaa. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I mean separate table and view metadata that is somehow combined >>>>>>>>>>> through a commit process. For instance, keeping a pointer to a table >>>>>>>>>>> metadata file in a view metadata file or combining commits to >>>>>>>>>>> reference >>>>>>>>>>> both. I don't see the value in either option. >>>>>>>>>>> >>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question! >>>>>>>>>>>> Just a clarification question regarding your reply before I reply >>>>>>>>>>>> further: >>>>>>>>>>>> what exactly does the option "a combination of the two (i.e. >>>>>>>>>>>> commits are >>>>>>>>>>>> combined)" mean? How is that different from "a new metadata type"? >>>>>>>>>>>> >>>>>>>>>>>> -Jack >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring >>>>>>>>>>>>> a fresh perspective. >>>>>>>>>>>>> >>>>>>>>>>>>> Jack already pointed out that we need to start from the basics >>>>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right >>>>>>>>>>>>> now is the >>>>>>>>>>>>> time for discussing trade-offs, not lining up and taking sides. 
I >>>>>>>>>>>>> realize >>>>>>>>>>>>> that wasn’t the intent with adding a vote, but that’s almost >>>>>>>>>>>>> always the >>>>>>>>>>>>> result. It’s too easy to use it as a stand-in for consensus and >>>>>>>>>>>>> move on >>>>>>>>>>>>> prematurely. I get the impression from the swirl in Slack that >>>>>>>>>>>>> discussion >>>>>>>>>>>>> has moved ahead of agreement. >>>>>>>>>>>>> >>>>>>>>>>>>> We’re still at the most basic question: is a materialized view >>>>>>>>>>>>> a view and a separate table, a combination of the two (i.e. >>>>>>>>>>>>> commits are >>>>>>>>>>>>> combined), or a new metadata type? >>>>>>>>>>>>> >>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some >>>>>>>>>>>>> kind of “system table” (meaning hidden?) or if it is exposed in >>>>>>>>>>>>> the >>>>>>>>>>>>> catalog. That’s a later choice (already pointed out) and, I >>>>>>>>>>>>> suspect, it >>>>>>>>>>>>> should be delegated to catalog implementations. >>>>>>>>>>>>> >>>>>>>>>>>>> To simplify this a little, I think that we can eliminate the >>>>>>>>>>>>> option to combine table and view commits. I don’t think there is >>>>>>>>>>>>> a reason >>>>>>>>>>>>> to combine the two. If separate, a table would track the view >>>>>>>>>>>>> version used >>>>>>>>>>>>> along with freshness information for referenced tables. If the >>>>>>>>>>>>> table is >>>>>>>>>>>>> automatically skipped when the version no longer matches the >>>>>>>>>>>>> view, then no >>>>>>>>>>>>> action needs to happen when a view definition changes. Similarly, >>>>>>>>>>>>> the table >>>>>>>>>>>>> can be updated independently without needing to also swap view >>>>>>>>>>>>> metadata. >>>>>>>>>>>>> This also aligns with the idea from the original doc that there >>>>>>>>>>>>> can be >>>>>>>>>>>>> multiple materialization tables for a view. 
>>>>>>>>>>>>> Each should operate independently unless I’m missing something.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious
>>>>>>>>>>>>> so I’ll move on, but please stop here and reply if you disagree!
>>>>>>>>>>>>>
>>>>>>>>>>>>> That leaves the two main options: a view and a separate table
>>>>>>>>>>>>> linked by metadata, or combined materialized view metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As the doc notes, the separate view and table option is
>>>>>>>>>>>>> simpler because it reuses existing metadata definitions and
>>>>>>>>>>>>> falls back to simple views. That is a significantly smaller spec
>>>>>>>>>>>>> and small is very, very important when it comes to specs. I
>>>>>>>>>>>>> think that the argument for a new definition of a materialized
>>>>>>>>>>>>> view needs to overcome this disadvantage.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Regular views are separate, rather than being tables with
>>>>>>>>>>>>>   SQL and no data so it would be inconsistent (“Iceberg view is
>>>>>>>>>>>>>   just a table with no data but with representations defined.
>>>>>>>>>>>>>   But we did not do that.”)
>>>>>>>>>>>>> - Materialized views are different objects in DDL
>>>>>>>>>>>>> - Tables may be a superset of functionality needed for
>>>>>>>>>>>>>   materialized views
>>>>>>>>>>>>> - Tables are not typically exposed to end users, but this
>>>>>>>>>>>>>   isn’t required by the separate view and table option
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>>>>>>> >>>>>>>>>>>>> Ryan >>>>>>>>>>>>> -- >>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>> Tabular >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Ryan Blue >>>>>>>>>>> Tabular >>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ryan Blue >>>>>>>> Tabular >>>>>>>> >>>>>>>> >>> >>> -- >>> Ryan Blue >>> Tabular >>> >>
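
For readers who want something concrete, the freshness rule from Walaa's property-based proposal earlier in this thread (compare the snapshot IDs stored under base.snapshot.[UUID] with the base tables' current snapshot IDs) can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the thread or from the Iceberg library: the class MvFreshness, the method isFresh, and the plain maps standing in for storage table properties and catalog lookups are all hypothetical, and treating a storage table with no recorded base snapshots as stale is an added assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Iceberg API.
final class MvFreshness {
    // Property prefix taken from the proposal quoted in this thread.
    static final String PREFIX = "base.snapshot.";

    private MvFreshness() {}

    /**
     * Returns true only if every base table recorded in the storage table
     * properties still has the snapshot ID recorded at the last refresh.
     *
     * @param storageTableProps  properties of the storage table (stand-in map)
     * @param currentSnapshotIds base table UUID -> current snapshot ID (stand-in map)
     */
    static boolean isFresh(Map<String, String> storageTableProps,
                           Map<String, Long> currentSnapshotIds) {
        boolean sawBaseTable = false;
        for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
            if (!entry.getKey().startsWith(PREFIX)) {
                continue; // unrelated table property
            }
            sawBaseTable = true;
            String baseTableUuid = entry.getKey().substring(PREFIX.length());
            Long current = currentSnapshotIds.get(baseTableUuid);
            // A missing base table or a different snapshot ID means stale.
            if (current == null || !current.toString().equals(entry.getValue().trim())) {
                return false;
            }
        }
        // Assumption: with no recorded base snapshots, treat the MV as stale.
        return sawBaseTable;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("base.snapshot.uuid-a", "101");
        props.put("base.snapshot.uuid-b", "202");
        props.put("some.other.property", "x");

        Map<String, Long> current = new HashMap<>();
        current.put("uuid-a", 101L);
        current.put("uuid-b", 202L);
        System.out.println(isFresh(props, current)); // prints "true"

        current.put("uuid-b", 203L); // base table uuid-b received a new snapshot
        System.out.println(isFresh(props, current)); // prints "false"
    }
}
```

Note that the check needs one current-snapshot lookup per base table regardless of how the view and storage table are linked, which is consistent with Dan's remark that evaluating freshness is likely the real bottleneck.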