Thanks Walaa. Option 1 seems to be the better one, and one of the primary reasons is that it keeps things simple for the engine.
Regards,
Manish

On Sun, Mar 24, 2024 at 5:02 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi, Walaa:
>
> Thanks for your summary. I lean toward option 1, due to the huge effort
> for engines to adopt new spec and api.
>
> On Sun, Mar 24, 2024 at 8:49 AM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Thanks Walaa for the write-up. The option 1 looks good to me.
>> Yufei
>>
>>
>> On Sat, Mar 23, 2024 at 5:05 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> I have started the doc here
>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit>.
>>> I have given folks on this thread edit access and everyone else comment
>>> access. Feel free to suggest edits (preferred to direct edits, but feel
>>> free to directly edit), or add comments. Once we agree on the pros and cons
>>> (and possibly any missing details of either approach/design), we can move
>>> to the next step (e.g., already reaching consensus or voting).
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Fri, Mar 22, 2024 at 7:19 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> Yes, will share the doc tomorrow.
>>>>
>>>> On Fri, Mar 22, 2024 at 10:49 AM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sounds good to me, can you start a document then, and we can all
>>>>> contribute there?
>>>>>
>>>>> On Fri, Mar 22, 2024 at 10:47 AM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Let us list the pros and cons as originally planned. I can help as well
>>>>>> if needed. We can get started and have Jack chime in when he is back?
>>>>>>
>>>>>> On Fri, Mar 22, 2024 at 10:35 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> My understanding was last time it was still unresolved, and the
>>>>>>> action item was on Jack and/or Jan to make a shorter document. I think
>>>>>>> the debate now has boiled down to Ryan's three options:
>>>>>>>
>>>>>>> 1. separate table/view
>>>>>>> 2.
combination of table/view tied together via commit
>>>>>>> 3. new metadata type
>>>>>>>
>>>>>>> with probably the first and third being the main contenders. My
>>>>>>> understanding was we wanted a table of pros/cons between (1) and (3),
>>>>>>> presumably giving folks a chance to address the cons, before the next
>>>>>>> meeting.
>>>>>>>
>>>>>>> Jack (main proponent of option (3)) just went on paternity leave, so
>>>>>>> not sure if there was someone from Amazon with some context of Jack's
>>>>>>> thought to continue that train of thought though? Otherwise maybe Jan
>>>>>>> can give it a shot? Else I will be out and can't make the next Iceberg
>>>>>>> sync, but can prepare one for the one after that, if needed.
>>>>>>>
>>>>>>> Re: the 'new' proposal, not sure if we are ready for a formal one,
>>>>>>> given the deadlock between the two options, but I'm open to that as well
>>>>>>> to make a proposal based on one of the options above. What do folks think?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Fri, Mar 22, 2024 at 3:15 AM Renjie Liu <liurenjie2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Fri, Mar 22, 2024 at 16:42 Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Renjie,
>>>>>>>>>
>>>>>>>>> We discussed the MV proposal, without yet reaching any conclusion.
>>>>>>>>>
>>>>>>>>> I propose:
>>>>>>>>> - to use the "new" proposal process in place (creating a GH issue
>>>>>>>>> with proposal flag, with link to the document)
>>>>>>>>> - use the document and/or GH issue to add comments
>>>>>>>>> - finalize the document heading to a vote (to get consensus)
>>>>>>>>>
>>>>>>>>> Thoughts ?
>>>>>>>>> >>>>>>>>> NB: I will follow up with "stale PR/proposal" PR to be sure we are >>>>>>>>> moving forward ;) >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> JB >>>>>>>>> >>>>>>>>> On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu < >>>>>>>>> liurenjie2...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi: >>>>>>>>>> >>>>>>>>>> Sorry I didn't make it to join the last community sync. Did we >>>>>>>>>> reach any conclusion about mv spec? >>>>>>>>>> >>>>>>>>>> On Tue, Mar 5, 2024 at 11:28 PM himadri pal <meh...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> For me the calendar link did not work in mobile, but I was able >>>>>>>>>>> to add the dev Google calendar from >>>>>>>>>>> https://iceberg.apache.org/community/#iceberg-community-events by >>>>>>>>>>> accessing it from laptop. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Himadri Pal >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa < >>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks Jack! I think the images are stripped from the message, >>>>>>>>>>>> but they are there on the doc >>>>>>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0> >>>>>>>>>>>> if >>>>>>>>>>>> someone wants to check them out (I have left some comments while >>>>>>>>>>>> there). >>>>>>>>>>>> >>>>>>>>>>>> Also I no longer see the community sync calendar >>>>>>>>>>>> https://iceberg.apache.org/community/#slack, so it is unclear >>>>>>>>>>>> when the meeting is (and we do not have the link). >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Walaa. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thanks Jan! +1 for everyone to take a look before the >>>>>>>>>>>>> discussion, and see if there are any missing options or major >>>>>>>>>>>>> arguments. 
>>>>>>>>>>>>> >>>>>>>>>>>>> I have also added the images regarding all the options, it >>>>>>>>>>>>> might be easier to parse than the big sheet. I will also put it >>>>>>>>>>>>> here for >>>>>>>>>>>>> people that do not have time to read through it: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> *Option 1: Add storage table identifier in view metadata >>>>>>>>>>>>> content* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 1.png] >>>>>>>>>>>>> *Option 2: Add storage table metadata file pointer in view >>>>>>>>>>>>> object* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 2.png] >>>>>>>>>>>>> *Option 3: Add storage table metadata file pointer in view >>>>>>>>>>>>> metadata content* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 3.png] >>>>>>>>>>>>> >>>>>>>>>>>>> *Option 4: Embed table metadata in view metadata content* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 4.png] >>>>>>>>>>>>> *Option 5: New MV spec, MV object has table and view metadata >>>>>>>>>>>>> file pointers* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 5.png] >>>>>>>>>>>>> *Option 6: New MV spec, MV metadata content embeds table and >>>>>>>>>>>>> view metadata* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 6.png] >>>>>>>>>>>>> *Option 7: New MV spec, completely new MV metadata content* >>>>>>>>>>>>> >>>>>>>>>>>>> [image: MV option 7.png] >>>>>>>>>>>>> >>>>>>>>>>>>> -Jack >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul >>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I think it's great to have a face to face discussion about >>>>>>>>>>>>>> this. Additionally, I would propose to use Jacks' document >>>>>>>>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0> >>>>>>>>>>>>>> as a common ground for the discussion and that everyone has a >>>>>>>>>>>>>> quick look >>>>>>>>>>>>>> before the next community sync. 
If you think the document is >>>>>>>>>>>>>> still missing >>>>>>>>>>>>>> some arguments, please make suggestions to add them. This way we >>>>>>>>>>>>>> have to >>>>>>>>>>>>>> spend less time to get everyone up to speed and have a more >>>>>>>>>>>>>> common >>>>>>>>>>>>>> terminology. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Looking forward to the discussion, best wishes >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jan >>>>>>>>>>>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> The calendar on the site is currently broken >>>>>>>>>>>>>> https://iceberg.apache.org/community/#iceberg-community-events. >>>>>>>>>>>>>> Might help to fix it or share the meeting link here. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sounds good, let's discuss this in person! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I am a bit worried that we have quite a few critical topics >>>>>>>>>>>>>>> going on right now on devlist, and this will take up a lot of >>>>>>>>>>>>>>> time to >>>>>>>>>>>>>>> discuss. If it ends up going for too long, l propose let us >>>>>>>>>>>>>>> have a >>>>>>>>>>>>>>> dedicated meeting, and I am more than happy to organize it. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey everyone, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think this thread has hit a point of diminishing returns >>>>>>>>>>>>>>>> and that we still don't have a common understanding of what >>>>>>>>>>>>>>>> the options >>>>>>>>>>>>>>>> under consideration actually are. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Since we were already planning on discussing this at the >>>>>>>>>>>>>>>> next community sync, I suggest we pick this up there and use >>>>>>>>>>>>>>>> that time to >>>>>>>>>>>>>>>> align on what exactly we're considering. 
We can then start a >>>>>>>>>>>>>>>> new thread to >>>>>>>>>>>>>>>> lay out the designs under consideration in more detail and >>>>>>>>>>>>>>>> then have a >>>>>>>>>>>>>>>> discussion about trade-offs. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Does that sound reasonable? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa < >>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I am finding it hard to interpret the options concretely. >>>>>>>>>>>>>>>>> I would also suggest breaking the expectation/outcome to >>>>>>>>>>>>>>>>> milestones. Maybe >>>>>>>>>>>>>>>>> it becomes easier if we agree to distinguish between an >>>>>>>>>>>>>>>>> approach that is >>>>>>>>>>>>>>>>> feasible in the near term and another in the long term, >>>>>>>>>>>>>>>>> especially if the >>>>>>>>>>>>>>>>> latter requires significant engine-side changes. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Further, maybe it helps if we start with an option that >>>>>>>>>>>>>>>>> fully reuses the existing spec, and see how we view it in >>>>>>>>>>>>>>>>> comparison with >>>>>>>>>>>>>>>>> the options discussed previously. I am sharing one below. It >>>>>>>>>>>>>>>>> reuses the >>>>>>>>>>>>>>>>> current spec of Iceberg views and tables by leveraging table >>>>>>>>>>>>>>>>> properties to >>>>>>>>>>>>>>>>> capture materialized view metadata. What is common (and not >>>>>>>>>>>>>>>>> common) between >>>>>>>>>>>>>>>>> this and the desired representations? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The new properties are: >>>>>>>>>>>>>>>>> Properties on a View: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> *iceberg.materialized.view*: >>>>>>>>>>>>>>>>> - *Type*: View property >>>>>>>>>>>>>>>>> - *Purpose*: This property is used to mark whether >>>>>>>>>>>>>>>>> a view is a materialized view. If set to true, the >>>>>>>>>>>>>>>>> view is treated as a materialized view. 
This helps in >>>>>>>>>>>>>>>>> differentiating >>>>>>>>>>>>>>>>> between virtual and materialized views within the >>>>>>>>>>>>>>>>> catalog and dictates >>>>>>>>>>>>>>>>> specific handling and validation logic for materialized >>>>>>>>>>>>>>>>> views. >>>>>>>>>>>>>>>>> 2. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> *iceberg.materialized.view.storage.location*: >>>>>>>>>>>>>>>>> - *Type*: View property >>>>>>>>>>>>>>>>> - *Purpose*: Specifies the location of the storage >>>>>>>>>>>>>>>>> table associated with the materialized view. This >>>>>>>>>>>>>>>>> property is used for >>>>>>>>>>>>>>>>> linking a materialized view with its corresponding >>>>>>>>>>>>>>>>> storage table, enabling >>>>>>>>>>>>>>>>> data management and query execution based on the stored >>>>>>>>>>>>>>>>> data freshness. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Properties on a Table: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. *base.snapshot.[UUID]*: >>>>>>>>>>>>>>>>> - *Type*: Table property >>>>>>>>>>>>>>>>> - *Purpose*: These properties store the snapshot >>>>>>>>>>>>>>>>> IDs of the base tables at the time the materialized >>>>>>>>>>>>>>>>> view's data was last >>>>>>>>>>>>>>>>> updated. Each property is prefixed with >>>>>>>>>>>>>>>>> base.snapshot. followed by the UUID of the base >>>>>>>>>>>>>>>>> table. They are used to track whether the materialized >>>>>>>>>>>>>>>>> view's data is up to >>>>>>>>>>>>>>>>> date with the base tables by comparing these snapshot >>>>>>>>>>>>>>>>> IDs with the current >>>>>>>>>>>>>>>>> snapshot IDs of the base tables. If all the base >>>>>>>>>>>>>>>>> tables' current snapshot >>>>>>>>>>>>>>>>> IDs match the ones stored in these properties, the >>>>>>>>>>>>>>>>> materialized view's data >>>>>>>>>>>>>>>>> is considered fresh. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Walaa. 
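The freshness rule described for the base.snapshot.[UUID] table properties can be sketched roughly as follows. This is only an illustration under stated assumptions, not Iceberg library code: properties are modeled as a plain string-to-string mapping, the function names are hypothetical, and treating a storage table with no recorded base snapshots as stale is a choice made here, not something the proposal states.

```python
from typing import Dict, Mapping

# Prefix taken from the proposal above: base.snapshot.[UUID]
BASE_SNAPSHOT_PREFIX = "base.snapshot."


def recorded_base_snapshots(storage_table_properties: Mapping[str, str]) -> Dict[str, int]:
    """Return {base table UUID: snapshot ID} recorded on the storage table."""
    return {
        key[len(BASE_SNAPSHOT_PREFIX):]: int(value)
        for key, value in storage_table_properties.items()
        if key.startswith(BASE_SNAPSHOT_PREFIX)
    }


def is_fresh(
    storage_table_properties: Mapping[str, str],
    current_snapshot_ids: Mapping[str, int],
) -> bool:
    """Hypothetical helper: fresh only if every recorded snapshot ID is still current."""
    recorded = recorded_base_snapshots(storage_table_properties)
    if not recorded:
        # Assumption: nothing recorded means the MV was never refreshed.
        return False
    return all(
        current_snapshot_ids.get(table_uuid) == snapshot_id
        for table_uuid, snapshot_id in recorded.items()
    )


# Example with made-up UUIDs and snapshot IDs.
props = {
    "base.snapshot.uuid-a": "101",
    "base.snapshot.uuid-b": "202",
    "write.format.default": "parquet",  # unrelated property, ignored
}
fresh_now = is_fresh(props, {"uuid-a": 101, "uuid-b": 202})
fresh_after_base_commit = is_fresh(props, {"uuid-a": 101, "uuid-b": 203})
```

Note that an engine would still need to load each base table to obtain its current snapshot ID, which is the cost discussed elsewhere in this thread.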
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye < >>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > All of these approaches are aligned in one, specific >>>>>>>>>>>>>>>>>> way: the storage table is an iceberg table. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I do not think that is true. I think people are aligned >>>>>>>>>>>>>>>>>> that we would like to re-use the Iceberg table metadata >>>>>>>>>>>>>>>>>> defined in the >>>>>>>>>>>>>>>>>> Iceberg table spec to express the data in MV, but I don't >>>>>>>>>>>>>>>>>> think it goes >>>>>>>>>>>>>>>>>> that far to say it must be an Iceberg table. Once you have >>>>>>>>>>>>>>>>>> that mindset, >>>>>>>>>>>>>>>>>> then of course option 1 (separate table and view) is the >>>>>>>>>>>>>>>>>> only option. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > I don't think that is necessary and it >>>>>>>>>>>>>>>>>> significantly increases the complexity. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> And can you quantify what you mean by >>>>>>>>>>>>>>>>>> "significantly increases the complexity"? Seems like a lot >>>>>>>>>>>>>>>>>> of concerns are >>>>>>>>>>>>>>>>>> coming from the tradeoff with complexity. We probably all >>>>>>>>>>>>>>>>>> agree that using >>>>>>>>>>>>>>>>>> option 7 (a completely new metadata type) is a lot of work >>>>>>>>>>>>>>>>>> from scratch, >>>>>>>>>>>>>>>>>> that is why it is not favored. However, my understanding is >>>>>>>>>>>>>>>>>> that as long as >>>>>>>>>>>>>>>>>> we re-use the view and table metadata, then the majority of >>>>>>>>>>>>>>>>>> the existing >>>>>>>>>>>>>>>>>> logic can be reused. I think what we have gone through in >>>>>>>>>>>>>>>>>> Slack to draft >>>>>>>>>>>>>>>>>> the rough Java API shape helps here, because people can >>>>>>>>>>>>>>>>>> estimate the amount >>>>>>>>>>>>>>>>>> of effort required to implement it. And I don't think they >>>>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>>>> **significantly** more complex to implement. 
Could you >>>>>>>>>>>>>>>>>> elaborate more about >>>>>>>>>>>>>>>>>> the complexity that you imagine? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -Jack >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks < >>>>>>>>>>>>>>>>>> daniel.c.we...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I feel I've been most vocal about pushing back against >>>>>>>>>>>>>>>>>>> options 2+ (or Ryan's categories of combined table/view, or >>>>>>>>>>>>>>>>>>> new metadata >>>>>>>>>>>>>>>>>>> type), so I'll try to expand on my reasoning. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I understand the appeal of creating a design where we >>>>>>>>>>>>>>>>>>> encapsulate the view/storage from both a structural and >>>>>>>>>>>>>>>>>>> performance >>>>>>>>>>>>>>>>>>> standpoint, but I don't think that is necessary and it >>>>>>>>>>>>>>>>>>> significantly increases the complexity. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> All of these approaches are aligned in one, specific >>>>>>>>>>>>>>>>>>> way: the storage table is an iceberg table. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Because of this, all the behaviors and requirements >>>>>>>>>>>>>>>>>>> still apply to these tables. They need to be maintained >>>>>>>>>>>>>>>>>>> (snapshot cleanup, >>>>>>>>>>>>>>>>>>> orphan files), in cases need to be optimized (compaction, >>>>>>>>>>>>>>>>>>> manifest >>>>>>>>>>>>>>>>>>> rewrites), they need to be able to be inspected (this will >>>>>>>>>>>>>>>>>>> be even more >>>>>>>>>>>>>>>>>>> important with MV since staleness can produce different >>>>>>>>>>>>>>>>>>> results and >>>>>>>>>>>>>>>>>>> questions will arise about what state the storage table was >>>>>>>>>>>>>>>>>>> in). There may >>>>>>>>>>>>>>>>>>> be cases where the tables need to be managed directly. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Anywhere we deviate from the existing >>>>>>>>>>>>>>>>>>> constructs/commit/access for tables, we will ultimately >>>>>>>>>>>>>>>>>>> have to then >>>>>>>>>>>>>>>>>>> unwrap to re-expose the underlying Iceberg behavior. This >>>>>>>>>>>>>>>>>>> creates >>>>>>>>>>>>>>>>>>> unnecessary complexity in the library/API layer, which are >>>>>>>>>>>>>>>>>>> not the primary >>>>>>>>>>>>>>>>>>> interface users will have with materialized views where an >>>>>>>>>>>>>>>>>>> engine is almost >>>>>>>>>>>>>>>>>>> entirely necessary to interact with the dataset. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> As to the performance concerns around option 1, I think >>>>>>>>>>>>>>>>>>> we're overstating the downsides. It really comes down to >>>>>>>>>>>>>>>>>>> how many metadata >>>>>>>>>>>>>>>>>>> loads are necessary and evaluating freshness would likely >>>>>>>>>>>>>>>>>>> be the real >>>>>>>>>>>>>>>>>>> bottleneck as it involves potentially loading many tables. >>>>>>>>>>>>>>>>>>> All of the >>>>>>>>>>>>>>>>>>> options are on the same order of performance for the >>>>>>>>>>>>>>>>>>> metadata and table >>>>>>>>>>>>>>>>>>> loads. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> As to the visibility of tables and whether they're >>>>>>>>>>>>>>>>>>> registered in the catalog, I think registering in the >>>>>>>>>>>>>>>>>>> catalog is the right >>>>>>>>>>>>>>>>>>> approach so that the tables are still addressable for >>>>>>>>>>>>>>>>>>> maintenance/etc. The >>>>>>>>>>>>>>>>>>> visibility of the storage table is a catalog implementation >>>>>>>>>>>>>>>>>>> decision and >>>>>>>>>>>>>>>>>>> shouldn't be a requirement of the MV spec (I can see cases >>>>>>>>>>>>>>>>>>> for both and it >>>>>>>>>>>>>>>>>>> isn't necessary to dictate a behavior). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm still strongly in favor of Option 1 (separate table >>>>>>>>>>>>>>>>>>> and view) for these reasons. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -Dan >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye < >>>>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> > Jack, it sounds like you’re the proponent of a >>>>>>>>>>>>>>>>>>>> combined table and view (rather than a new metadata spec >>>>>>>>>>>>>>>>>>>> for a materialized >>>>>>>>>>>>>>>>>>>> view). What is the main motivation? It seems like you’re >>>>>>>>>>>>>>>>>>>> convinced of that >>>>>>>>>>>>>>>>>>>> approach, but I don’t understand the advantage it brings. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Sorry I have to make a Google Sheet to capture all the >>>>>>>>>>>>>>>>>>>> options we have discussed so far, I wanted to use the >>>>>>>>>>>>>>>>>>>> existing Google Doc, >>>>>>>>>>>>>>>>>>>> but it has really bad table/sheet support... >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I have listed all the options, with how they are >>>>>>>>>>>>>>>>>>>> implemented and some important considerations we have >>>>>>>>>>>>>>>>>>>> discussed so far. >>>>>>>>>>>>>>>>>>>> Note that: >>>>>>>>>>>>>>>>>>>> 1. This sheet currently excludes the lineage >>>>>>>>>>>>>>>>>>>> information, which we can discuss more later after the >>>>>>>>>>>>>>>>>>>> current topic is >>>>>>>>>>>>>>>>>>>> resolved. >>>>>>>>>>>>>>>>>>>> 2. I removed the considerations for REST integration >>>>>>>>>>>>>>>>>>>> since from the other thread we have clarified that they >>>>>>>>>>>>>>>>>>>> should be >>>>>>>>>>>>>>>>>>>> considered completely separately. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> *Why I come as a proponent of having a new MV object >>>>>>>>>>>>>>>>>>>> with table and view metadata file pointer* >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> In my sheet, there are 3 options that do not have major >>>>>>>>>>>>>>>>>>>> problems: >>>>>>>>>>>>>>>>>>>> Option 2: Add storage table metadata file pointer in >>>>>>>>>>>>>>>>>>>> view object >>>>>>>>>>>>>>>>>>>> Option 5: New MV object with table and view metadata >>>>>>>>>>>>>>>>>>>> file pointer >>>>>>>>>>>>>>>>>>>> Option 6: New MV spec with table and view metadata >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I originally excluded option 2 because I think it does >>>>>>>>>>>>>>>>>>>> not align with the REST spec, but after the other >>>>>>>>>>>>>>>>>>>> discussion thread about "Inconsistency >>>>>>>>>>>>>>>>>>>> between REST spec and table/view spec", I think my >>>>>>>>>>>>>>>>>>>> original concern no >>>>>>>>>>>>>>>>>>>> longer holds true so now I put it back. And based on >>>>>>>>>>>>>>>>>>>> my personal preference that MV is an independent object >>>>>>>>>>>>>>>>>>>> that should be >>>>>>>>>>>>>>>>>>>> separated from view and table, plus the fact that option 5 >>>>>>>>>>>>>>>>>>>> is probably less >>>>>>>>>>>>>>>>>>>> work than option 6 for implementation, that is how I come >>>>>>>>>>>>>>>>>>>> as a proponent of >>>>>>>>>>>>>>>>>>>> option 5 at this moment. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> *Regarding Ryan's evaluation framework * >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's >>>>>>>>>>>>>>>>>>>> evaluation framework. That framework categorization puts >>>>>>>>>>>>>>>>>>>> option 2, 3, 4, 5, >>>>>>>>>>>>>>>>>>>> 6 all under the same category of "A combination of a >>>>>>>>>>>>>>>>>>>> view and a table" and concludes that they don't have any >>>>>>>>>>>>>>>>>>>> advantage for the >>>>>>>>>>>>>>>>>>>> same set of reasons. 
But those reasons are not really >>>>>>>>>>>>>>>>>>>> convincing to me so >>>>>>>>>>>>>>>>>>>> let's talk about them in more detail. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view >>>>>>>>>>>>>>>>>>>> and table is advantageous" as "this would cause >>>>>>>>>>>>>>>>>>>> unnecessary dependence >>>>>>>>>>>>>>>>>>>> between the view and table in catalogs." What dependency >>>>>>>>>>>>>>>>>>>> exactly do you >>>>>>>>>>>>>>>>>>>> mean here? And why is that unnecessary, given there has to >>>>>>>>>>>>>>>>>>>> be some sort of >>>>>>>>>>>>>>>>>>>> dependency anyway unless we go with option 5 or 6? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> (2) You said "I guess there’s an argument that you >>>>>>>>>>>>>>>>>>>> could load both table and view metadata locations at the >>>>>>>>>>>>>>>>>>>> same time. That >>>>>>>>>>>>>>>>>>>> hardly seems worth the trouble". I disagree with that. >>>>>>>>>>>>>>>>>>>> Catalog interaction >>>>>>>>>>>>>>>>>>>> performance is critical to at least everyone working in >>>>>>>>>>>>>>>>>>>> EMR and Athena, and >>>>>>>>>>>>>>>>>>>> MV itself as an acceleration approach needs to be as fast >>>>>>>>>>>>>>>>>>>> as possible. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I have put 3 key operations in the doc that I think >>>>>>>>>>>>>>>>>>>> matters for MV during interactions with engine: >>>>>>>>>>>>>>>>>>>> 1. refreshes storage table >>>>>>>>>>>>>>>>>>>> 2. get the storage table of the MV >>>>>>>>>>>>>>>>>>>> 3. if stale, get the view SQL >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> And option 1 clearly falls short with 4 sequential >>>>>>>>>>>>>>>>>>>> steps required to load a storage table. You mentioned >>>>>>>>>>>>>>>>>>>> "recent issues with >>>>>>>>>>>>>>>>>>>> adding views to the JDBC catalog" in this topic, could you >>>>>>>>>>>>>>>>>>>> explain a bit >>>>>>>>>>>>>>>>>>>> more? 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> (3) You said "I also think that once we decide on >>>>>>>>>>>>>>>>>>>> structure, we can make it possible for REST catalog >>>>>>>>>>>>>>>>>>>> implementations to do >>>>>>>>>>>>>>>>>>>> smart things, in a way that doesn’t put additional >>>>>>>>>>>>>>>>>>>> requirements on the >>>>>>>>>>>>>>>>>>>> underlying catalog store." If REST is fully compatible >>>>>>>>>>>>>>>>>>>> with Iceberg spec >>>>>>>>>>>>>>>>>>>> then I have no problem with this statement. However, as we >>>>>>>>>>>>>>>>>>>> discussed in the >>>>>>>>>>>>>>>>>>>> other thread, it is not the case. In the current state, I >>>>>>>>>>>>>>>>>>>> think the >>>>>>>>>>>>>>>>>>>> sequence of action should be to evolve the Iceberg >>>>>>>>>>>>>>>>>>>> table/view spec (or add >>>>>>>>>>>>>>>>>>>> a MV spec) first, and then think about how REST can >>>>>>>>>>>>>>>>>>>> incorporate it or do >>>>>>>>>>>>>>>>>>>> smart things that are not Iceberg spec compliant. Do you >>>>>>>>>>>>>>>>>>>> agree with that? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> (4) You said the table identifier pointer "is a problem >>>>>>>>>>>>>>>>>>>> we need to solve generally because a materialized table >>>>>>>>>>>>>>>>>>>> needs to be able to >>>>>>>>>>>>>>>>>>>> track the upstream state of tables that were used". I >>>>>>>>>>>>>>>>>>>> don't think that is a >>>>>>>>>>>>>>>>>>>> reason to choose to use a table identifier pointer for a >>>>>>>>>>>>>>>>>>>> storage table. The >>>>>>>>>>>>>>>>>>>> issue is not about using a table identifier pointer. It is >>>>>>>>>>>>>>>>>>>> about exposing >>>>>>>>>>>>>>>>>>>> the storage table as a separate entity in the catalog, >>>>>>>>>>>>>>>>>>>> which is what people >>>>>>>>>>>>>>>>>>>> do not like and is already discussed in length in Jan's >>>>>>>>>>>>>>>>>>>> question 3 (also >>>>>>>>>>>>>>>>>>>> linked in the sheet). 
I agree with that statement, because >>>>>>>>>>>>>>>>>>>> without a REST >>>>>>>>>>>>>>>>>>>> implementation that can magically hide the storage table, >>>>>>>>>>>>>>>>>>>> this model adds >>>>>>>>>>>>>>>>>>>> additional burden regarding compliance and data governance >>>>>>>>>>>>>>>>>>>> for any other >>>>>>>>>>>>>>>>>>>> non-REST catalog implementations that are compliant to the >>>>>>>>>>>>>>>>>>>> Iceberg spec. >>>>>>>>>>>>>>>>>>>> Many mechanisms need to be built in a catalog to hide, >>>>>>>>>>>>>>>>>>>> protect, maintain, >>>>>>>>>>>>>>>>>>>> recycle the storage table, that can be avoided by using >>>>>>>>>>>>>>>>>>>> other approaches. I >>>>>>>>>>>>>>>>>>>> think we should reach a consensus about that and discuss >>>>>>>>>>>>>>>>>>>> further if you do >>>>>>>>>>>>>>>>>>>> not agree. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul >>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Ryan, we actually discussed your categories in >>>>>>>>>>>>>>>>>>>>> this question >>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>. 
>>>>>>>>>>>>>>>>>>>>> Where your categories correspond to the following designs: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> - Separate table and view => Design 1 >>>>>>>>>>>>>>>>>>>>> - Combination of view and table => Design 2 >>>>>>>>>>>>>>>>>>>>> - A new metadata type => Design 4 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Jan >>>>>>>>>>>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 >>>>>>>>>>>>>>>>>>>>> categories, so I’ll be more specific: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> - *Separate table and view*: this option is to >>>>>>>>>>>>>>>>>>>>> have the objects that we have today, with extra >>>>>>>>>>>>>>>>>>>>> metadata. Commit processes >>>>>>>>>>>>>>>>>>>>> are separate: committing to the table doesn’t alter >>>>>>>>>>>>>>>>>>>>> the view and committing >>>>>>>>>>>>>>>>>>>>> to the view doesn’t change the table. However, >>>>>>>>>>>>>>>>>>>>> changing the view can make >>>>>>>>>>>>>>>>>>>>> it so the table is no longer useful as a >>>>>>>>>>>>>>>>>>>>> materialization. >>>>>>>>>>>>>>>>>>>>> - *A combination of a view and a table*: in this >>>>>>>>>>>>>>>>>>>>> option, the table metadata and view metadata are the >>>>>>>>>>>>>>>>>>>>> same as the first >>>>>>>>>>>>>>>>>>>>> option. The difference is that the commit process >>>>>>>>>>>>>>>>>>>>> combines them, either by >>>>>>>>>>>>>>>>>>>>> embedding a table metadata location in view metadata >>>>>>>>>>>>>>>>>>>>> or by tracking both in >>>>>>>>>>>>>>>>>>>>> the same catalog reference. >>>>>>>>>>>>>>>>>>>>> - *A new metadata type*: this option is where we >>>>>>>>>>>>>>>>>>>>> define a new metadata object that has view attributes, >>>>>>>>>>>>>>>>>>>>> like SQL >>>>>>>>>>>>>>>>>>>>> representations, along with table attributes, like >>>>>>>>>>>>>>>>>>>>> partition specs and >>>>>>>>>>>>>>>>>>>>> snapshots. 
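To make the difference in commit behavior between the first two categories concrete, here is a toy model. It is a sketch under heavy assumptions: ToyCatalog, CatalogEntry, the names db.mv and db.mv_storage, and the metadata file names are all hypothetical, and the "combination" case is modeled only in its "both pointers in the same catalog reference" variant. The third category (a single new metadata object) is not modeled.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog reference; either pointer may be absent (hypothetical model)."""
    view_metadata: Optional[str] = None
    table_metadata: Optional[str] = None


class ToyCatalog:
    """In-memory stand-in for a catalog with compare-and-swap commits."""

    def __init__(self) -> None:
        self.entries: Dict[str, CatalogEntry] = {}

    def commit(self, name: str, expected: Optional[CatalogEntry], updated: CatalogEntry) -> None:
        # Reject the commit if the entry changed since it was read.
        if self.entries.get(name) != expected:
            raise RuntimeError(f"concurrent modification of {name}")
        self.entries[name] = updated


# Separate table and view: two independent catalog references.
separate = ToyCatalog()
separate.commit("db.mv", None, CatalogEntry(view_metadata="view-v1.json"))
separate.commit("db.mv_storage", None, CatalogEntry(table_metadata="table-v1.json"))
# A refresh commits only to the table; the view reference is untouched.
old = separate.entries["db.mv_storage"]
separate.commit("db.mv_storage", old, replace(old, table_metadata="table-v2.json"))

# Combination of a view and a table: one reference tracks both pointers,
# so every commit compares and swaps the pair together.
combined = ToyCatalog()
combined.commit("db.mv", None, CatalogEntry("view-v1.json", "table-v1.json"))
old = combined.entries["db.mv"]
combined.commit("db.mv", old, replace(old, table_metadata="table-v2.json"))
```

In the combined case a writer holding a stale copy of the pair is rejected even if it only wanted to change the view pointer, which is one way to read the "dependence between the view and table" concern raised in this thread.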
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hopefully this is clear because I think much of the >>>>>>>>>>>>>>>>>>>>> confusion is caused by different definitions. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The LoadTableResponse having optional >>>>>>>>>>>>>>>>>>>>> metadata-location field implies that the object in the >>>>>>>>>>>>>>>>>>>>> catalog no longer >>>>>>>>>>>>>>>>>>>>> needs to hold a metadata file pointer >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The REST protocol has not removed the requirement for >>>>>>>>>>>>>>>>>>>>> a metadata file, so I’m going to keep focused on the MV >>>>>>>>>>>>>>>>>>>>> design options. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> When we say a MV can be a “new metadata type”, it does >>>>>>>>>>>>>>>>>>>>> not mean it needs to define a completely brand new >>>>>>>>>>>>>>>>>>>>> structure of the >>>>>>>>>>>>>>>>>>>>> metadata content >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I’m making a distinction between separate metadata >>>>>>>>>>>>>>>>>>>>> files for the table and the view and a combined metadata >>>>>>>>>>>>>>>>>>>>> object, as above. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> We can define an “Iceberg MV” to be an object in a >>>>>>>>>>>>>>>>>>>>> catalog, which has 1 table metadata file pointer, and 1 >>>>>>>>>>>>>>>>>>>>> view metadata file >>>>>>>>>>>>>>>>>>>>> pointer >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> This is the option I am referring to as a “combination >>>>>>>>>>>>>>>>>>>>> of a view and a table”. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> So to review my initial email, I don’t see a reason >>>>>>>>>>>>>>>>>>>>> why a combined view and table is advantageous, either >>>>>>>>>>>>>>>>>>>>> implemented by having >>>>>>>>>>>>>>>>>>>>> a catalog reference with two metadata locations or >>>>>>>>>>>>>>>>>>>>> embedding a table >>>>>>>>>>>>>>>>>>>>> metadata location in view metadata. 
This would cause >>>>>>>>>>>>>>>>>>>>> unnecessary dependence >>>>>>>>>>>>>>>>>>>>> between the view and table in catalogs. I guess there’s >>>>>>>>>>>>>>>>>>>>> an argument that >>>>>>>>>>>>>>>>>>>>> you could load both table and view metadata locations at >>>>>>>>>>>>>>>>>>>>> the same time. >>>>>>>>>>>>>>>>>>>>> That hardly seems worth the trouble given the recent >>>>>>>>>>>>>>>>>>>>> issues with adding >>>>>>>>>>>>>>>>>>>>> views to the JDBC catalog. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I also think that once we decide on structure, we can >>>>>>>>>>>>>>>>>>>>> make it possible for REST catalog implementations to do >>>>>>>>>>>>>>>>>>>>> smart things, in a >>>>>>>>>>>>>>>>>>>>> way that doesn’t put additional requirements on the >>>>>>>>>>>>>>>>>>>>> underlying catalog >>>>>>>>>>>>>>>>>>>>> store. For instance, we could specify how to send >>>>>>>>>>>>>>>>>>>>> additional objects in a >>>>>>>>>>>>>>>>>>>>> LoadViewResult, in case the catalog wants to pre-fetch >>>>>>>>>>>>>>>>>>>>> table metadata. I >>>>>>>>>>>>>>>>>>>>> think these optimizations are a later addition, after we >>>>>>>>>>>>>>>>>>>>> define the >>>>>>>>>>>>>>>>>>>>> relationship between views and tables. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Jack, it sounds like you’re the proponent of a >>>>>>>>>>>>>>>>>>>>> combined table and view (rather than a new metadata spec >>>>>>>>>>>>>>>>>>>>> for a materialized >>>>>>>>>>>>>>>>>>>>> view). What is the main motivation? It seems like you’re >>>>>>>>>>>>>>>>>>>>> convinced of that >>>>>>>>>>>>>>>>>>>>> approach, but I don’t understand the advantage it brings. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho < >>>>>>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Hi >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Yes I mostly agree with the assessment. To clarify a >>>>>>>>>>>>>>>>>>>>>> few minor points. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> is a materialized view a view and a separate table, a >>>>>>>>>>>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or >>>>>>>>>>>>>>>>>>>>>>> a new metadata type? >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's >>>>>>>>>>>>>>>>>>>>>> initial proposal of a new Catalog MV object that has two >>>>>>>>>>>>>>>>>>>>>> references >>>>>>>>>>>>>>>>>>>>>> (ViewMetadata + TableMetadata). >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> The arguments that I see for a combined materialized >>>>>>>>>>>>>>>>>>>>>>> view object are: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Regular views are separate, rather than being >>>>>>>>>>>>>>>>>>>>>>> tables with SQL and no data so it would be >>>>>>>>>>>>>>>>>>>>>>> inconsistent (“Iceberg view is >>>>>>>>>>>>>>>>>>>>>>> just a table with no data but with representations >>>>>>>>>>>>>>>>>>>>>>> defined. But we did not >>>>>>>>>>>>>>>>>>>>>>> do that.”) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Tables may be a superset of functionality >>>>>>>>>>>>>>>>>>>>>>> needed for materialized views >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Tables are not typically exposed to end users >>>>>>>>>>>>>>>>>>>>>>> — but this isn’t required by the separate view and >>>>>>>>>>>>>>>>>>>>>>> table option >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> For completeness, there seem to be a few additional >>>>>>>>>>>>>>>>>>>>>> ones (mentioned in the Slack and above messages). >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> - Lack of spec change (to ViewMetadata). 
But as >>>>>>>>>>>>>>>>>>>>>> Jack says it is a spec change (i.e., to catalogs) >>>>>>>>>>>>>>>>>>>>>> - A single call to get the View's StorageTable >>>>>>>>>>>>>>>>>>>>>> (versus two calls) >>>>>>>>>>>>>>>>>>>>>> - A more natural API, no opportunity for user to >>>>>>>>>>>>>>>>>>>>>> call Catalog.dropTable() and renameTable() on storage >>>>>>>>>>>>>>>>>>>>>> table >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> *Thoughts: *I think the long discussion sessions we >>>>>>>>>>>>>>>>>>>>>> had on Slack were fruitful for me, as seeing the API >>>>>>>>>>>>>>>>>>>>>> clarified some things. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I was initially more in favor of MV being a new >>>>>>>>>>>>>>>>>>>>>> metadata type (TableMetadata + ViewMetadata). But >>>>>>>>>>>>>>>>>>>>>> seeing most of the MV >>>>>>>>>>>>>>>>>>>>>> operations end up being ViewCatalog or Catalog >>>>>>>>>>>>>>>>>>>>>> operations, I am starting to >>>>>>>>>>>>>>>>>>>>>> think API-wise that it may not align with the new >>>>>>>>>>>>>>>>>>>>>> metadata type (unless we >>>>>>>>>>>>>>>>>>>>>> define MVCatalog and /MV REST endpoints, which then are >>>>>>>>>>>>>>>>>>>>>> boilerplate >>>>>>>>>>>>>>>>>>>>>> wrappers). >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Initially one question I had for option 'a view and a >>>>>>>>>>>>>>>>>>>>>> separate table', was how to make this table reference >>>>>>>>>>>>>>>>>>>>>> (metadata.json or >>>>>>>>>>>>>>>>>>>>>> catalog reference). In the previous option, we had a >>>>>>>>>>>>>>>>>>>>>> precedent of Catalog >>>>>>>>>>>>>>>>>>>>>> references to Metadata, but not pointers between >>>>>>>>>>>>>>>>>>>>>> Metadatas. I initially >>>>>>>>>>>>>>>>>>>>>> saw the proposed Catalog's TableIdentifier pointer as >>>>>>>>>>>>>>>>>>>>>> 'polluting' catalog >>>>>>>>>>>>>>>>>>>>>> concerns in ViewMetadata. (I saw Catalog and >>>>>>>>>>>>>>>>>>>>>> ViewCatalog as a layer above >>>>>>>>>>>>>>>>>>>>>> TableMetadata and ViewMetadata). 
But I think Dan in the >>>>>>>>>>>>>>>>>>>>>> Slack made a fair >>>>>>>>>>>>>>>>>>>>>> point that ViewMetadata already is tightly bound with a >>>>>>>>>>>>>>>>>>>>>> Catalog. In this >>>>>>>>>>>>>>>>>>>>>> case, I think this approach does have its merits as well >>>>>>>>>>>>>>>>>>>>>> in aligning >>>>>>>>>>>>>>>>>>>>>> Catalog API's with the metadata. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>> Szehon >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I would like to provide my perspective on the >>>>>>>>>>>>>>>>>>>>>>> question of what a materialized view is and elaborate >>>>>>>>>>>>>>>>>>>>>>> on Jack's recent >>>>>>>>>>>>>>>>>>>>>>> proposal to view a materialized view as a catalog >>>>>>>>>>>>>>>>>>>>>>> concept. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Firstly, let's look at the role of the catalog. >>>>>>>>>>>>>>>>>>>>>>> Every entity in the catalog has a *unique >>>>>>>>>>>>>>>>>>>>>>> identifier*, and the catalog provides methods to >>>>>>>>>>>>>>>>>>>>>>> create, load, and update these entities. An important >>>>>>>>>>>>>>>>>>>>>>> thing to note is that >>>>>>>>>>>>>>>>>>>>>>> the catalog methods exhibit two different behaviors: >>>>>>>>>>>>>>>>>>>>>>> the *create >>>>>>>>>>>>>>>>>>>>>>> and load methods deal with the entire entity*, >>>>>>>>>>>>>>>>>>>>>>> while the *update(commit) method only deals with >>>>>>>>>>>>>>>>>>>>>>> partial changes* to the entities. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> In the context of our current discussion, >>>>>>>>>>>>>>>>>>>>>>> materialized view (MV) metadata is a union of view and >>>>>>>>>>>>>>>>>>>>>>> table metadata. 
The >>>>>>>>>>>>>>>>>>>>>>> fact that the update method deals only with partial >>>>>>>>>>>>>>>>>>>>>>> changes, enables us to *reuse >>>>>>>>>>>>>>>>>>>>>>> the existing methods for updating tables and views*. >>>>>>>>>>>>>>>>>>>>>>> For updates we don't have to define what constitutes an >>>>>>>>>>>>>>>>>>>>>>> entire materialized >>>>>>>>>>>>>>>>>>>>>>> view. Changes to a materialized view targeting the >>>>>>>>>>>>>>>>>>>>>>> properties related to >>>>>>>>>>>>>>>>>>>>>>> the view metadata could use the update(commit) view >>>>>>>>>>>>>>>>>>>>>>> method. Similarly, >>>>>>>>>>>>>>>>>>>>>>> changes targeting the properties related to the table >>>>>>>>>>>>>>>>>>>>>>> metadata could use >>>>>>>>>>>>>>>>>>>>>>> the update(commit) table method. This is great news >>>>>>>>>>>>>>>>>>>>>>> because we don't have >>>>>>>>>>>>>>>>>>>>>>> to redefine view and table commits (requirements, >>>>>>>>>>>>>>>>>>>>>>> updates). >>>>>>>>>>>>>>>>>>>>>>> This is shown in the fact that Jack uses the same >>>>>>>>>>>>>>>>>>>>>>> operation to update the storage table for Option 1 and >>>>>>>>>>>>>>>>>>>>>>> 3: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> // REST: POST >>>>>>>>>>>>>>>>>>>>>>> /namespaces/db1/tables/mv1?materializedView=true >>>>>>>>>>>>>>>>>>>>>>> // non-REST: update JSON files at >>>>>>>>>>>>>>>>>>>>>>> table_metadata_location >>>>>>>>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit(); >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> The open question is *whether the create and load >>>>>>>>>>>>>>>>>>>>>>> methods should treat the properties that constitute the >>>>>>>>>>>>>>>>>>>>>>> MV metadata as two >>>>>>>>>>>>>>>>>>>>>>> entities (View + Table) or one entity (new MV object)*. >>>>>>>>>>>>>>>>>>>>>>> This is all part of Jack's proposal, where Option 1 >>>>>>>>>>>>>>>>>>>>>>> proposes a new MV >>>>>>>>>>>>>>>>>>>>>>> object, and Option 3 proposes two separate entities. 
>>>>>>>>>>>>>>>>>>>>>>> The advantage of >>>>>>>>>>>>>>>>>>>>>>> Option 1 is that it doesn't require two operations to >>>>>>>>>>>>>>>>>>>>>>> load the metadata. On >>>>>>>>>>>>>>>>>>>>>>> the other hand, the advantage of Option 3 is that no >>>>>>>>>>>>>>>>>>>>>>> new operations or >>>>>>>>>>>>>>>>>>>>>>> catalogs have to be defined. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> In my opinion, defining a new representation for >>>>>>>>>>>>>>>>>>>>>>> materialized views (Option 1) is generally the cleaner >>>>>>>>>>>>>>>>>>>>>>> solution. However, I >>>>>>>>>>>>>>>>>>>>>>> see a path where we could first introduce Option 3 and >>>>>>>>>>>>>>>>>>>>>>> still have the >>>>>>>>>>>>>>>>>>>>>>> possibility to transition to Option 1 if needed. The >>>>>>>>>>>>>>>>>>>>>>> great thing about >>>>>>>>>>>>>>>>>>>>>>> Option 3 is that it only requires minor changes to the >>>>>>>>>>>>>>>>>>>>>>> current spec and is >>>>>>>>>>>>>>>>>>>>>>> mostly implementation detail. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Therefore, I would propose small additions to Jack's >>>>>>>>>>>>>>>>>>>>>>> Option 3 that only introduce changes to the spec that >>>>>>>>>>>>>>>>>>>>>>> are not specific to >>>>>>>>>>>>>>>>>>>>>>> materialized views. The idea is to introduce boolean >>>>>>>>>>>>>>>>>>>>>>> properties to be set >>>>>>>>>>>>>>>>>>>>>>> on the creation of the view and the storage table that >>>>>>>>>>>>>>>>>>>>>>> indicate that they >>>>>>>>>>>>>>>>>>>>>>> belong to a materialized view. The view property >>>>>>>>>>>>>>>>>>>>>>> "materialized" is set to >>>>>>>>>>>>>>>>>>>>>>> "true" for a MV and "false" for a regular view. And the >>>>>>>>>>>>>>>>>>>>>>> table property >>>>>>>>>>>>>>>>>>>>>>> "storage_table" is set to "true" for a storage table >>>>>>>>>>>>>>>>>>>>>>> and "false" for a >>>>>>>>>>>>>>>>>>>>>>> regular table. The absence of these properties >>>>>>>>>>>>>>>>>>>>>>> indicates a regular view or >>>>>>>>>>>>>>>>>>>>>>> table. 
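The property convention described above can be sketched in a few lines. This is only an illustration, assuming the properties are plain string key/value pairs; the class `MvPropertySketch` and its helper methods are hypothetical and are not part of the Iceberg API. Only the property names "materialized" and "storage_table" come from the proposal.

```java
import java.util.Map;

// Hypothetical helper, not part of the Iceberg API. It only illustrates the
// proposed convention that a missing property means "regular view/table".
final class MvPropertySketch {
  // Property names taken from the proposal in this thread.
  static final String MATERIALIZED = "materialized";
  static final String STORAGE_TABLE = "storage_table";

  private MvPropertySketch() {}

  // A view is treated as materialized only if "materialized" is set to "true".
  static boolean isMaterializedView(Map<String, String> viewProperties) {
    return Boolean.parseBoolean(viewProperties.getOrDefault(MATERIALIZED, "false"));
  }

  // A table is treated as a storage table only if "storage_table" is set to "true".
  static boolean isStorageTable(Map<String, String> tableProperties) {
    return Boolean.parseBoolean(tableProperties.getOrDefault(STORAGE_TABLE, "false"));
  }

  public static void main(String[] args) {
    System.out.println(isMaterializedView(Map.of(MATERIALIZED, "true")));
    System.out.println(isStorageTable(Map.of()));
  }
}
```

Defaulting to "false" when the key is absent mirrors the statement that the absence of these properties indicates a regular view or table, so existing views and tables would need no changes under this assumption.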
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We could then introduce a new requirement for views and tables called
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>