Re: Materialized view integration with REST spec

Manish Malhotra Sun, 24 Mar 2024 07:49:01 -0700

Thanks Walaa,
Option 1 seems to be the better one, and one of the primary reasons is
that it keeps things simple for the engine.


Regards,
Manish


On Sun, Mar 24, 2024 at 5:02 AM Renjie Liu <[email protected]> wrote:

> Hi, Walaa:
>
> Thanks for your summary. I lean toward option 1, because adopting a new
> spec and API would be a huge effort for engines.
>
> On Sun, Mar 24, 2024 at 8:49 AM Yufei Gu <[email protected]> wrote:
>
>> Thanks Walaa for the write-up. Option 1 looks good to me.
>> Yufei
>>
>>
>> On Sat, Mar 23, 2024 at 5:05 PM Walaa Eldin Moustafa <
>> [email protected]> wrote:
>>
>>> I have started the doc here
>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit>.
>>> I have given folks on this thread edit access and everyone else comment
>>> access. Feel free to suggest edits (preferred to direct edits, but feel
>>> free to directly edit), or add comments. Once we agree on the pros and cons
>>> (and possibly any missing details of either approach/design), we can move
>>> to the next step (e.g., already reaching consensus or voting).
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Fri, Mar 22, 2024 at 7:19 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>>
>>>> Yes, will share the doc tomorrow.
>>>>
>>>> On Fri, Mar 22, 2024 at 10:49 AM Szehon Ho <[email protected]>
>>>> wrote:
>>>>
>>>>> Sounds good to me, can you start a document then, and we can all
>>>>> contribute there?
>>>>>
>>>>> On Fri, Mar 22, 2024 at 10:47 AM Walaa Eldin Moustafa <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Let us list the pros and cons as originally planned. I can help as well
>>>>>> if needed. We can get started and have Jack chime in when he is back?
>>>>>>
>>>>>> On Fri, Mar 22, 2024 at 10:35 AM Szehon Ho <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> My understanding was that last time it was still unresolved, and the
>>>>>>> action item was on Jack and/or Jan to make a shorter document.  I think
>>>>>>> the debate has now boiled down to Ryan's three options:
>>>>>>>
>>>>>>>    1. separate table/view
>>>>>>>    2. combination of table/view tied together via commit
>>>>>>>    3. new metadata type
>>>>>>>
>>>>>>> with the first and third probably being the main contenders. My
>>>>>>> understanding was that we wanted a table of pros/cons between (1) and (3),
>>>>>>> presumably giving folks a chance to address the cons, before the next
>>>>>>> meeting.
>>>>>>>
>>>>>>> Jack (main proponent of option (3)) just went on paternity leave, so
>>>>>>> I'm not sure if there is someone from Amazon with enough context on
>>>>>>> Jack's thinking to continue that train of thought.  Otherwise maybe Jan
>>>>>>> can give it a shot?  Else, I will be out and can't make the next Iceberg
>>>>>>> sync, but can prepare one for the one after that, if needed.
>>>>>>>
>>>>>>> Re: the 'new' proposal, I'm not sure if we are ready for a formal one,
>>>>>>> given the deadlock between the two options, but I'm open to making a
>>>>>>> proposal based on one of the options above.  What do folks think?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Fri, Mar 22, 2024 at 3:15 AM Renjie Liu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Fri, Mar 22, 2024 at 16:42 Jean-Baptiste Onofré <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Renjie,
>>>>>>>>>
>>>>>>>>> We discussed the MV proposal, without yet reaching any conclusion.
>>>>>>>>>
>>>>>>>>> I propose:
>>>>>>>>> - to use the "new" proposal process in place (creating a GH issue
>>>>>>>>> with the proposal flag, with a link to the document)
>>>>>>>>> - use the document and/or GH issue to add comments
>>>>>>>>> - finalize the document, heading to a vote (to get consensus)
>>>>>>>>>
>>>>>>>>> Thoughts ?
>>>>>>>>>
>>>>>>>>> NB: I will follow up with "stale PR/proposal" PR to be sure we are
>>>>>>>>> moving forward ;)
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi:
>>>>>>>>>>
>>>>>>>>>> Sorry I couldn't join the last community sync. Did we
>>>>>>>>>> reach any conclusion about the MV spec?
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 5, 2024 at 11:28 PM himadri pal <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> For me the calendar link did not work on mobile, but I was able
>>>>>>>>>>> to add the dev Google calendar from
>>>>>>>>>>> https://iceberg.apache.org/community/#iceberg-community-events by
>>>>>>>>>>> accessing it from a laptop.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Himadri Pal
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Jack! I think the images are stripped from the message,
>>>>>>>>>>>> but they are there on the doc
>>>>>>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>>>>>>>> if someone wants to check them out (I have left some comments
>>>>>>>>>>>> while there).
>>>>>>>>>>>>
>>>>>>>>>>>> Also I no longer see the community sync calendar
>>>>>>>>>>>> https://iceberg.apache.org/community/#slack, so it is unclear
>>>>>>>>>>>> when the meeting is (and we do not have the link).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Jan! +1 for everyone to take a look before the
>>>>>>>>>>>>> discussion, and see if there are any missing options or major
>>>>>>>>>>>>> arguments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have also added images for all the options; they might be
>>>>>>>>>>>>> easier to parse than the big sheet. I will also put them here
>>>>>>>>>>>>> for people who do not have time to read through it:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Option 1: Add storage table identifier in view metadata
>>>>>>>>>>>>> content*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 1.png]
>>>>>>>>>>>>> *Option 2: Add storage table metadata file pointer in view
>>>>>>>>>>>>> object*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 2.png]
>>>>>>>>>>>>> *Option 3: Add storage table metadata file pointer in view
>>>>>>>>>>>>> metadata content*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 3.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Option 4: Embed table metadata in view metadata content*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 4.png]
>>>>>>>>>>>>> *Option 5: New MV spec, MV object has table and view metadata
>>>>>>>>>>>>> file pointers*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 5.png]
>>>>>>>>>>>>> *Option 6: New MV spec, MV metadata content embeds table and
>>>>>>>>>>>>> view metadata*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 6.png]
>>>>>>>>>>>>> *Option 7: New MV spec, completely new MV metadata content*
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: MV option 7.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think it's great to have a face to face discussion about
>>>>>>>>>>>>>> this. Additionally, I would propose to use Jack's document
>>>>>>>>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>>>>>>>>>> as a common ground for the discussion and that everyone has a
>>>>>>>>>>>>>> quick look before the next community sync. If you think the
>>>>>>>>>>>>>> document is still missing some arguments, please make
>>>>>>>>>>>>>> suggestions to add them. This way we have to spend less time
>>>>>>>>>>>>>> getting everyone up to speed and have a more common terminology.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to the discussion, best wishes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The calendar on the site is currently broken
>>>>>>>>>>>>>> https://iceberg.apache.org/community/#iceberg-community-events.
>>>>>>>>>>>>>> Might help to fix it or share the meeting link here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sounds good, let's discuss this in person!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am a bit worried that we have quite a few critical topics
>>>>>>>>>>>>>>> going on right now on the devlist, and this will take up a lot
>>>>>>>>>>>>>>> of time to discuss. If it ends up going on for too long, I
>>>>>>>>>>>>>>> propose we have a dedicated meeting, and I am more than happy
>>>>>>>>>>>>>>> to organize it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think this thread has hit a point of diminishing returns
>>>>>>>>>>>>>>>> and that we still don't have a common understanding of what
>>>>>>>>>>>>>>>> the options under consideration actually are.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since we were already planning on discussing this at the
>>>>>>>>>>>>>>>> next community sync, I suggest we pick this up there and use
>>>>>>>>>>>>>>>> that time to align on what exactly we're considering. We can
>>>>>>>>>>>>>>>> then start a new thread to lay out the designs under
>>>>>>>>>>>>>>>> consideration in more detail and then have a discussion about
>>>>>>>>>>>>>>>> trade-offs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does that sound reasonable?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am finding it hard to interpret the options concretely.
>>>>>>>>>>>>>>>>> I would also suggest breaking the expectation/outcome into
>>>>>>>>>>>>>>>>> milestones. Maybe it becomes easier if we agree to distinguish
>>>>>>>>>>>>>>>>> between an approach that is feasible in the near term and
>>>>>>>>>>>>>>>>> another in the long term, especially if the latter requires
>>>>>>>>>>>>>>>>> significant engine-side changes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Further, maybe it helps if we start with an option that
>>>>>>>>>>>>>>>>> fully reuses the existing spec, and see how we view it in
>>>>>>>>>>>>>>>>> comparison with the options discussed previously. I am sharing
>>>>>>>>>>>>>>>>> one below. It reuses the current spec of Iceberg views and
>>>>>>>>>>>>>>>>> tables by leveraging table properties to capture materialized
>>>>>>>>>>>>>>>>> view metadata. What is common (and not common) between this
>>>>>>>>>>>>>>>>> and the desired representations?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The new properties are:
>>>>>>>>>>>>>>>>> Properties on a View:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    1. *iceberg.materialized.view*:
>>>>>>>>>>>>>>>>>       - *Type*: View property
>>>>>>>>>>>>>>>>>       - *Purpose*: This property is used to mark whether
>>>>>>>>>>>>>>>>>       a view is a materialized view. If set to true, the
>>>>>>>>>>>>>>>>>       view is treated as a materialized view. This helps in
>>>>>>>>>>>>>>>>>       differentiating between virtual and materialized views
>>>>>>>>>>>>>>>>>       within the catalog and dictates specific handling and
>>>>>>>>>>>>>>>>>       validation logic for materialized views.
>>>>>>>>>>>>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>>>>>>>>>>>>       - *Type*: View property
>>>>>>>>>>>>>>>>>       - *Purpose*: Specifies the location of the storage
>>>>>>>>>>>>>>>>>       table associated with the materialized view. This
>>>>>>>>>>>>>>>>>       property is used for linking a materialized view with
>>>>>>>>>>>>>>>>>       its corresponding storage table, enabling data
>>>>>>>>>>>>>>>>>       management and query execution based on the stored
>>>>>>>>>>>>>>>>>       data freshness.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Properties on a Table:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>>>>>>>>>>>>       - *Type*: Table property
>>>>>>>>>>>>>>>>>       - *Purpose*: These properties store the snapshot
>>>>>>>>>>>>>>>>>       IDs of the base tables at the time the materialized
>>>>>>>>>>>>>>>>>       view's data was last updated. Each property is prefixed
>>>>>>>>>>>>>>>>>       with base.snapshot. followed by the UUID of the base
>>>>>>>>>>>>>>>>>       table. They are used to track whether the materialized
>>>>>>>>>>>>>>>>>       view's data is up to date with the base tables by
>>>>>>>>>>>>>>>>>       comparing these snapshot IDs with the current snapshot
>>>>>>>>>>>>>>>>>       IDs of the base tables. If all the base tables' current
>>>>>>>>>>>>>>>>>       snapshot IDs match the ones stored in these properties,
>>>>>>>>>>>>>>>>>       the materialized view's data is considered fresh.
>>>>>>>>>>>>>>>>>
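The freshness rule described for base.snapshot.[UUID] can be illustrated with a minimal Python sketch. It assumes the storage table's properties and the base tables' current snapshot IDs have already been loaded into plain dictionaries. The function names, the dictionary shapes, and the handling of a storage table with no recorded snapshots are assumptions made for illustration; they are not part of the proposal or of any Iceberg API.

```python
from typing import Dict, Mapping

# Property prefix taken from the proposal: base.snapshot.[UUID]
BASE_SNAPSHOT_PREFIX = "base.snapshot."


def recorded_base_snapshots(table_properties: Mapping[str, str]) -> Dict[str, int]:
    """Return {base table UUID: snapshot ID} recorded on the storage table."""
    return {
        key[len(BASE_SNAPSHOT_PREFIX):]: int(value)
        for key, value in table_properties.items()
        if key.startswith(BASE_SNAPSHOT_PREFIX)
    }


def is_fresh(table_properties: Mapping[str, str],
             current_snapshot_ids: Mapping[str, int]) -> bool:
    """Hypothetical helper: compare recorded snapshot IDs with current ones.

    current_snapshot_ids maps each base table UUID to its current snapshot ID.
    """
    recorded = recorded_base_snapshots(table_properties)
    if not recorded:
        # Assumption: with nothing recorded, freshness cannot be established.
        return False
    # Fresh only if every recorded base table is still at the recorded snapshot.
    return all(
        current_snapshot_ids.get(uuid) == snapshot_id
        for uuid, snapshot_id in recorded.items()
    )
```

An engine following this idea would call is_fresh before deciding whether to read the storage table or fall back to the view's SQL; loading the current snapshot ID of every base table is the part that requires the extra catalog calls discussed elsewhere in the thread.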
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > All of these approaches are aligned in one, specific
>>>>>>>>>>>>>>>>>> way: the storage table is an iceberg table.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I do not think that is true. I think people are aligned
>>>>>>>>>>>>>>>>>> that we would like to re-use the Iceberg table metadata
>>>>>>>>>>>>>>>>>> defined in the Iceberg table spec to express the data in MV,
>>>>>>>>>>>>>>>>>> but I don't think it goes so far as to say it must be an
>>>>>>>>>>>>>>>>>> Iceberg table. Once you have that mindset, then of course
>>>>>>>>>>>>>>>>>> option 1 (separate table and view) is the only option.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > I don't think that is necessary and it
>>>>>>>>>>>>>>>>>> significantly increases the complexity.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And can you quantify what you mean by
>>>>>>>>>>>>>>>>>> "significantly increases the complexity"? Seems like a lot
>>>>>>>>>>>>>>>>>> of concerns are coming from the tradeoff with complexity. We
>>>>>>>>>>>>>>>>>> probably all agree that using option 7 (a completely new
>>>>>>>>>>>>>>>>>> metadata type) is a lot of work from scratch, which is why it
>>>>>>>>>>>>>>>>>> is not favored. However, my understanding is that as long as
>>>>>>>>>>>>>>>>>> we re-use the view and table metadata, then the majority of
>>>>>>>>>>>>>>>>>> the existing logic can be reused. I think what we have gone
>>>>>>>>>>>>>>>>>> through in Slack to draft the rough Java API shape helps
>>>>>>>>>>>>>>>>>> here, because people can estimate the amount of effort
>>>>>>>>>>>>>>>>>> required to implement it. And I don't think they are
>>>>>>>>>>>>>>>>>> **significantly** more complex to implement. Could you
>>>>>>>>>>>>>>>>>> elaborate more about the complexity that you imagine?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I feel I've been most vocal about pushing back against
>>>>>>>>>>>>>>>>>>> options 2+ (or Ryan's categories of combined table/view, or
>>>>>>>>>>>>>>>>>>> new metadata type), so I'll try to expand on my reasoning.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I understand the appeal of creating a design where we
>>>>>>>>>>>>>>>>>>> encapsulate the view/storage from both a structural and
>>>>>>>>>>>>>>>>>>> performance standpoint, but I don't think that is necessary
>>>>>>>>>>>>>>>>>>> and it significantly increases the complexity.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> All of these approaches are aligned in one, specific
>>>>>>>>>>>>>>>>>>> way: the storage table is an iceberg table.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Because of this, all the behaviors and requirements
>>>>>>>>>>>>>>>>>>> still apply to these tables.  They need to be maintained
>>>>>>>>>>>>>>>>>>> (snapshot cleanup, orphan files), in some cases need to be
>>>>>>>>>>>>>>>>>>> optimized (compaction, manifest rewrites), and they need to
>>>>>>>>>>>>>>>>>>> be able to be inspected (this will be even more important
>>>>>>>>>>>>>>>>>>> with MV since staleness can produce different results and
>>>>>>>>>>>>>>>>>>> questions will arise about what state the storage table was
>>>>>>>>>>>>>>>>>>> in).  There may be cases where the tables need to be managed
>>>>>>>>>>>>>>>>>>> directly.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Anywhere we deviate from the existing
>>>>>>>>>>>>>>>>>>> constructs/commit/access for tables, we will ultimately
>>>>>>>>>>>>>>>>>>> have to then unwrap to re-expose the underlying Iceberg
>>>>>>>>>>>>>>>>>>> behavior.  This creates unnecessary complexity in the
>>>>>>>>>>>>>>>>>>> library/API layer, which is not the primary interface users
>>>>>>>>>>>>>>>>>>> will have with materialized views, where an engine is almost
>>>>>>>>>>>>>>>>>>> entirely necessary to interact with the dataset.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As to the performance concerns around option 1, I think
>>>>>>>>>>>>>>>>>>> we're overstating the downsides.  It really comes down to
>>>>>>>>>>>>>>>>>>> how many metadata loads are necessary, and evaluating
>>>>>>>>>>>>>>>>>>> freshness would likely be the real bottleneck as it involves
>>>>>>>>>>>>>>>>>>> potentially loading many tables.  All of the options are on
>>>>>>>>>>>>>>>>>>> the same order of performance for the metadata and table
>>>>>>>>>>>>>>>>>>> loads.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As to the visibility of tables and whether they're
>>>>>>>>>>>>>>>>>>> registered in the catalog, I think registering in the
>>>>>>>>>>>>>>>>>>> catalog is the right approach so that the tables are still
>>>>>>>>>>>>>>>>>>> addressable for maintenance/etc.  The visibility of the
>>>>>>>>>>>>>>>>>>> storage table is a catalog implementation decision and
>>>>>>>>>>>>>>>>>>> shouldn't be a requirement of the MV spec (I can see cases
>>>>>>>>>>>>>>>>>>> for both and it isn't necessary to dictate a behavior).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm still strongly in favor of Option 1 (separate table
>>>>>>>>>>>>>>>>>>> and view) for these reasons.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -Dan
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > Jack, it sounds like you’re the proponent of a
>>>>>>>>>>>>>>>>>>>> combined table and view (rather than a new metadata spec
>>>>>>>>>>>>>>>>>>>> for a materialized view). What is the main motivation? It
>>>>>>>>>>>>>>>>>>>> seems like you’re convinced of that approach, but I don’t
>>>>>>>>>>>>>>>>>>>> understand the advantage it brings.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the
>>>>>>>>>>>>>>>>>>>> options we have discussed so far. I wanted to use the
>>>>>>>>>>>>>>>>>>>> existing Google Doc, but it has really bad table/sheet
>>>>>>>>>>>>>>>>>>>> support...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have listed all the options, with how they are
>>>>>>>>>>>>>>>>>>>> implemented and some important considerations we have
>>>>>>>>>>>>>>>>>>>> discussed so far.
>>>>>>>>>>>>>>>>>>>> Note that:
>>>>>>>>>>>>>>>>>>>> 1. This sheet currently excludes the lineage
>>>>>>>>>>>>>>>>>>>> information, which we can discuss more later after the
>>>>>>>>>>>>>>>>>>>> current topic is resolved.
>>>>>>>>>>>>>>>>>>>> 2. I removed the considerations for REST integration
>>>>>>>>>>>>>>>>>>>> since from the other thread we have clarified that they
>>>>>>>>>>>>>>>>>>>> should be considered completely separately.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *Why I come as a proponent of having a new MV object
>>>>>>>>>>>>>>>>>>>> with table and view metadata file pointer*
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In my sheet, there are 3 options that do not have major
>>>>>>>>>>>>>>>>>>>> problems:
>>>>>>>>>>>>>>>>>>>> Option 2: Add storage table metadata file pointer in
>>>>>>>>>>>>>>>>>>>> view object
>>>>>>>>>>>>>>>>>>>> Option 5: New MV object with table and view metadata
>>>>>>>>>>>>>>>>>>>> file pointer
>>>>>>>>>>>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I originally excluded option 2 because I thought it did
>>>>>>>>>>>>>>>>>>>> not align with the REST spec, but after the other
>>>>>>>>>>>>>>>>>>>> discussion thread about "Inconsistency between REST spec
>>>>>>>>>>>>>>>>>>>> and table/view spec", I think my original concern no
>>>>>>>>>>>>>>>>>>>> longer holds true, so now I have put it back. And based on
>>>>>>>>>>>>>>>>>>>> my personal preference that MV is an independent object
>>>>>>>>>>>>>>>>>>>> that should be separated from view and table, plus the
>>>>>>>>>>>>>>>>>>>> fact that option 5 is probably less work than option 6 for
>>>>>>>>>>>>>>>>>>>> implementation, that is how I come as a proponent of
>>>>>>>>>>>>>>>>>>>> option 5 at this moment.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *Regarding Ryan's evaluation framework*
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's
>>>>>>>>>>>>>>>>>>>> evaluation framework. That framework's categorization puts
>>>>>>>>>>>>>>>>>>>> options 2, 3, 4, 5, 6 all under the same category of "A
>>>>>>>>>>>>>>>>>>>> combination of a view and a table" and concludes that they
>>>>>>>>>>>>>>>>>>>> don't have any advantage for the same set of reasons. But
>>>>>>>>>>>>>>>>>>>> those reasons are not really convincing to me so let's
>>>>>>>>>>>>>>>>>>>> talk about them in more detail.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view
>>>>>>>>>>>>>>>>>>>> and table is advantageous" as "this would cause
>>>>>>>>>>>>>>>>>>>> unnecessary dependence between the view and table in
>>>>>>>>>>>>>>>>>>>> catalogs."  What dependency exactly do you mean here? And
>>>>>>>>>>>>>>>>>>>> why is that unnecessary, given there has to be some sort
>>>>>>>>>>>>>>>>>>>> of dependency anyway unless we go with option 5 or 6?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (2) You said "I guess there’s an argument that you
>>>>>>>>>>>>>>>>>>>> could load both table and view metadata locations at the
>>>>>>>>>>>>>>>>>>>> same time. That hardly seems worth the trouble". I
>>>>>>>>>>>>>>>>>>>> disagree with that. Catalog interaction performance is
>>>>>>>>>>>>>>>>>>>> critical to at least everyone working in EMR and Athena,
>>>>>>>>>>>>>>>>>>>> and MV itself as an acceleration approach needs to be as
>>>>>>>>>>>>>>>>>>>> fast as possible.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have put 3 key operations in the doc that I think
>>>>>>>>>>>>>>>>>>>> matter for MV during interactions with the engine:
>>>>>>>>>>>>>>>>>>>> 1. refresh the storage table
>>>>>>>>>>>>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>>>>>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> And option 1 clearly falls short with 4 sequential
>>>>>>>>>>>>>>>>>>>> steps required to load a storage table. You mentioned
>>>>>>>>>>>>>>>>>>>> "recent issues with adding views to the JDBC catalog" in
>>>>>>>>>>>>>>>>>>>> this topic, could you explain a bit more?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (3) You said "I also think that once we decide on
>>>>>>>>>>>>>>>>>>>> structure, we can make it possible for REST catalog
>>>>>>>>>>>>>>>>>>>> implementations to do smart things, in a way that doesn’t
>>>>>>>>>>>>>>>>>>>> put additional requirements on the underlying catalog
>>>>>>>>>>>>>>>>>>>> store." If REST is fully compatible with the Iceberg spec
>>>>>>>>>>>>>>>>>>>> then I have no problem with this statement. However, as we
>>>>>>>>>>>>>>>>>>>> discussed in the other thread, it is not the case. In the
>>>>>>>>>>>>>>>>>>>> current state, I think the sequence of action should be to
>>>>>>>>>>>>>>>>>>>> evolve the Iceberg table/view spec (or add a MV spec)
>>>>>>>>>>>>>>>>>>>> first, and then think about how REST can incorporate it or
>>>>>>>>>>>>>>>>>>>> do smart things that are not Iceberg spec compliant. Do
>>>>>>>>>>>>>>>>>>>> you agree with that?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (4) You said the table identifier pointer "is a problem
>>>>>>>>>>>>>>>>>>>> we need to solve generally because a materialized table
>>>>>>>>>>>>>>>>>>>> needs to be able to track the upstream state of tables
>>>>>>>>>>>>>>>>>>>> that were used". I don't think that is a reason to choose
>>>>>>>>>>>>>>>>>>>> to use a table identifier pointer for a storage table. The
>>>>>>>>>>>>>>>>>>>> issue is not about using a table identifier pointer. It is
>>>>>>>>>>>>>>>>>>>> about exposing the storage table as a separate entity in
>>>>>>>>>>>>>>>>>>>> the catalog, which is what people do not like and is
>>>>>>>>>>>>>>>>>>>> already discussed at length in Jan's question 3 (also
>>>>>>>>>>>>>>>>>>>> linked in the sheet). I agree with that statement, because
>>>>>>>>>>>>>>>>>>>> without a REST implementation that can magically hide the
>>>>>>>>>>>>>>>>>>>> storage table, this model adds additional burden regarding
>>>>>>>>>>>>>>>>>>>> compliance and data governance for any other non-REST
>>>>>>>>>>>>>>>>>>>> catalog implementations that are compliant with the
>>>>>>>>>>>>>>>>>>>> Iceberg spec. Many mechanisms need to be built in a
>>>>>>>>>>>>>>>>>>>> catalog to hide, protect, maintain, and recycle the
>>>>>>>>>>>>>>>>>>>> storage table, which can be avoided by using other
>>>>>>>>>>>>>>>>>>>> approaches. I think we should reach a consensus about that
>>>>>>>>>>>>>>>>>>>> and discuss further if you do not agree.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Ryan, we actually discussed your categories in
>>>>>>>>>>>>>>>>>>>>> this question
>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>>>>>>>>>>>>>>>>>> where your categories correspond to the following designs:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>>>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>>>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3
>>>>>>>>>>>>>>>>>>>>> categories, so I’ll be more specific:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    - *Separate table and view*: this option is to
>>>>>>>>>>>>>>>>>>>>>    have the objects that we have today, with extra
>>>>>>>>>>>>>>>>>>>>>    metadata. Commit processes are separate: committing to
>>>>>>>>>>>>>>>>>>>>>    the table doesn’t alter the view and committing to the
>>>>>>>>>>>>>>>>>>>>>    view doesn’t change the table. However, changing the
>>>>>>>>>>>>>>>>>>>>>    view can make it so the table is no longer useful as a
>>>>>>>>>>>>>>>>>>>>>    materialization.
>>>>>>>>>>>>>>>>>>>>>    - *A combination of a view and a table*: in this
>>>>>>>>>>>>>>>>>>>>>    option, the table metadata and view metadata are the
>>>>>>>>>>>>>>>>>>>>>    same as the first option. The difference is that the
>>>>>>>>>>>>>>>>>>>>>    commit process combines them, either by embedding a
>>>>>>>>>>>>>>>>>>>>>    table metadata location in view metadata or by
>>>>>>>>>>>>>>>>>>>>>    tracking both in the same catalog reference.
>>>>>>>>>>>>>>>>>>>>>    - *A new metadata type*: this option is where we
>>>>>>>>>>>>>>>>>>>>>    define a new metadata object that has view attributes,
>>>>>>>>>>>>>>>>>>>>>    like SQL representations, along with table attributes,
>>>>>>>>>>>>>>>>>>>>>    like partition specs and snapshots.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hopefully this is clear because I think much of the
>>>>>>>>>>>>>>>>>>>>> confusion is caused by different definitions.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The LoadTableResponse having optional
>>>>>>>>>>>>>>>>>>>>> metadata-location field implies that the object in the
>>>>>>>>>>>>>>>>>>>>> catalog no longer needs to hold a metadata file pointer
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The REST protocol has not removed the requirement for
>>>>>>>>>>>>>>>>>>>>> a metadata file, so I’m going to keep focused on the MV
>>>>>>>>>>>>>>>>>>>>> design options.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> When we say a MV can be a “new metadata type”, it does
>>>>>>>>>>>>>>>>>>>>> not mean it needs to define a completely brand new
>>>>>>>>>>>>>>>>>>>>> structure of the metadata content
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I’m making a distinction between separate metadata
>>>>>>>>>>>>>>>>>>>>> files for the table and the view and a combined metadata
>>>>>>>>>>>>>>>>>>>>> object, as above.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We can define an “Iceberg MV” to be an object in a
>>>>>>>>>>>>>>>>>>>>> catalog, which has 1 table metadata file pointer, and 1
>>>>>>>>>>>>>>>>>>>>> view metadata file pointer
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is the option I am referring to as a “combination
>>>>>>>>>>>>>>>>>>>>> of a view and a table”.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So to review my initial email, I don’t see a reason
>>>>>>>>>>>>>>>>>>>>> why a combined view and table is advantageous, either
>>>>>>>>>>>>>>>>>>>>> implemented by having a catalog reference with two
>>>>>>>>>>>>>>>>>>>>> metadata locations or embedding a table metadata location
>>>>>>>>>>>>>>>>>>>>> in view metadata. This would cause unnecessary dependence
>>>>>>>>>>>>>>>>>>>>> between the view and table in catalogs. I guess there’s
>>>>>>>>>>>>>>>>>>>>> an argument that you could load both table and view
>>>>>>>>>>>>>>>>>>>>> metadata locations at the same time. That hardly seems
>>>>>>>>>>>>>>>>>>>>> worth the trouble given the recent issues with adding
>>>>>>>>>>>>>>>>>>>>> views to the JDBC catalog.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I also think that once we decide on structure, we can
>>>>>>>>>>>>>>>>>>>>> make it possible for REST catalog implementations to do
>>>>>>>>>>>>>>>>>>>>> smart things, in a way that doesn’t put additional
>>>>>>>>>>>>>>>>>>>>> requirements on the underlying catalog store. For
>>>>>>>>>>>>>>>>>>>>> instance, we could specify how to send additional objects
>>>>>>>>>>>>>>>>>>>>> in a LoadViewResult, in case the catalog wants to
>>>>>>>>>>>>>>>>>>>>> pre-fetch table metadata. I think these optimizations are
>>>>>>>>>>>>>>>>>>>>> a later addition, after we define the relationship
>>>>>>>>>>>>>>>>>>>>> between views and tables.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Jack, it sounds like you’re the proponent of a
>>>>>>>>>>>>>>>>>>>>> combined table and view (rather than a new metadata spec
>>>>>>>>>>>>>>>>>>>>> for a materialized view). What is the main motivation? It
>>>>>>>>>>>>>>>>>>>>> seems like you’re convinced of that approach, but I don’t
>>>>>>>>>>>>>>>>>>>>> understand the advantage it brings.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I mostly agree with the assessment.  To clarify a
>>>>>>>>>>>>>>>>>>>>>> few minor points:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or 
>>>>>>>>>>>>>>>>>>>>>>> a new metadata type?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's
>>>>>>>>>>>>>>>>>>>>>> initial proposal of a new Catalog MV object that has two 
>>>>>>>>>>>>>>>>>>>>>> references
>>>>>>>>>>>>>>>>>>>>>> (ViewMetadata + TableMetadata).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The arguments that I see for a combined materialized
>>>>>>>>>>>>>>>>>>>>>>> view object are:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>    - Regular views are separate, rather than being
>>>>>>>>>>>>>>>>>>>>>>>    tables with SQL and no data so it would be 
>>>>>>>>>>>>>>>>>>>>>>> inconsistent (“Iceberg view is
>>>>>>>>>>>>>>>>>>>>>>>    just a table with no data but with representations 
>>>>>>>>>>>>>>>>>>>>>>> defined. But we did not
>>>>>>>>>>>>>>>>>>>>>>>    do that.”)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>    - Tables may be a superset of functionality
>>>>>>>>>>>>>>>>>>>>>>>    needed for materialized views
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>    - Tables are not typically exposed to end users
>>>>>>>>>>>>>>>>>>>>>>>    — but this isn’t required by the separate view and 
>>>>>>>>>>>>>>>>>>>>>>> table option
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For completeness, there seem to be a few additional
>>>>>>>>>>>>>>>>>>>>>> ones (mentioned in the Slack and above messages).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as
>>>>>>>>>>>>>>>>>>>>>>    Jack says it is a spec change (i.e., to catalogs)
>>>>>>>>>>>>>>>>>>>>>>    - A single call to get the View's StorageTable
>>>>>>>>>>>>>>>>>>>>>>    (versus two calls)
>>>>>>>>>>>>>>>>>>>>>>    - A more natural API, with no opportunity for a user to
>>>>>>>>>>>>>>>>>>>>>>    call Catalog.dropTable() and renameTable() on the storage
>>>>>>>>>>>>>>>>>>>>>>    table
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> *Thoughts:* I think the long discussion sessions we
>>>>>>>>>>>>>>>>>>>>>> had on Slack were fruitful for me, as seeing the API 
>>>>>>>>>>>>>>>>>>>>>> clarified some things.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I was initially more in favor of MV being a new
>>>>>>>>>>>>>>>>>>>>>> metadata type (TableMetadata + ViewMetadata).  But 
>>>>>>>>>>>>>>>>>>>>>> seeing most of the MV
>>>>>>>>>>>>>>>>>>>>>> operations end up being ViewCatalog or Catalog 
>>>>>>>>>>>>>>>>>>>>>> operations, I am starting to
>>>>>>>>>>>>>>>>>>>>>> think API-wise that it may not align with the new 
>>>>>>>>>>>>>>>>>>>>>> metadata type (unless we
>>>>>>>>>>>>>>>>>>>>>> define MVCatalog and /MV REST endpoints, which then are 
>>>>>>>>>>>>>>>>>>>>>> boilerplate
>>>>>>>>>>>>>>>>>>>>>> wrappers).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Initially one question I had for option 'a view and a
>>>>>>>>>>>>>>>>>>>>>> separate table', was how to make this table reference 
>>>>>>>>>>>>>>>>>>>>>> (metadata.json or
>>>>>>>>>>>>>>>>>>>>>> catalog reference).  In the previous option, we had a 
>>>>>>>>>>>>>>>>>>>>>> precedent of Catalog
>>>>>>>>>>>>>>>>>>>>>> references to Metadata, but not pointers between 
>>>>>>>>>>>>>>>>>>>>>> Metadatas.  I initially
>>>>>>>>>>>>>>>>>>>>>> saw the proposed Catalog's TableIdentifier pointer as 
>>>>>>>>>>>>>>>>>>>>>> 'polluting' catalog
>>>>>>>>>>>>>>>>>>>>>> concerns in ViewMetadata.  (I saw Catalog and 
>>>>>>>>>>>>>>>>>>>>>> ViewCatalog as a layer above
>>>>>>>>>>>>>>>>>>>>>> TableMetadata and ViewMetadata).  But I think Dan in the 
>>>>>>>>>>>>>>>>>>>>>> Slack made a fair
>>>>>>>>>>>>>>>>>>>>>> point that ViewMetadata already is tightly bound with a 
>>>>>>>>>>>>>>>>>>>>>> Catalog.  In this
>>>>>>>>>>>>>>>>>>>>>> case, I think this approach does have its merits as well 
>>>>>>>>>>>>>>>>>>>>>> in aligning
>>>>>>>>>>>>>>>>>>>>>> Catalog APIs with the metadata.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would like to provide my perspective on the
>>>>>>>>>>>>>>>>>>>>>>> question of what a materialized view is and elaborate 
>>>>>>>>>>>>>>>>>>>>>>> on Jack's recent
>>>>>>>>>>>>>>>>>>>>>>> proposal to view a materialized view as a catalog 
>>>>>>>>>>>>>>>>>>>>>>> concept.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Firstly, let's look at the role of the catalog.
>>>>>>>>>>>>>>>>>>>>>>> Every entity in the catalog has a *unique
>>>>>>>>>>>>>>>>>>>>>>> identifier*, and the catalog provides methods to
>>>>>>>>>>>>>>>>>>>>>>> create, load, and update these entities. An important 
>>>>>>>>>>>>>>>>>>>>>>> thing to note is that
>>>>>>>>>>>>>>>>>>>>>>> the catalog methods exhibit two different behaviors: 
>>>>>>>>>>>>>>>>>>>>>>> the *create
>>>>>>>>>>>>>>>>>>>>>>> and load methods deal with the entire entity*,
>>>>>>>>>>>>>>>>>>>>>>> while the *update(commit) method only deals with
>>>>>>>>>>>>>>>>>>>>>>> partial changes* to the entities.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In the context of our current discussion,
>>>>>>>>>>>>>>>>>>>>>>> materialized view (MV) metadata is a union of view and 
>>>>>>>>>>>>>>>>>>>>>>> table metadata. The
>>>>>>>>>>>>>>>>>>>>>>> fact that the update method deals only with partial 
>>>>>>>>>>>>>>>>>>>>>>> changes enables us to *reuse
>>>>>>>>>>>>>>>>>>>>>>> the existing methods for updating tables and views*.
>>>>>>>>>>>>>>>>>>>>>>> For updates we don't have to define what constitutes an 
>>>>>>>>>>>>>>>>>>>>>>> entire materialized
>>>>>>>>>>>>>>>>>>>>>>> view. Changes to a materialized view targeting the 
>>>>>>>>>>>>>>>>>>>>>>> properties related to
>>>>>>>>>>>>>>>>>>>>>>> the view metadata could use the update(commit) view 
>>>>>>>>>>>>>>>>>>>>>>> method. Similarly,
>>>>>>>>>>>>>>>>>>>>>>> changes targeting the properties related to the table 
>>>>>>>>>>>>>>>>>>>>>>> metadata could use
>>>>>>>>>>>>>>>>>>>>>>> the update(commit) table method. This is great news 
>>>>>>>>>>>>>>>>>>>>>>> because we don't have
>>>>>>>>>>>>>>>>>>>>>>> to redefine view and table commits (requirements, 
>>>>>>>>>>>>>>>>>>>>>>> updates).
>>>>>>>>>>>>>>>>>>>>>>> This is shown in the fact that Jack uses the same
>>>>>>>>>>>>>>>>>>>>>>> operation to update the storage table for Option 1 and 
>>>>>>>>>>>>>>>>>>>>>>> 3:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The open question is *whether the create and load
>>>>>>>>>>>>>>>>>>>>>>> methods should treat the properties that constitute the 
>>>>>>>>>>>>>>>>>>>>>>> MV metadata as two
>>>>>>>>>>>>>>>>>>>>>>> entities (View + Table) or one entity (new MV object)*.
>>>>>>>>>>>>>>>>>>>>>>> This is all part of Jack's proposal, where Option 1 
>>>>>>>>>>>>>>>>>>>>>>> proposes a new MV
>>>>>>>>>>>>>>>>>>>>>>> object, and Option 3 proposes two separate entities. 
>>>>>>>>>>>>>>>>>>>>>>> The advantage of
>>>>>>>>>>>>>>>>>>>>>>> Option 1 is that it doesn't require two operations to 
>>>>>>>>>>>>>>>>>>>>>>> load the metadata. On
>>>>>>>>>>>>>>>>>>>>>>> the other hand, the advantage of Option 3 is that no 
>>>>>>>>>>>>>>>>>>>>>>> new operations or
>>>>>>>>>>>>>>>>>>>>>>> catalogs have to be defined.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In my opinion, defining a new representation for
>>>>>>>>>>>>>>>>>>>>>>> materialized views (Option 1) is generally the cleaner 
>>>>>>>>>>>>>>>>>>>>>>> solution. However, I
>>>>>>>>>>>>>>>>>>>>>>> see a path where we could first introduce Option 3 and 
>>>>>>>>>>>>>>>>>>>>>>> still have the
>>>>>>>>>>>>>>>>>>>>>>> possibility to transition to Option 1 if needed. The 
>>>>>>>>>>>>>>>>>>>>>>> great thing about
>>>>>>>>>>>>>>>>>>>>>>> Option 3 is that it only requires minor changes to the 
>>>>>>>>>>>>>>>>>>>>>>> current spec and is
>>>>>>>>>>>>>>>>>>>>>>> mostly implementation detail.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Therefore I would propose small additions to Jack's
>>>>>>>>>>>>>>>>>>>>>>> Option 3 that only introduce changes to the spec that 
>>>>>>>>>>>>>>>>>>>>>>> are not specific to
>>>>>>>>>>>>>>>>>>>>>>> materialized views. The idea is to introduce boolean 
>>>>>>>>>>>>>>>>>>>>>>> properties to be set
>>>>>>>>>>>>>>>>>>>>>>> on the creation of the view and the storage table that 
>>>>>>>>>>>>>>>>>>>>>>> indicate that they
>>>>>>>>>>>>>>>>>>>>>>> belong to a materialized view. The view property 
>>>>>>>>>>>>>>>>>>>>>>> "materialized" is set to
>>>>>>>>>>>>>>>>>>>>>>> "true" for a MV and "false" for a regular view. And the 
>>>>>>>>>>>>>>>>>>>>>>> table property
>>>>>>>>>>>>>>>>>>>>>>> "storage_table" is set to "true" for a storage table 
>>>>>>>>>>>>>>>>>>>>>>> and "false" for a
>>>>>>>>>>>>>>>>>>>>>>> regular table. The absence of these properties 
>>>>>>>>>>>>>>>>>>>>>>> indicates a regular view or
>>>>>>>>>>>>>>>>>>>>>>> table.
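
A minimal sketch of the property convention described above, assuming an engine only has the entity's properties as a string map. The keys "materialized" and "storage_table" are taken from the message; the class and method names are hypothetical and are not part of any Iceberg API. A missing key is treated as "false", following the rule that absence indicates a regular view or table.

```java
import java.util.Map;

// Hypothetical helper, for illustration only; not part of the Iceberg API.
final class MvProperties {
    // Property keys as proposed in the message above.
    static final String MATERIALIZED = "materialized";
    static final String STORAGE_TABLE = "storage_table";

    private MvProperties() {}

    // True only when the view carries materialized=true.
    // Boolean.parseBoolean(null) is false, so an absent key means "regular view".
    static boolean isMaterializedView(Map<String, String> viewProperties) {
        return Boolean.parseBoolean(viewProperties.get(MATERIALIZED));
    }

    // True only when the table carries storage_table=true.
    // An absent key means "regular table".
    static boolean isStorageTable(Map<String, String> tableProperties) {
        return Boolean.parseBoolean(tableProperties.get(STORAGE_TABLE));
    }

    public static void main(String[] args) {
        System.out.println(isMaterializedView(Map.of(MATERIALIZED, "true")));
        System.out.println(isStorageTable(Map.of()));
    }
}
```

Because both checks read ordinary properties, an engine that knows nothing about materialized views would still load the view and the table as regular entities.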
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>>>>>>>>>>>> View mv =
>>>>>>>>>>>>>>>>>>>>>>> viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We could then introduce a new requirement for views
>>>>>>>>>>>>>>>>>>>>>>> and tables called
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
