Re: Materialized view integration with REST spec

Renjie Liu Fri, 22 Mar 2024 03:15:14 -0700

+1

On Fri, Mar 22, 2024 at 16:42 Jean-Baptiste Onofré <[email protected]> wrote:


> Hi Renjie,
>
> We discussed the MV proposal, without yet reaching any conclusion.
>
> I propose:
> - to use the "new" proposal process in place (creating a GH issue with
> proposal flag, with link to the document)
> - use the document and/or GH issue to add comments
> - finalize the document heading to a vote (to get consensus)
>
> Thoughts ?
>
> NB: I will follow up with "stale PR/proposal" PR to be sure we are moving
> forward ;)
>
> Regards
> JB
>
> On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu <[email protected]>
> wrote:
>
>> Hi:
>>
>> Sorry I couldn't make it to the last community sync. Did we reach any
>> conclusion about the MV spec?
>>
>> On Tue, Mar 5, 2024 at 11:28 PM himadri pal <[email protected]> wrote:
>>
>>> For me the calendar link did not work on mobile, but I was able to add
>>> the dev Google calendar from
>>> https://iceberg.apache.org/community/#iceberg-community-events by
>>> accessing it from a laptop.
>>>
>>> Regards,
>>> Himadri Pal
>>>
>>>
>>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>>
>>>> Thanks Jack! I think the images are stripped from the message, but they
>>>> are there on the doc
>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>> if someone wants to check them out (I have left some comments while
>>>> there).
>>>>
>>>> Also I no longer see the community sync calendar
>>>> https://iceberg.apache.org/community/#slack, so it is unclear when the
>>>> meeting is (and we do not have the link).
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <[email protected]> wrote:
>>>>
>>>>> Thanks Jan! +1 for everyone to take a look before the discussion, and
>>>>> see if there are any missing options or major arguments.
>>>>>
>>>>> I have also added images for all the options, which might be easier
>>>>> to parse than the big sheet. I will also put them here for people
>>>>> that do not have time to read through it:
>>>>>
>>>>>
>>>>> *Option 1: Add storage table identifier in view metadata content*
>>>>>
>>>>> [image: MV option 1.png]
>>>>> *Option 2: Add storage table metadata file pointer in view object*
>>>>>
>>>>> [image: MV option 2.png]
>>>>> *Option 3: Add storage table metadata file pointer in view metadata
>>>>> content*
>>>>>
>>>>> [image: MV option 3.png]
>>>>>
>>>>> *Option 4: Embed table metadata in view metadata content*
>>>>>
>>>>> [image: MV option 4.png]
>>>>> *Option 5: New MV spec, MV object has table and view metadata file
>>>>> pointers*
>>>>>
>>>>> [image: MV option 5.png]
>>>>> *Option 6: New MV spec, MV metadata content embeds table and view
>>>>> metadata*
>>>>>
>>>>> [image: MV option 6.png]
>>>>> *Option 7: New MV spec, completely new MV metadata content*
>>>>>
>>>>> [image: MV option 7.png]
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I think it's great to have a face to face discussion about this.
>>>>>> Additionally, I would propose to use Jack's document
>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>> as common ground for the discussion, and that everyone has a quick
>>>>>> look before the next community sync. If you think the document is
>>>>>> still missing some arguments, please make suggestions to add them.
>>>>>> This way we have to spend less time getting everyone up to speed and
>>>>>> have a more common terminology.
>>>>>>
>>>>>> Looking forward to the discussion, best wishes
>>>>>>
>>>>>> Jan
>>>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>>>
>>>>>> The calendar on the site is currently broken
>>>>>> https://iceberg.apache.org/community/#iceberg-community-events.
>>>>>> Might help to fix it or share the meeting link here.
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <[email protected]> wrote:
>>>>>>
>>>>>>> Sounds good, let's discuss this in person!
>>>>>>>
>>>>>>> I am a bit worried that we have quite a few critical topics going on
>>>>>>> right now on devlist, and this will take up a lot of time to discuss.
>>>>>>> If it ends up going for too long, I propose we have a dedicated
>>>>>>> meeting, and I am more than happy to organize it.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> I think this thread has hit a point of diminishing returns and that
>>>>>>>> we still don't have a common understanding of what the options under
>>>>>>>> consideration actually are.
>>>>>>>>
>>>>>>>> Since we were already planning on discussing this at the next
>>>>>>>> community sync, I suggest we pick this up there and use that time to
>>>>>>>> align on what exactly we're considering. We can then start a new
>>>>>>>> thread to lay out the designs under consideration in more detail and
>>>>>>>> then have a discussion about trade-offs.
>>>>>>>>
>>>>>>>> Does that sound reasonable?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>>>>> also suggest breaking the expectation/outcome into milestones. Maybe
>>>>>>>>> it becomes easier if we agree to distinguish between an approach
>>>>>>>>> that is feasible in the near term and another in the long term,
>>>>>>>>> especially if the latter requires significant engine-side changes.
>>>>>>>>>
>>>>>>>>> Further, maybe it helps if we start with an option that fully
>>>>>>>>> reuses the existing spec, and see how we view it in comparison with
>>>>>>>>> the options discussed previously. I am sharing one below. It reuses
>>>>>>>>> the current spec of Iceberg views and tables by leveraging table
>>>>>>>>> properties to capture materialized view metadata. What is common
>>>>>>>>> (and not common) between this and the desired representations?
>>>>>>>>>
>>>>>>>>> The new properties are:
>>>>>>>>>
>>>>>>>>> Properties on a View:
>>>>>>>>>
>>>>>>>>>    1. *iceberg.materialized.view*:
>>>>>>>>>       - *Type*: View property
>>>>>>>>>       - *Purpose*: This property is used to mark whether a view
>>>>>>>>>       is a materialized view. If set to true, the view is treated
>>>>>>>>>       as a materialized view. This helps in differentiating between
>>>>>>>>>       virtual and materialized views within the catalog and dictates
>>>>>>>>>       specific handling and validation logic for materialized views.
>>>>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>>>>       - *Type*: View property
>>>>>>>>>       - *Purpose*: Specifies the location of the storage table
>>>>>>>>>       associated with the materialized view. This property is used
>>>>>>>>>       for linking a materialized view with its corresponding storage
>>>>>>>>>       table, enabling data management and query execution based on
>>>>>>>>>       the stored data freshness.
>>>>>>>>>
>>>>>>>>> Properties on a Table:
>>>>>>>>>
>>>>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>>>>       - *Type*: Table property
>>>>>>>>>       - *Purpose*: These properties store the snapshot IDs of the
>>>>>>>>>       base tables at the time the materialized view's data was last
>>>>>>>>>       updated. Each property is prefixed with base.snapshot. followed
>>>>>>>>>       by the UUID of the base table. They are used to track whether
>>>>>>>>>       the materialized view's data is up to date with the base tables
>>>>>>>>>       by comparing these snapshot IDs with the current snapshot IDs
>>>>>>>>>       of the base tables. If all the base tables' current snapshot
>>>>>>>>>       IDs match the ones stored in these properties, the materialized
>>>>>>>>>       view's data is considered fresh.
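
A minimal sketch of the freshness check described above, written against
plain Java maps rather than the Iceberg API. The base.snapshot. prefix comes
from the proposal; the class name, the method name, the map of current
base-table snapshot IDs, and the choice to treat a storage table with no
base.snapshot.* properties as stale are all assumptions made for
illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg.
final class MvFreshness {
    // Prefix from the proposal; the rest of the key is the base table UUID.
    static final String PREFIX = "base.snapshot.";

    /**
     * storageTableProps: properties of the storage table.
     * currentSnapshotIds: assumed lookup of base table UUID to its current snapshot ID.
     * Returns true only if every recorded base table still has the recorded snapshot ID.
     */
    static boolean isFresh(Map<String, String> storageTableProps,
                           Map<String, Long> currentSnapshotIds) {
        boolean sawBaseTable = false;
        for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
            String key = entry.getKey();
            if (!key.startsWith(PREFIX)) {
                continue; // unrelated table property
            }
            sawBaseTable = true;
            String baseTableUuid = key.substring(PREFIX.length());
            Long current = currentSnapshotIds.get(baseTableUuid);
            // A missing base table or a different snapshot ID means the data is stale.
            if (current == null || current.longValue() != Long.parseLong(entry.getValue())) {
                return false;
            }
        }
        // Assumption: with no recorded base snapshots, the MV was never refreshed.
        return sawBaseTable;
    }
}
```

An engine could call this with the storage table's properties and the
snapshot IDs it has resolved for each base table, and treat a false result
as "stale".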
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > All of these approaches are aligned in one, specific way: the
>>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>>
>>>>>>>>>> I do not think that is true. I think people are aligned that we
>>>>>>>>>> would like to re-use the Iceberg table metadata defined in the
>>>>>>>>>> Iceberg table spec to express the data in MV, but I don't think it
>>>>>>>>>> goes so far as to say it must be an Iceberg table. Once you have
>>>>>>>>>> that mindset, then of course option 1 (separate table and view) is
>>>>>>>>>> the only option.
>>>>>>>>>>
>>>>>>>>>> > I don't think that is necessary and it significantly increases
>>>>>>>>>> the complexity.
>>>>>>>>>>
>>>>>>>>>> And can you quantify what you mean by "significantly increases
>>>>>>>>>> the complexity"? Seems like a lot of concerns are coming from the
>>>>>>>>>> tradeoff with complexity. We probably all agree that using option 7
>>>>>>>>>> (a completely new metadata type) is a lot of work from scratch,
>>>>>>>>>> which is why it is not favored. However, my understanding is that
>>>>>>>>>> as long as we re-use the view and table metadata, the majority of
>>>>>>>>>> the existing logic can be reused. I think what we have gone through
>>>>>>>>>> in Slack to draft the rough Java API shape helps here, because
>>>>>>>>>> people can estimate the amount of effort required to implement it.
>>>>>>>>>> And I don't think they are **significantly** more complex to
>>>>>>>>>> implement. Could you elaborate more about the complexity that you
>>>>>>>>>> imagine?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I feel I've been most vocal about pushing back against options
>>>>>>>>>>> 2+ (or Ryan's categories of combined table/view, or new metadata
>>>>>>>>>>> type), so I'll try to expand on my reasoning.
>>>>>>>>>>>
>>>>>>>>>>> I understand the appeal of creating a design where we
>>>>>>>>>>> encapsulate the view/storage from both a structural and performance
>>>>>>>>>>> standpoint, but I don't think that is necessary and it
>>>>>>>>>>> significantly increases the complexity.
>>>>>>>>>>>
>>>>>>>>>>> All of these approaches are aligned in one, specific way: the
>>>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>>>
>>>>>>>>>>> Because of this, all the behaviors and requirements still apply
>>>>>>>>>>> to these tables.  They need to be maintained (snapshot cleanup,
>>>>>>>>>>> orphan files), in some cases need to be optimized (compaction,
>>>>>>>>>>> manifest rewrites), and they need to be able to be inspected (this
>>>>>>>>>>> will be even more important with MV since staleness can produce
>>>>>>>>>>> different results and questions will arise about what state the
>>>>>>>>>>> storage table was in).  There may be cases where the tables need
>>>>>>>>>>> to be managed directly.
>>>>>>>>>>>
>>>>>>>>>>> Anywhere we deviate from the existing constructs/commit/access
>>>>>>>>>>> for tables, we will ultimately have to then unwrap to re-expose the
>>>>>>>>>>> underlying Iceberg behavior.  This creates unnecessary complexity
>>>>>>>>>>> in the library/API layer, which is not the primary interface users
>>>>>>>>>>> will have with materialized views, where an engine is almost
>>>>>>>>>>> entirely necessary to interact with the dataset.
>>>>>>>>>>>
>>>>>>>>>>> As to the performance concerns around option 1, I think we're
>>>>>>>>>>> overstating the downsides.  It really comes down to how many
>>>>>>>>>>> metadata loads are necessary, and evaluating freshness would likely
>>>>>>>>>>> be the real bottleneck as it involves potentially loading many
>>>>>>>>>>> tables.  All of the options are on the same order of performance
>>>>>>>>>>> for the metadata and table loads.
>>>>>>>>>>>
>>>>>>>>>>> As to the visibility of tables and whether they're registered in
>>>>>>>>>>> the catalog, I think registering in the catalog is the right
>>>>>>>>>>> approach so that the tables are still addressable for
>>>>>>>>>>> maintenance/etc.  The visibility of the storage table is a catalog
>>>>>>>>>>> implementation decision and shouldn't be a requirement of the MV
>>>>>>>>>>> spec (I can see cases for both and it isn't necessary to dictate a
>>>>>>>>>>> behavior).
>>>>>>>>>>>
>>>>>>>>>>> I'm still strongly in favor of Option 1 (separate table and
>>>>>>>>>>> view) for these reasons.
>>>>>>>>>>>
>>>>>>>>>>> -Dan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>>>> and view (rather than a new metadata spec for a materialized
>>>>>>>>>>>> view). What is the main motivation? It seems like you’re convinced
>>>>>>>>>>>> of that approach, but I don’t understand the advantage it brings.
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the options
>>>>>>>>>>>> we have discussed so far. I wanted to use the existing Google Doc,
>>>>>>>>>>>> but it has really bad table/sheet support...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>>>>
>>>>>>>>>>>> I have listed all the options, with how they are implemented
>>>>>>>>>>>> and some important considerations we have discussed so far. Note
>>>>>>>>>>>> that:
>>>>>>>>>>>> 1. This sheet currently excludes the lineage information, which
>>>>>>>>>>>> we can discuss more later after the current topic is resolved.
>>>>>>>>>>>> 2. I removed the considerations for REST integration since from
>>>>>>>>>>>> the other thread we have clarified that they should be considered
>>>>>>>>>>>> completely separately.
>>>>>>>>>>>>
>>>>>>>>>>>> *Why I come as a proponent of having a new MV object with table
>>>>>>>>>>>> and view metadata file pointer*
>>>>>>>>>>>>
>>>>>>>>>>>> In my sheet, there are 3 options that do not have major
>>>>>>>>>>>> problems:
>>>>>>>>>>>> Option 2: Add storage table metadata file pointer in view
>>>>>>>>>>>> object
>>>>>>>>>>>> Option 5: New MV object with table and view metadata file
>>>>>>>>>>>> pointer
>>>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>>>
>>>>>>>>>>>> I originally excluded option 2 because I thought it did not
>>>>>>>>>>>> align with the REST spec, but after the other discussion thread
>>>>>>>>>>>> about "Inconsistency between REST spec and table/view spec", I
>>>>>>>>>>>> think my original concern no longer holds true, so now I have put
>>>>>>>>>>>> it back. Based on my personal preference that MV is an independent
>>>>>>>>>>>> object that should be separated from view and table, plus the fact
>>>>>>>>>>>> that option 5 is probably less work than option 6 for
>>>>>>>>>>>> implementation, that is how I come to be a proponent of option 5
>>>>>>>>>>>> at this moment.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>>>>
>>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>>>>>>> framework. That framework's categorization puts options 2, 3, 4,
>>>>>>>>>>>> 5, and 6 all under the same category of "A combination of a view
>>>>>>>>>>>> and a table" and concludes that they don't have any advantage for
>>>>>>>>>>>> the same set of reasons. But those reasons are not really
>>>>>>>>>>>> convincing to me, so let's talk about them in more detail.
>>>>>>>>>>>>
>>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view and
>>>>>>>>>>>> table is advantageous" as "this would cause unnecessary dependence
>>>>>>>>>>>> between the view and table in catalogs."  What dependency exactly
>>>>>>>>>>>> do you mean here? And why is that unnecessary, given there has to
>>>>>>>>>>>> be some sort of dependency anyway unless we go with option 5 or 6?
>>>>>>>>>>>>
>>>>>>>>>>>> (2) You said "I guess there’s an argument that you could load
>>>>>>>>>>>> both table and view metadata locations at the same time. That
>>>>>>>>>>>> hardly seems worth the trouble". I disagree with that. Catalog
>>>>>>>>>>>> interaction performance is critical to at least everyone working
>>>>>>>>>>>> in EMR and Athena, and MV itself as an acceleration approach needs
>>>>>>>>>>>> to be as fast as possible.
>>>>>>>>>>>>
>>>>>>>>>>>> I have put 3 key operations in the doc that I think matter for
>>>>>>>>>>>> MV during interactions with an engine:
>>>>>>>>>>>> 1. refresh the storage table
>>>>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>>>>
>>>>>>>>>>>> And option 1 clearly falls short with 4 sequential steps
>>>>>>>>>>>> required to load a storage table. You mentioned "recent issues
>>>>>>>>>>>> with adding views to the JDBC catalog" in this topic; could you
>>>>>>>>>>>> explain a bit more?
>>>>>>>>>>>>
>>>>>>>>>>>> (3) You said "I also think that once we decide on structure, we
>>>>>>>>>>>> can make it possible for REST catalog implementations to do smart
>>>>>>>>>>>> things, in a way that doesn’t put additional requirements on the
>>>>>>>>>>>> underlying catalog store." If REST is fully compatible with the
>>>>>>>>>>>> Iceberg spec then I have no problem with this statement. However,
>>>>>>>>>>>> as we discussed in the other thread, that is not the case. In the
>>>>>>>>>>>> current state, I think the sequence of action should be to evolve
>>>>>>>>>>>> the Iceberg table/view spec (or add a MV spec) first, and then
>>>>>>>>>>>> think about how REST can incorporate it or do smart things that
>>>>>>>>>>>> are not Iceberg spec compliant. Do you agree with that?
>>>>>>>>>>>>
>>>>>>>>>>>> (4) You said the table identifier pointer "is a problem we need
>>>>>>>>>>>> to solve generally because a materialized table needs to be able
>>>>>>>>>>>> to track the upstream state of tables that were used". I don't
>>>>>>>>>>>> think that is a reason to choose to use a table identifier pointer
>>>>>>>>>>>> for a storage table. The issue is not about using a table
>>>>>>>>>>>> identifier pointer. It is about exposing the storage table as a
>>>>>>>>>>>> separate entity in the catalog, which is what people do not like
>>>>>>>>>>>> and is already discussed at length in Jan's question 3 (also
>>>>>>>>>>>> linked in the sheet). I agree with that statement, because without
>>>>>>>>>>>> a REST implementation that can magically hide the storage table,
>>>>>>>>>>>> this model adds additional burden regarding compliance and data
>>>>>>>>>>>> governance for any other non-REST catalog implementations that are
>>>>>>>>>>>> compliant with the Iceberg spec. Many mechanisms need to be built
>>>>>>>>>>>> in a catalog to hide, protect, maintain, and recycle the storage
>>>>>>>>>>>> table, which can be avoided by using other approaches. I think we
>>>>>>>>>>>> should reach a consensus about that and discuss further if you do
>>>>>>>>>>>> not agree.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ryan, we actually discussed your categories in this
>>>>>>>>>>>>> question
>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>>>>>>>>>>>>> Your categories correspond to the following designs:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories,
>>>>>>>>>>>>> so I’ll be more specific:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - *Separate table and view*: this option is to have the
>>>>>>>>>>>>>    objects that we have today, with extra metadata. Commit
>>>>>>>>>>>>>    processes are separate: committing to the table doesn’t alter
>>>>>>>>>>>>>    the view and committing to the view doesn’t change the table.
>>>>>>>>>>>>>    However, changing the view can make it so the table is no
>>>>>>>>>>>>>    longer useful as a materialization.
>>>>>>>>>>>>>    - *A combination of a view and a table*: in this option,
>>>>>>>>>>>>>    the table metadata and view metadata are the same as the first
>>>>>>>>>>>>>    option. The difference is that the commit process combines
>>>>>>>>>>>>>    them, either by embedding a table metadata location in view
>>>>>>>>>>>>>    metadata or by tracking both in the same catalog reference.
>>>>>>>>>>>>>    - *A new metadata type*: this option is where we define a
>>>>>>>>>>>>>    new metadata object that has view attributes, like SQL
>>>>>>>>>>>>>    representations, along with table attributes, like partition
>>>>>>>>>>>>>    specs and snapshots.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hopefully this is clear because I think much of the confusion
>>>>>>>>>>>>> is caused by different definitions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > The LoadTableResponse having optional metadata-location field
>>>>>>>>>>>>> implies that the object in the catalog no longer needs to hold a
>>>>>>>>>>>>> metadata file pointer
>>>>>>>>>>>>>
>>>>>>>>>>>>> The REST protocol has not removed the requirement for a
>>>>>>>>>>>>> metadata file, so I’m going to keep focused on the MV design
>>>>>>>>>>>>> options.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > When we say a MV can be a “new metadata type”, it does not
>>>>>>>>>>>>> mean it needs to define a completely brand new structure of the
>>>>>>>>>>>>> metadata content
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’m making a distinction between separate metadata files for
>>>>>>>>>>>>> the table and the view and a combined metadata object, as above.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > We can define an “Iceberg MV” to be an object in a catalog,
>>>>>>>>>>>>> which has 1 table metadata file pointer, and 1 view metadata file
>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the option I am referring to as a “combination of a
>>>>>>>>>>>>> view and a table”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So to review my initial email, I don’t see a reason why a
>>>>>>>>>>>>> combined view and table is advantageous, either implemented by
>>>>>>>>>>>>> having a catalog reference with two metadata locations or
>>>>>>>>>>>>> embedding a table metadata location in view metadata. This would
>>>>>>>>>>>>> cause unnecessary dependence between the view and table in
>>>>>>>>>>>>> catalogs. I guess there’s an argument that you could load both
>>>>>>>>>>>>> table and view metadata locations at the same time. That hardly
>>>>>>>>>>>>> seems worth the trouble given the recent issues with adding views
>>>>>>>>>>>>> to the JDBC catalog.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>>>>>>>> possible for REST catalog implementations to do smart things, in
>>>>>>>>>>>>> a way that doesn’t put additional requirements on the underlying
>>>>>>>>>>>>> catalog store. For instance, we could specify how to send
>>>>>>>>>>>>> additional objects in a LoadViewResult, in case the catalog wants
>>>>>>>>>>>>> to pre-fetch table metadata. I think these optimizations are a
>>>>>>>>>>>>> later addition, after we define the relationship between views
>>>>>>>>>>>>> and tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>>>>> and view (rather than a new metadata spec for a materialized
>>>>>>>>>>>>> view). What is the main motivation? It seems like you’re
>>>>>>>>>>>>> convinced of that approach, but I don’t understand the advantage
>>>>>>>>>>>>> it brings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few
>>>>>>>>>>>>>> minor points.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new
>>>>>>>>>>>>>>> metadata type?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial
>>>>>>>>>>>>>> proposal of a new Catalog MV object that has two references
>>>>>>>>>>>>>> (ViewMetadata + TableMetadata).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Regular views are separate, rather than being tables
>>>>>>>>>>>>>>>    with SQL and no data so it would be inconsistent (“Iceberg
>>>>>>>>>>>>>>>    view is just a table with no data but with representations
>>>>>>>>>>>>>>>    defined. But we did not do that.”)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Tables are not typically exposed to end users — but
>>>>>>>>>>>>>>>    this isn’t required by the separate view and table option
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack
>>>>>>>>>>>>>>    says it is a spec change (ie, to catalogs)
>>>>>>>>>>>>>>    - A single call to get the View's StorageTable (versus
>>>>>>>>>>>>>>    two calls)
>>>>>>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Thoughts:* I think the long discussion sessions we had on
>>>>>>>>>>>>>> Slack were fruitful for me, as seeing the API clarified some
>>>>>>>>>>>>>> things.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV
>>>>>>>>>>>>>> operations end up being ViewCatalog or Catalog operations, I am
>>>>>>>>>>>>>> starting to think API-wise that it may not align with the new
>>>>>>>>>>>>>> metadata type (unless we define MVCatalog and /MV REST
>>>>>>>>>>>>>> endpoints, which then are boilerplate wrappers).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Initially one question I had for option 'a view and a
>>>>>>>>>>>>>> separate table', was how to make this table reference
>>>>>>>>>>>>>> (metadata.json or catalog reference).  In the previous option,
>>>>>>>>>>>>>> we had a precedent of Catalog references to Metadata, but not
>>>>>>>>>>>>>> pointers between Metadatas.  I initially saw the proposed
>>>>>>>>>>>>>> Catalog's TableIdentifier pointer as 'polluting' catalog
>>>>>>>>>>>>>> concerns in ViewMetadata.  (I saw Catalog and ViewCatalog as a
>>>>>>>>>>>>>> layer above TableMetadata and ViewMetadata).  But I think Dan in
>>>>>>>>>>>>>> the Slack made a fair point that ViewMetadata already is tightly
>>>>>>>>>>>>>> bound with a Catalog.  In this case, I think this approach does
>>>>>>>>>>>>>> have its merits as well in aligning Catalog APIs with the
>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to provide my perspective on the question of
>>>>>>>>>>>>>>> what a materialized view is and elaborate on Jack's recent
>>>>>>>>>>>>>>> proposal to view a materialized view as a catalog concept.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity
>>>>>>>>>>>>>>> in the catalog has a *unique identifier*, and the catalog
>>>>>>>>>>>>>>> provides methods to create, load, and update these entities.
>>>>>>>>>>>>>>> An important thing to note is that the catalog methods exhibit
>>>>>>>>>>>>>>> two different behaviors: the *create and load methods deal with
>>>>>>>>>>>>>>> the entire entity*, while the *update(commit) method only deals
>>>>>>>>>>>>>>> with partial changes* to the entities.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the context of our current discussion, materialized view
>>>>>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact
>>>>>>>>>>>>>>> that the update method deals only with partial changes enables
>>>>>>>>>>>>>>> us to *reuse the existing methods for updating tables and
>>>>>>>>>>>>>>> views*. For updates we don't have to define what constitutes an
>>>>>>>>>>>>>>> entire materialized view. Changes to a materialized view
>>>>>>>>>>>>>>> targeting the properties related to the view metadata could use
>>>>>>>>>>>>>>> the update(commit) view method. Similarly, changes targeting
>>>>>>>>>>>>>>> the properties related to the table metadata could use the
>>>>>>>>>>>>>>> update(commit) table method. This is great news because we
>>>>>>>>>>>>>>> don't have to redefine view and table commits (requirements,
>>>>>>>>>>>>>>> updates).
>>>>>>>>>>>>>>> This is shown in the fact that Jack uses the same operation
>>>>>>>>>>>>>>> to update the storage table for Option 1 and 3:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The open question is *whether the create and load methods
>>>>>>>>>>>>>>> should treat the properties that constitute the MV metadata as
>>>>>>>>>>>>>>> two entities (View + Table) or one entity (new MV object)*.
>>>>>>>>>>>>>>> This is all part of Jack's proposal, where Option 1 proposes a
>>>>>>>>>>>>>>> new MV object, and Option 3 proposes two separate entities. The
>>>>>>>>>>>>>>> advantage of Option 1 is that it doesn't require two operations
>>>>>>>>>>>>>>> to load the metadata. On the other hand, the advantage of
>>>>>>>>>>>>>>> Option 3 is that no new operations or catalogs have to be
>>>>>>>>>>>>>>> defined.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In my opinion, defining a new representation for
>>>>>>>>>>>>>>> materialized views (Option 1) is generally the cleaner 
>>>>>>>>>>>>>>> solution. However, I
>>>>>>>>>>>>>>> see a path where we could first introduce Option 3 and still 
>>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>> possibility to transition to Option 1 if needed. The great 
>>>>>>>>>>>>>>> thing about
>>>>>>>>>>>>>>> Option 3 is that it only requires minor changes to the current 
>>>>>>>>>>>>>>> spec and is
>>>>>>>>>>>>>>> mostly implementation detail.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3
>>>>>>>>>>>>>>> that only introduce changes to the spec that are not specific to
>>>>>>>>>>>>>>> materialized views. The idea is to introduce boolean properties 
>>>>>>>>>>>>>>> to be set
>>>>>>>>>>>>>>> on the creation of the view and the storage table that indicate 
>>>>>>>>>>>>>>> that they
>>>>>>>>>>>>>>> belong to a materialized view. The view property "materialized" 
>>>>>>>>>>>>>>> is set to
>>>>>>>>>>>>>>> "true" for an MV and "false" for a regular view. And the table 
>>>>>>>>>>>>>>> property
>>>>>>>>>>>>>>> "storage_table" is set to "true" for a storage table and 
>>>>>>>>>>>>>>> "false" for a
>>>>>>>>>>>>>>> regular table. The absence of these properties indicates a 
>>>>>>>>>>>>>>> regular view or
>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>
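To make the property convention above concrete, here is a minimal sketch of how a client might interpret the two flags. The class `MvProperties` and its method names are hypothetical and not part of the Iceberg API; only the property keys "materialized" and "storage_table" come from the proposal. The sketch assumes properties are exposed as a plain string map.

```java
import java.util.Map;

// Hypothetical helper (not an Iceberg class): reads the two proposed
// boolean properties from plain string property maps.
final class MvProperties {
    static final String MATERIALIZED = "materialized";   // proposed view property
    static final String STORAGE_TABLE = "storage_table"; // proposed table property

    private MvProperties() {}

    // True only if the view property is set to "true".
    // A missing property means a regular view, as the proposal states.
    static boolean isMaterializedView(Map<String, String> viewProperties) {
        return Boolean.parseBoolean(viewProperties.get(MATERIALIZED));
    }

    // True only if the table property is set to "true".
    // A missing property means a regular table.
    static boolean isStorageTable(Map<String, String> tableProperties) {
        return Boolean.parseBoolean(tableProperties.get(STORAGE_TABLE));
    }
}
```

Because `Boolean.parseBoolean` returns false for null, an absent property and an explicit "false" are treated the same way, which matches the rule that absence indicates a regular view or table.
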
>>>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We could then introduce a new requirement for views and
>>>>>>>>>>>>>>> tables called "AssertProperty" which could make sure to only 
>>>>>>>>>>>>>>> perform
>>>>>>>>>>>>>>> updates that are in line with materialized views. The additional 
>>>>>>>>>>>>>>> requirement
>>>>>>>>>>>>>>> can be seen as a general extension which does not need to be 
>>>>>>>>>>>>>>> changed if we
>>>>>>>>>>>>>>> decide to go with Option 1 in the future.
>>>>>>>>>>>>>>>
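As a rough illustration of the "AssertProperty" idea, the sketch below checks a single property against an expected value before a commit is applied. Only the name `AssertProperty` comes from the proposal. The constructor, the `validate` method, and the choice of exception are assumptions made for illustration and are not an existing Iceberg or REST spec definition.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical commit requirement (shape assumed, not defined by any spec):
// the commit may proceed only if the current metadata carries the expected value.
final class AssertProperty {
    private final String key;
    private final String expected; // null means "the property must be absent"

    AssertProperty(String key, String expected) {
        this.key = Objects.requireNonNull(key, "key");
        this.expected = expected;
    }

    // Throws IllegalStateException if the current properties do not match.
    void validate(Map<String, String> currentProperties) {
        String actual = currentProperties.get(key);
        if (!Objects.equals(expected, actual)) {
            throw new IllegalStateException(
                "Requirement failed: property '" + key + "' expected "
                    + expected + " but was " + actual);
        }
    }
}
```

For example, a writer refreshing a storage table could attach `new AssertProperty("storage_table", "true")` to its commit, so the update is rejected if the target turns out to be a regular table. The check is generic over property keys, so it would not be tied to materialized views.
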
>>>>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>>>>>>> metadata definitions and minimizing spec changes are very 
>>>>>>>>>>>>>>> important. This
>>>>>>>>>>>>>>> also minimizes spec drift (between materialized views and views 
>>>>>>>>>>>>>>> spec, and
>>>>>>>>>>>>>>> between materialized views and tables spec), and simplifies the
>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In an effort to take the discussion forward with concrete
>>>>>>>>>>>>>>> design options based on an end-to-end implementation, I have 
>>>>>>>>>>>>>>> prototyped the
>>>>>>>>>>>>>>> implementation (and added Spark support) in this PR
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it
>>>>>>>>>>>>>>> helps us reach convergence faster. More details about some of 
>>>>>>>>>>>>>>> the design
>>>>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>>>>> combined through a commit process. For instance, keeping a 
>>>>>>>>>>>>>>>> pointer to a
>>>>>>>>>>>>>>>> table metadata file in a view metadata file or combining 
>>>>>>>>>>>>>>>> commits to
>>>>>>>>>>>>>>>> reference both. I don't see the value in either option.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root
>>>>>>>>>>>>>>>>> question! Just a clarification question regarding your reply 
>>>>>>>>>>>>>>>>> before I reply
>>>>>>>>>>>>>>>>> further: what exactly does the option "a combination of the 
>>>>>>>>>>>>>>>>> two (i.e.
>>>>>>>>>>>>>>>>> commits are combined)" mean? How is that different from "a 
>>>>>>>>>>>>>>>>> new metadata
>>>>>>>>>>>>>>>>> type"?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
