Re: Materialized view integration with REST spec

Jack Ye Mon, 04 Mar 2024 09:58:42 -0800

Thanks Jan! +1 for everyone to take a look before the discussion, and see
if there are any missing options or major arguments.


I have also added images for all the options, which might be easier
to parse than the big sheet. I will also put them here for people who do not
have time to read through it:


*Option 1: Add storage table identifier in view metadata content*

[image: MV option 1.png]

*Option 2: Add storage table metadata file pointer in view object*

[image: MV option 2.png]

*Option 3: Add storage table metadata file pointer in view metadata content*

[image: MV option 3.png]

*Option 4: Embed table metadata in view metadata content*

[image: MV option 4.png]

*Option 5: New MV spec, MV object has table and view metadata file pointers*

[image: MV option 5.png]

*Option 6: New MV spec, MV metadata content embeds table and view metadata*

[image: MV option 6.png]

*Option 7: New MV spec, completely new MV metadata content*

[image: MV option 7.png]

-Jack


On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> I think it's great to have a face to face discussion about this.
> Additionally, I would propose to use Jack's document
> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
> as a common ground for the discussion and that everyone has a quick look
> before the next community sync. If you think the document is still missing
> some arguments, please make suggestions to add them. This way we have to
> spend less time to get everyone up to speed and have a more common
> terminology.
>
> Looking forward to the discussion, best wishes
>
> Jan
> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>
> The calendar on the site is currently broken
> https://iceberg.apache.org/community/#iceberg-community-events. Might
> help to fix it or share the meeting link here.
>
> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Sounds good, let's discuss this in person!
>>
>> I am a bit worried that we have quite a few critical topics going on
>> right now on devlist, and this will take up a lot of time to discuss. If it
>> ends up going for too long, I propose we have a dedicated meeting, and
>> I am more than happy to organize it.
>>
>> Best,
>> Jack Ye
>>
>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Hey everyone,
>>>
>>> I think this thread has hit a point of diminishing returns and that we
>>> still don't have a common understanding of what the options under
>>> consideration actually are.
>>>
>>> Since we were already planning on discussing this at the next community
>>> sync, I suggest we pick this up there and use that time to align on what
>>> exactly we're considering. We can then start a new thread to lay out the
>>> designs under consideration in more detail and then have a discussion about
>>> trade-offs.
>>>
>>> Does that sound reasonable?
>>>
>>> Ryan
>>>
>>>
>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> I am finding it hard to interpret the options concretely. I would also
>>>> suggest breaking the expectation/outcome to milestones. Maybe it becomes
>>>> easier if we agree to distinguish between an approach that is feasible in
>>>> the near term and another in the long term, especially if the latter
>>>> requires significant engine-side changes.
>>>>
>>>> Further, maybe it helps if we start with an option that fully reuses
>>>> the existing spec, and see how we view it in comparison with the options
>>>> discussed previously. I am sharing one below. It reuses the current spec of
>>>> Iceberg views and tables by leveraging table properties to capture
>>>> materialized view metadata. What is common (and not common) between this
>>>> and the desired representations?
>>>>
>>>> The new properties are:
>>>>
>>>> Properties on a View:
>>>>
>>>>    1. *iceberg.materialized.view*:
>>>>       - *Type*: View property
>>>>       - *Purpose*: This property is used to mark whether a view is a
>>>>       materialized view. If set to true, the view is treated as a
>>>>       materialized view. This helps in differentiating between virtual and
>>>>       materialized views within the catalog and dictates specific handling
>>>>       and validation logic for materialized views.
>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>       - *Type*: View property
>>>>       - *Purpose*: Specifies the location of the storage table
>>>>       associated with the materialized view. This property is used for
>>>>       linking a materialized view with its corresponding storage table,
>>>>       enabling data management and query execution based on the stored
>>>>       data freshness.
>>>>
>>>> Properties on a Table:
>>>>
>>>>    1. *base.snapshot.[UUID]*:
>>>>       - *Type*: Table property
>>>>       - *Purpose*: These properties store the snapshot IDs of the base
>>>>       tables at the time the materialized view's data was last updated.
>>>>       Each property is prefixed with base.snapshot. followed by the UUID of
>>>>       the base table. They are used to track whether the materialized
>>>>       view's data is up to date with the base tables by comparing these
>>>>       snapshot IDs with the current snapshot IDs of the base tables. If all
>>>>       the base tables' current snapshot IDs match the ones stored in these
>>>>       properties, the materialized view's data is considered fresh.
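
The freshness rule described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Iceberg API: the class `MvFreshness`, its method, and the decision to treat a base table with no recorded property as stale are hypothetical; only the `base.snapshot.` prefix comes from the proposal.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Iceberg library.
final class MvFreshness {
    // Prefix taken from the proposal: base.snapshot.[UUID]
    static final String PREFIX = "base.snapshot.";

    /**
     * @param storageTableProps  properties of the storage table
     * @param currentSnapshotIds base table UUID -> current snapshot ID
     * @return true only if every base table's current snapshot ID equals
     *         the one recorded on the storage table
     */
    static boolean isFresh(Map<String, String> storageTableProps,
                           Map<String, Long> currentSnapshotIds) {
        for (Map.Entry<String, Long> base : currentSnapshotIds.entrySet()) {
            String recorded = storageTableProps.get(PREFIX + base.getKey());
            // Assumption: a base table without a recorded snapshot is stale.
            if (recorded == null
                    || !recorded.equals(String.valueOf(base.getValue()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
                PREFIX + "uuid-a", "100",
                PREFIX + "uuid-b", "200");
        // Both base tables are still at the recorded snapshots.
        System.out.println(isFresh(props, Map.of("uuid-a", 100L, "uuid-b", 200L)));
        // One base table has moved on to a newer snapshot.
        System.out.println(isFresh(props, Map.of("uuid-a", 101L, "uuid-b", 200L)));
    }
}
```

Note that the check needs the current snapshot ID of every base table, so its cost grows with the number of base tables that have to be loaded.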
>>>>
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> > All of these approaches are aligned in one, specific way: the
>>>>> storage table is an iceberg table.
>>>>>
>>>>> I do not think that is true. I think people are aligned that we would
>>>>> like to re-use the Iceberg table metadata defined in the Iceberg table
>>>>> spec to express the data in MV, but I don't think it goes so far as to
>>>>> say it must be an Iceberg table. Once you have that mindset, then of
>>>>> course option 1 (separate table and view) is the only option.
>>>>>
>>>>> > I don't think that is necessary and it significantly increases the
>>>>> complexity.
>>>>>
>>>>> And can you quantify what you mean by "significantly increases the
>>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff
>>>>> with complexity. We probably all agree that using option 7 (a completely
>>>>> new metadata type) is a lot of work from scratch, that is why it is not
>>>>> favored. However, my understanding is that as long as we re-use the view
>>>>> and table metadata, then the majority of the existing logic can be reused.
>>>>> I think what we have gone through in Slack to draft the rough Java API
>>>>> shape helps here, because people can estimate the amount of effort
>>>>> required to implement it. And I don't think they are **significantly**
>>>>> more complex to implement. Could you elaborate more about the complexity
>>>>> that you imagine?
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I feel I've been most vocal about pushing back against options 2+ (or
>>>>>> Ryan's categories of combined table/view, or new metadata type), so I'll
>>>>>> try to expand on my reasoning.
>>>>>>
>>>>>> I understand the appeal of creating a design where we encapsulate the
>>>>>> view/storage from both a structural and performance standpoint, but I
>>>>>> don't think that is necessary and it significantly increases the
>>>>>> complexity.
>>>>>>
>>>>>> All of these approaches are aligned in one, specific way: the storage
>>>>>> table is an iceberg table.
>>>>>>
>>>>>> Because of this, all the behaviors and requirements still apply to
>>>>>> these tables.  They need to be maintained (snapshot cleanup, orphan
>>>>>> files), in some cases need to be optimized (compaction, manifest
>>>>>> rewrites), and they need to be inspectable (this will be even more
>>>>>> important with MV since staleness can produce different results and
>>>>>> questions will arise about what state the storage table was in).  There
>>>>>> may be cases where the tables need to be managed directly.
>>>>>>
>>>>>> Anywhere we deviate from the existing constructs/commit/access for
>>>>>> tables, we will ultimately have to then unwrap to re-expose the
>>>>>> underlying Iceberg behavior.  This creates unnecessary complexity in the
>>>>>> library/API layer, which is not the primary interface users will have
>>>>>> with materialized views, where an engine is almost entirely necessary to
>>>>>> interact with the dataset.
>>>>>>
>>>>>> As to the performance concerns around option 1, I think we're
>>>>>> overstating the downsides.  It really comes down to how many metadata
>>>>>> loads are necessary, and evaluating freshness would likely be the real
>>>>>> bottleneck as it involves potentially loading many tables.  All of the
>>>>>> options are on the same order of performance for the metadata and table
>>>>>> loads.
>>>>>>
>>>>>> As to the visibility of tables and whether they're registered in the
>>>>>> catalog, I think registering in the catalog is the right approach so that
>>>>>> the tables are still addressable for maintenance/etc.  The visibility of
>>>>>> the storage table is a catalog implementation decision and shouldn't be a
>>>>>> requirement of the MV spec (I can see cases for both and it isn't
>>>>>> necessary to dictate a behavior).
>>>>>>
>>>>>> I'm still strongly in favor of Option 1 (separate table and view) for
>>>>>> these reasons.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> > Jack, it sounds like you’re the proponent of a combined table and
>>>>>>> view (rather than a new metadata spec for a materialized view). What is
>>>>>>> the main motivation? It seems like you’re convinced of that approach,
>>>>>>> but I don’t understand the advantage it brings.
>>>>>>>
>>>>>>> Sorry, I had to make a Google Sheet to capture all the options we
>>>>>>> have discussed so far. I wanted to use the existing Google Doc, but it
>>>>>>> has really bad table/sheet support...
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>
>>>>>>> I have listed all the options, with how they are implemented and
>>>>>>> some important considerations we have discussed so far. Note that:
>>>>>>> 1. This sheet currently excludes the lineage information, which we
>>>>>>> can discuss more later after the current topic is resolved.
>>>>>>> 2. I removed the considerations for REST integration since from the
>>>>>>> other thread we have clarified that they should be considered completely
>>>>>>> separately.
>>>>>>>
>>>>>>> *Why I come as a proponent of having a new MV object with table and
>>>>>>> view metadata file pointer*
>>>>>>>
>>>>>>> In my sheet, there are 3 options that do not have major problems:
>>>>>>> Option 2: Add storage table metadata file pointer in view object
>>>>>>> Option 5: New MV object with table and view metadata file pointer
>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>
>>>>>>> I originally excluded option 2 because I thought it did not align
>>>>>>> with the REST spec, but after the other discussion thread about
>>>>>>> "Inconsistency between REST spec and table/view spec", I think my
>>>>>>> original concern no longer holds, so I have put it back. Based on my
>>>>>>> personal preference that MV is an independent object that should be
>>>>>>> separated from view and table, plus the fact that option 5 is probably
>>>>>>> less work to implement than option 6, I am a proponent of option 5
>>>>>>> at this moment.
>>>>>>>
>>>>>>>
>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>
>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6 all
>>>>>>> under the same category of "A combination of a view and a table"
>>>>>>> and concludes that they don't have any advantage for the same set of
>>>>>>> reasons. But those reasons are not really convincing to me so let's talk
>>>>>>> about them in more detail.
>>>>>>>
>>>>>>> (1) You said "I don’t see a reason why a combined view and table is
>>>>>>> advantageous" as "this would cause unnecessary dependence between the
>>>>>>> view and table in catalogs."  What dependency exactly do you mean here?
>>>>>>> And why is that unnecessary, given there has to be some sort of
>>>>>>> dependency anyway unless we go with option 5 or 6?
>>>>>>>
>>>>>>> (2) You said "I guess there’s an argument that you could load both
>>>>>>> table and view metadata locations at the same time. That hardly seems
>>>>>>> worth the trouble". I disagree with that. Catalog interaction
>>>>>>> performance is critical to at least everyone working in EMR and Athena,
>>>>>>> and MV itself as an acceleration approach needs to be as fast as
>>>>>>> possible.
>>>>>>>
>>>>>>> I have put 3 key operations in the doc that I think matter for MV
>>>>>>> during interactions with an engine:
>>>>>>> 1. refresh the storage table
>>>>>>> 2. get the storage table of the MV
>>>>>>> 3. if stale, get the view SQL
>>>>>>>
>>>>>>> And option 1 clearly falls short with 4 sequential steps required to
>>>>>>> load a storage table. You mentioned "recent issues with adding views to
>>>>>>> the JDBC catalog" in this topic, could you explain a bit more?
>>>>>>>
>>>>>>> (3) You said "I also think that once we decide on structure, we can
>>>>>>> make it possible for REST catalog implementations to do smart things,
>>>>>>> in a way that doesn’t put additional requirements on the underlying
>>>>>>> catalog store." If REST is fully compatible with Iceberg spec then I
>>>>>>> have no problem with this statement. However, as we discussed in the
>>>>>>> other thread, it is not the case. In the current state, I think the
>>>>>>> sequence of action should be to evolve the Iceberg table/view spec (or
>>>>>>> add a MV spec) first, and then think about how REST can incorporate it
>>>>>>> or do smart things that are not Iceberg spec compliant. Do you agree
>>>>>>> with that?
>>>>>>>
>>>>>>> (4) You said the table identifier pointer "is a problem we need to
>>>>>>> solve generally because a materialized table needs to be able to track
>>>>>>> the upstream state of tables that were used". I don't think that is a
>>>>>>> reason to choose to use a table identifier pointer for a storage table.
>>>>>>> The issue is not about using a table identifier pointer. It is about
>>>>>>> exposing the storage table as a separate entity in the catalog, which is
>>>>>>> what people do not like and is already discussed at length in Jan's
>>>>>>> question 3 (also linked in the sheet). I agree with that statement,
>>>>>>> because without a REST implementation that can magically hide the
>>>>>>> storage table, this model adds additional burden regarding compliance
>>>>>>> and data governance for any other non-REST catalog implementations that
>>>>>>> are compliant with the Iceberg spec. Many mechanisms need to be built in
>>>>>>> a catalog to hide, protect, maintain, and recycle the storage table,
>>>>>>> which can be avoided by using other approaches. I think we should reach
>>>>>>> a consensus about that and discuss further if you do not agree.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>>>>> where your categories correspond to the following designs:
>>>>>>>>
>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>
>>>>>>>> Jan
>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>
>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so
>>>>>>>> I’ll be more specific:
>>>>>>>>
>>>>>>>>    - *Separate table and view*: this option is to have the objects
>>>>>>>>    that we have today, with extra metadata. Commit processes are
>>>>>>>>    separate: committing to the table doesn’t alter the view and
>>>>>>>>    committing to the view doesn’t change the table. However, changing
>>>>>>>>    the view can make it so the table is no longer useful as a
>>>>>>>>    materialization.
>>>>>>>>    - *A combination of a view and a table*: in this option, the
>>>>>>>>    table metadata and view metadata are the same as the first option.
>>>>>>>>    The difference is that the commit process combines them, either by
>>>>>>>>    embedding a table metadata location in view metadata or by tracking
>>>>>>>>    both in the same catalog reference.
>>>>>>>>    - *A new metadata type*: this option is where we define a new
>>>>>>>>    metadata object that has view attributes, like SQL representations,
>>>>>>>>    along with table attributes, like partition specs and snapshots.
>>>>>>>>
>>>>>>>> Hopefully this is clear because I think much of the confusion is
>>>>>>>> caused by different definitions.
>>>>>>>>
>>>>>>>> The LoadTableResponse having optional metadata-location field
>>>>>>>> implies that the object in the catalog no longer needs to hold a
>>>>>>>> metadata file pointer
>>>>>>>>
>>>>>>>> The REST protocol has not removed the requirement for a metadata
>>>>>>>> file, so I’m going to keep focused on the MV design options.
>>>>>>>>
>>>>>>>> When we say a MV can be a “new metadata type”, it does not mean it
>>>>>>>> needs to define a completely brand new structure of the metadata
>>>>>>>> content
>>>>>>>>
>>>>>>>> I’m making a distinction between separate metadata files for the
>>>>>>>> table and the view and a combined metadata object, as above.
>>>>>>>>
>>>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which
>>>>>>>> has 1 table metadata file pointer, and 1 view metadata file pointer
>>>>>>>>
>>>>>>>> This is the option I am referring to as a “combination of a view
>>>>>>>> and a table”.
>>>>>>>>
>>>>>>>> So to review my initial email, I don’t see a reason why a combined
>>>>>>>> view and table is advantageous, either implemented by having a catalog
>>>>>>>> reference with two metadata locations or embedding a table metadata
>>>>>>>> location in view metadata. This would cause unnecessary dependence
>>>>>>>> between the view and table in catalogs. I guess there’s an argument
>>>>>>>> that you could load both table and view metadata locations at the same
>>>>>>>> time. That hardly seems worth the trouble given the recent issues with
>>>>>>>> adding views to the JDBC catalog.
>>>>>>>>
>>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>>> possible for REST catalog implementations to do smart things, in a way
>>>>>>>> that doesn’t put additional requirements on the underlying catalog
>>>>>>>> store. For instance, we could specify how to send additional objects in
>>>>>>>> a LoadViewResult, in case the catalog wants to pre-fetch table
>>>>>>>> metadata. I think these optimizations are a later addition, after we
>>>>>>>> define the relationship between views and tables.
>>>>>>>>
>>>>>>>> Jack, it sounds like you’re the proponent of a combined table and
>>>>>>>> view (rather than a new metadata spec for a materialized view). What
>>>>>>>> is the main motivation? It seems like you’re convinced of that
>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few minor
>>>>>>>>> points.
>>>>>>>>>
>>>>>>>>> is a materialized view a view and a separate table, a combination
>>>>>>>>>> of the two (i.e. commits are combined), or a new metadata type?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal
>>>>>>>>> of a new Catalog MV object that has two references (ViewMetadata +
>>>>>>>>> TableMetadata).
>>>>>>>>>
>>>>>>>>> The arguments that I see for a combined materialized view object
>>>>>>>>>> are:
>>>>>>>>>>
>>>>>>>>>>    - Regular views are separate, rather than being tables with
>>>>>>>>>>    SQL and no data so it would be inconsistent (“Iceberg view is
>>>>>>>>>>    just a table with no data but with representations defined. But
>>>>>>>>>>    we did not do that.”)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>    materialized views
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>
>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>
>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says it
>>>>>>>>>    is a spec change (ie, to catalogs)
>>>>>>>>>    - A single call to get the View's StorageTable (versus two
>>>>>>>>>    calls)
>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Thoughts:* I think the long discussion sessions we had on Slack
>>>>>>>>> were fruitful for me, as seeing the API clarified some things.
>>>>>>>>>
>>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV operations
>>>>>>>>> end up being ViewCatalog or Catalog operations, I am starting to think
>>>>>>>>> API-wise that it may not align with the new metadata type (unless we
>>>>>>>>> define MVCatalog and /MV REST endpoints, which then are boilerplate
>>>>>>>>> wrappers).
>>>>>>>>>
>>>>>>>>> Initially one question I had for option 'a view and a separate
>>>>>>>>> table', was how to make this table reference (metadata.json or catalog
>>>>>>>>> reference).  In the previous option, we had a precedent of Catalog
>>>>>>>>> references to Metadata, but not pointers between Metadatas.  I
>>>>>>>>> initially saw the proposed Catalog's TableIdentifier pointer as
>>>>>>>>> 'polluting' catalog concerns in ViewMetadata.  (I saw Catalog and
>>>>>>>>> ViewCatalog as a layer above TableMetadata and ViewMetadata).  But I
>>>>>>>>> think Dan in the Slack made a fair point that ViewMetadata already is
>>>>>>>>> tightly bound with a Catalog.  In this case, I think this approach does
>>>>>>>>> have its merits as well in aligning Catalog APIs with the metadata.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I would like to provide my perspective on the question of what a
>>>>>>>>>> materialized view is and elaborate on Jack's recent proposal to view
>>>>>>>>>> a materialized view as a catalog concept.
>>>>>>>>>>
>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in
>>>>>>>>>> the catalog has a *unique identifier*, and the catalog provides
>>>>>>>>>> methods to create, load, and update these entities. An important
>>>>>>>>>> thing to note is that the catalog methods exhibit two different
>>>>>>>>>> behaviors: the *create and load methods deal with the entire entity*,
>>>>>>>>>> while the *update(commit) method only deals with partial changes* to
>>>>>>>>>> the entities.
>>>>>>>>>>
>>>>>>>>>> In the context of our current discussion, materialized view (MV)
>>>>>>>>>> metadata is a union of view and table metadata. The fact that the
>>>>>>>>>> update method deals only with partial changes enables us to *reuse
>>>>>>>>>> the existing methods for updating tables and views*. For updates we
>>>>>>>>>> don't have to define what constitutes an entire materialized view.
>>>>>>>>>> Changes to a materialized view targeting the properties related to the
>>>>>>>>>> view metadata could use the update(commit) view method. Similarly,
>>>>>>>>>> changes targeting the properties related to the table metadata could
>>>>>>>>>> use the update(commit) table method. This is great news because we
>>>>>>>>>> don't have to redefine view and table commits (requirements, updates).
>>>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>>>> update the storage table for Option 1 and 3:
>>>>>>>>>>
>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>
>>>>>>>>>> The open question is *whether the create and load methods should
>>>>>>>>>> treat the properties that constitute the MV metadata as two entities
>>>>>>>>>> (View + Table) or one entity (new MV object)*. This is all part of
>>>>>>>>>> Jack's proposal, where Option 1 proposes a new MV object, and Option
>>>>>>>>>> 3 proposes two separate entities. The advantage of Option 1 is that it
>>>>>>>>>> doesn't require two operations to load the metadata. On the other
>>>>>>>>>> hand, the advantage of Option 3 is that no new operations or catalogs
>>>>>>>>>> have to be defined.
>>>>>>>>>>
>>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see a
>>>>>>>>>> path where we could first introduce Option 3 and still have the
>>>>>>>>>> possibility to transition to Option 1 if needed. The great thing about
>>>>>>>>>> Option 3 is that it only requires minor changes to the current spec
>>>>>>>>>> and is mostly implementation detail.
>>>>>>>>>>
>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3 that
>>>>>>>>>> only introduce changes to the spec that are not specific to
>>>>>>>>>> materialized views. The idea is to introduce boolean properties to be
>>>>>>>>>> set on the creation of the view and the storage table that indicate
>>>>>>>>>> that they belong to a materialized view. The view property
>>>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>>>> view. And the table property "storage_table" is set to "true" for a
>>>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>>>> properties indicates a regular view or table.
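
The marker properties proposed above could be read as follows. This is a minimal sketch, assuming view and table properties are available as plain string maps; the class `MvMarkers` and its method names are hypothetical, and the case-insensitive parsing of "true" is a choice of this sketch rather than something the proposal specifies. Only the property names "materialized" and "storage_table" come from the proposal.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Iceberg library.
final class MvMarkers {
    // Property names as given in the proposal.
    static final String MATERIALIZED = "materialized";
    static final String STORAGE_TABLE = "storage_table";

    // A view is a materialized view only if "materialized" is "true".
    // Boolean.parseBoolean(null) is false, so an absent property
    // means a regular view, as the proposal requires.
    static boolean isMaterializedView(Map<String, String> viewProperties) {
        return Boolean.parseBoolean(viewProperties.get(MATERIALIZED));
    }

    // A table is a storage table only if "storage_table" is "true";
    // an absent property means a regular table.
    static boolean isStorageTable(Map<String, String> tableProperties) {
        return Boolean.parseBoolean(tableProperties.get(STORAGE_TABLE));
    }

    public static void main(String[] args) {
        System.out.println(isMaterializedView(Map.of(MATERIALIZED, "true")));
        System.out.println(isMaterializedView(Map.of()));
        System.out.println(isStorageTable(Map.of(STORAGE_TABLE, "false")));
    }
}
```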
>>>>>>>>>>
>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>
>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>
>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>
>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>
>>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>>> called "AssertProperty" which could make sure to only perform
>>>>>>>>>> updates that are in line with materialized views. The additional
>>>>>>>>>> requirement can be seen as a general extension which does not need to
>>>>>>>>>> be changed if we decide to go with Option 1 in the future.
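
One way such a requirement might behave is sketched below. The name "AssertProperty" comes from the proposal, but everything else is an assumption for illustration: the constructor arguments, the `validate` method, and the use of `IllegalStateException` are hypothetical and do not reflect an existing Iceberg or REST catalog requirement type.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical commit requirement: the commit may proceed only if the
// entity's current property has the expected value.
final class AssertProperty {
    private final String key;
    private final String expected;

    AssertProperty(String key, String expected) {
        this.key = key;
        this.expected = expected;
    }

    // Throws if the current properties do not satisfy the requirement.
    void validate(Map<String, String> currentProperties) {
        String actual = currentProperties.get(key);
        if (!Objects.equals(expected, actual)) {
            throw new IllegalStateException(
                    "Requirement failed: property '" + key + "' is "
                            + actual + ", expected " + expected);
        }
    }

    public static void main(String[] args) {
        AssertProperty isStorageTable = new AssertProperty("storage_table", "true");
        // Passes: the target table is marked as a storage table.
        isStorageTable.validate(Map.of("storage_table", "true"));
        try {
            // Fails: a regular table has no such marker.
            isStorageTable.validate(Map.of());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the check is expressed purely in terms of a property key and value, it would not be tied to materialized views specifically, which matches the intent of keeping it a general extension.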
>>>>>>>>>>
>>>>>>>>>> Let me know what you think.
>>>>>>>>>>
>>>>>>>>>> Best wishes,
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>> metadata definitions and minimizing spec changes are very important.
>>>>>>>>>> This also minimizes spec drift (between materialized views and views
>>>>>>>>>> spec, and between materialized views and tables spec), and simplifies
>>>>>>>>>> the implementation.
>>>>>>>>>>
>>>>>>>>>> In an effort to take the discussion forward with concrete design
>>>>>>>>>> options based on an end-to-end implementation, I have prototyped the
>>>>>>>>>> implementation (and added Spark support) in this PR
>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us
>>>>>>>>>> reach convergence faster. More details about some of the design
>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I mean separate table and view metadata that is somehow combined
>>>>>>>>>>> through a commit process. For instance, keeping a pointer to a table
>>>>>>>>>>> metadata file in a view metadata file or combining commits to
>>>>>>>>>>> reference both. I don't see the value in either option.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question!
>>>>>>>>>>>> Just a clarification question regarding your reply before I reply
>>>>>>>>>>>> further: what exactly does the option "a combination of the two
>>>>>>>>>>>> (i.e. commits are combined)" mean? How is that different from "a
>>>>>>>>>>>> new metadata type"?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring
>>>>>>>>>>>>> a fresh perspective.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jack already pointed out that we need to start from the basics
>>>>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right
>>>>>>>>>>>>> now is the time for discussing trade-offs, not lining up and taking
>>>>>>>>>>>>> sides. I realize that wasn’t the intent with adding a vote, but
>>>>>>>>>>>>> that’s almost always the result. It’s too easy to use it as a
>>>>>>>>>>>>> stand-in for consensus and move on prematurely. I get the
>>>>>>>>>>>>> impression from the swirl in Slack that discussion has moved ahead
>>>>>>>>>>>>> of agreement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We’re still at the most basic question: is a materialized view
>>>>>>>>>>>>> a view and a separate table, a combination of the two (i.e.
>>>>>>>>>>>>> commits are combined), or a new metadata type?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some
>>>>>>>>>>>>> kind of “system table” (meaning hidden?) or if it is exposed in
>>>>>>>>>>>>> the catalog. That’s a later choice (already pointed out) and, I
>>>>>>>>>>>>> suspect, it should be delegated to catalog implementations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To simplify this a little, I think that we can eliminate the
>>>>>>>>>>>>> option to combine table and view commits. I don’t think there is
>>>>>>>>>>>>> a reason to combine the two. If separate, a table would track the
>>>>>>>>>>>>> view version used along with freshness information for referenced
>>>>>>>>>>>>> tables. If the table is automatically skipped when the version no
>>>>>>>>>>>>> longer matches the view, then no action needs to happen when a view
>>>>>>>>>>>>> definition changes. Similarly, the table can be updated
>>>>>>>>>>>>> independently without needing to also swap view metadata. This also
>>>>>>>>>>>>> aligns with the idea from the original doc that there can be
>>>>>>>>>>>>> multiple materialization tables for a view. Each should operate
>>>>>>>>>>>>> independently unless I’m missing something.
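
The skip-on-version-mismatch idea above can be illustrated with a small Java sketch. It is an assumption-laden illustration, not a proposed API: the `MaterializationPicker` and `Materialization` types, their fields, and the integer view version ID are hypothetical, and the separate base-table freshness check is deliberately left out.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these types exist in the Iceberg library.
final class MaterializationPicker {

    // A storage table together with the view version it was built from.
    static final class Materialization {
        final String tableName;
        final int viewVersionId;

        Materialization(String tableName, int viewVersionId) {
            this.tableName = tableName;
            this.viewVersionId = viewVersionId;
        }
    }

    // Returns the first materialization built from the current view
    // version. Tables built from other versions are simply skipped, so
    // changing the view definition requires no action on the tables.
    static Optional<Materialization> pick(int currentViewVersionId,
                                          List<Materialization> candidates) {
        return candidates.stream()
                .filter(m -> m.viewVersionId == currentViewVersionId)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Materialization> candidates = List.of(
                new Materialization("db1.mv1_storage_a", 1),
                new Materialization("db1.mv1_storage_b", 2));
        // An empty result means falling back to running the view SQL.
        String plan = pick(2, candidates)
                .map(m -> "scan " + m.tableName)
                .orElse("run view SQL");
        System.out.println(plan);
    }
}
```

With this shape, several materialization tables can coexist for one view, and each is usable or ignored on its own.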
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious
>>>>>>>>>>>>> so I’ll move on, but please stop here and reply if you disagree!
>>>>>>>>>>>>>
>>>>>>>>>>>>> That leaves the main two options, a view and a separate table
>>>>>>>>>>>>> linked by metadata, or, combined materialized view metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As the doc notes, the separate view and table option is
>>>>>>>>>>>>> simpler because it reuses existing metadata definitions and falls
>>>>>>>>>>>>> back to simple views. That is a significantly smaller spec and
>>>>>>>>>>>>> small is very, very important when it comes to specs. I think that
>>>>>>>>>>>>> the argument for a new definition of a materialized view needs to
>>>>>>>>>>>>> overcome this disadvantage.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Regular views are separate, rather than being tables
>>>>>>>>>>>>>    with SQL and no data so it would be inconsistent (“Iceberg
>>>>>>>>>>>>>    view is just a table with no data but with representations
>>>>>>>>>>>>>    defined. But we did not do that.”)
>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
