Sorry, I guess I have another, longer question:

*What do we even mean here when we use the terms table "metadata", view
"metadata", and new "metadata" type?*

This was clear before the REST spec was introduced, but is not so clear
now. Maybe this is a good time to clarify it.

If we look into the table/view spec, the optimistic concurrency section
<https://iceberg.apache.org/spec/#optimistic-concurrency> requires the
existence of a metadata file, and the atomic swap of the metadata file
ensures serializable isolation. This implies 2 things:
1. there is a metadata file in storage that holds the information described
in the rest of the spec.
2. there is an object in a catalog that holds the pointer to the metadata
file. What object and what catalog is implementation dependent, but these
generalized concepts are always intact.

So in my opinion, when we talk about Iceberg table/view/MV metadata, it
is the combination of all these 4 components:
1. the object in the catalog
2. the metadata file pointer in the object
3. the metadata file in storage
4. the metadata content in the metadata file
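
To make the 4 components and the atomic swap concrete, here is a rough
sketch in Python. Every name in it (CatalogObject, InMemoryCatalog, swap,
and so on) is made up for illustration and does not come from any Iceberg
library; a real catalog would back the swap with whatever atomic primitive
it has available.

```python
import threading
from dataclasses import dataclass


@dataclass
class CatalogObject:
    """Component 1: the object in the catalog (hypothetical shape)."""
    name: str
    metadata_location: str  # component 2: pointer to the metadata file


class InMemoryCatalog:
    """Toy catalog plus toy storage, only to illustrate the 4 components."""

    def __init__(self):
        self._objects = {}
        # Component 3: metadata files in storage, keyed by location.
        # Component 4: the metadata content held in each file.
        self._storage = {}
        self._lock = threading.Lock()

    def write_metadata_file(self, location, content):
        self._storage[location] = content

    def create(self, name, location):
        self._objects[name] = CatalogObject(name, location)

    def load(self, name):
        obj = self._objects[name]
        return obj.metadata_location, self._storage[obj.metadata_location]

    def swap(self, name, expected_location, new_location):
        """Compare-and-swap the pointer; fail if someone committed first."""
        with self._lock:
            obj = self._objects[name]
            if obj.metadata_location != expected_location:
                return False
            obj.metadata_location = new_location
            return True
```

A writer that loses the race gets False back from swap and has to reload
and retry against the new pointer, which is the optimistic concurrency
behavior the spec section describes.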

However, the REST spec technically removed the need for components 2 and 3.
The LoadTableResponse
<https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2721-L2728>
has an optional metadata-location field, which implies that the object in
the catalog no longer needs to hold a metadata file pointer, and the
metadata content does not necessarily need to live in a metadata file
anymore. Does that mean the REST spec actually contradicts the other specs?
This is where the whole definition of an Iceberg table/view becomes
ambiguous.
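
As a rough illustration of what the optional field means for a client,
here is a small Python sketch. The function name and the example payloads
are made up; the only things taken from the REST spec, as I read it, are
the "metadata" and "metadata-location" keys of the load table response.

```python
import json


def describe_load_table_response(body: str) -> str:
    """Report whether a load-table response is backed by a metadata file."""
    response = json.loads(body)
    # The metadata content (component 4) is always in the response itself.
    _metadata = response["metadata"]
    # The file pointer (component 2) may or may not be there.
    location = response.get("metadata-location")
    if location is None:
        return "content only, no metadata file pointer"
    return f"content backed by metadata file at {location}"
```

A client written this way works with both kinds of catalogs, but it can no
longer assume that a metadata file exists in storage.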

Let's forget about the REST side, and assume that the 4-component metadata
structure I described still holds. Then I think the latest proposal for the
"new metadata type" approach has already overcome the disadvantage you
described.

When we say an MV can be a "new metadata type", it does not mean it needs to
define a completely brand new structure for the metadata content (although
it could, and I initially listed that as an option). It can also be just a
different setup of metadata components 2 and 3. We can define an "Iceberg
MV" to be an object in a catalog that has 1 table metadata file pointer
and 1 view metadata file pointer. (This is basically what I meant by "there
is no MV spec" in the previous reply, because there is no new metadata
content, as we are re-using the table and view metadata content. But on
reflection, it is wrong to say there is no MV spec, because the pointer
structure is a part of the spec.)

And in this approach, all the arguments you listed for the new metadata
type still hold true, and it also "reuses existing metadata definitions"
and can "fall back to simple views".

-Jack

On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote:

> Thanks Ryan for the help to trace back to the root question! Just a
> clarification question regarding your reply before I reply further: what
> exactly does the option "a combination of the two (i.e. commits are
> combined)" mean? How is that different from "a new metadata type"?
>
> -Jack
>
>
>
>
> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I’m catching up on this conversation, so hopefully I can bring a fresh
>> perspective.
>>
>> Jack already pointed out that we need to start from the basics and I
>> agree with that. Let’s remove voting at this point. Right now is the time
>> for discussing trade-offs, not lining up and taking sides. I realize that
>> wasn’t the intent with adding a vote, but that’s almost always the result.
>> It’s too easy to use it as a stand-in for consensus and move on
>> prematurely. I get the impression from the swirl in Slack that discussion
>> has moved ahead of agreement.
>>
>> We’re still at the most basic question: is a materialized view a view and
>> a separate table, a combination of the two (i.e. commits are combined), or
>> a new metadata type?
>>
>> For now, I’m ignoring whether the “separate table” is some kind of
>> “system table” (meaning hidden?) or if it is exposed in the catalog. That’s
>> a later choice (already pointed out) and, I suspect, it should be delegated
>> to catalog implementations.
>>
>> To simplify this a little, I think that we can eliminate the option to
>> combine table and view commits. I don’t think there is a reason to combine
>> the two. If separate, a table would track the view version used along with
>> freshness information for referenced tables. If the table is automatically
>> skipped when the version no longer matches the view, then no action needs
>> to happen when a view definition changes. Similarly, the table can be
>> updated independently without needing to also swap view metadata. This also
>> aligns with the idea from the original doc that there can be multiple
>> materialization tables for a view. Each should operate independently unless
>> I’m missing something.
>>
>> I don’t think the last paragraph’s conclusion is contentious so I’ll move
>> on, but please stop here and reply if you disagree!
>>
>> That leaves the main two options, a view and a separate table linked by
>> metadata, or, combined materialized view metadata.
>>
>> As the doc notes, the separate view and table option is simpler because
>> it reuses existing metadata definitions and falls back to simple views.
>> That is a significantly smaller spec and small is very, very important when
>> it comes to specs. I think that the argument for a new definition of a
>> materialized view needs to overcome this disadvantage.
>>
>> The arguments that I see for a combined materialized view object are:
>>
>>    - Regular views are separate, rather than being tables with SQL and
>>    no data so it would be inconsistent (“Iceberg view is just a table with no
>>    data but with representations defined. But we did not do that.”)
>>    - Materialized views are different objects in DDL
>>    - Tables may be a superset of functionality needed for materialized
>>    views
>>    - Tables are not typically exposed to end users — but this isn’t
>>    required by the separate view and table option
>>
>> Am I missing any arguments for combined metadata?
>>
>> Ryan
>> --
>> Ryan Blue
>> Tabular
>>
>
