Re: Materialized view integration with REST spec

Walaa Eldin Moustafa Tue, 20 Feb 2024 12:17:10 -0800

I would vote to keep a log in the doc with open questions, and keep the doc
updated with open questions as they arise/get resolved.


On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <yezhao...@gmail.com> wrote:

> Thanks for the response from everyone!
>
> Before proceeding further, I see a few people referring back to the
> current design from Jan. I specifically raised this thread based on the
> information in the doc and a few latest discussions we had there. Because
> there are many threads in the doc, and each thread points further to other
> discussion threads in the same doc or other doc, it is now quite hard to
> follow and continue discussing all different topics there.
>
> I hope we can make incremental consensus of the questions in the doc
> through devlist, because it provides more visibility, and also a single
> thread instead of multiple threads going on at the same time. If we think
> this format is not effective, I propose that we create a new mv channel in
> Iceberg Slack workspace, and people interested can join and discuss all
> these points directly. What do we think?
>
> Best,
> Jack Ye
>
>
>
> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi,
>>
>> Great to see more discussion on the MV spec.  Actually, Jan's document 
>> "Iceberg
>> Materialized View Spec"
>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
>>  has
>> been organized , with a "Design Questions" section to track these debates,
>> and it would be nice to centralize the debates there, as Micah mentions.
>>
>> For Dan's question, I think this debate was tracked in "Design Question
>> 3: Should the storage table be registered in the catalog?". I think the
>> general idea there was to not expose it directly via Catalog as it is then
>> exposed to user modification. If the engine wants to access anything about
>> the storage table (including audit and storage), it is of course there via
>> the storage table pointer. I think Walaa's point is also good, we could
>> expose it as we expose metadata tables, but I am still not sure if there is
>> still some use-cases of engine access not covered?
>>
>> It is true that for Jack's initial question (Do we really want to go with
>> the MV = view + storage table design approach for Iceberg MV?),
>> unfortunately we did not capture it as a "Design Question" in Jan's doc, as
>> it was an implicit assumption of 'yes', because it is the choice of Hive,
>> Trino, and other engines , as others have pointed out.
>>
>> Jack's point about potential evolution of MV (like to add partitioning)
>> is an interesting one, but definitely hard to grasp.  I think it makes
>> sense to add this as a separate Design Question in the doc, and add the
>> options.  This will allow us to flesh out this alternative
>> option(s).  Maybe Micah's point about modifying existing proposal to
>> 'embed' the required table metadata fields in the existing view metadata,
>> is one middle ground option.  Or we add a totally new MV object spec for
>> MV, separate than existing View spec?
>>
>> Also , as Jack pointed out, it may make sense to have the REST / Catalog
>> API proposal in the doc to educate the above decision.
>>
>> Thanks
>> Szehon
>>
>> On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> I think it would help if we answer the question of whether an MV is a
>>> view + storage table (and degree of exposing this underlying
>>> implementation) in the context of the user interfacing with those concepts:
>>>
>>> For the end user, interfacing with the engine APIs (e.g., through SQL),
>>> materialized view APIs should be almost the same as regular view APIs
>>> (except for operations specific to materialized views like REFRESH command
>>> etc). Typically, the end user interacts with the (materialized) view object
>>> as a view, and the engine performs the abstraction over the storage table.
>>>
>>> For the engines interfacing with Iceberg, it sounds the correct
>>> abstraction at this layer is indeed view + storage table, and engines could
>>> have access to both objects to optimize queries.
>>>
>>> So in a sense, the engine will ultimately hide most of the
>>> storage detail from the end user (except for advanced users who want to
>>> explicitly access the storage table with a modifier like
>>> "db.view.storageTable" -- and they can only read it), while Iceberg will
>>> expose the storage details to the engine catalog to use it in scans if
>>> needed. So the storage table is hidden or exposed based on the context/the
>>> actual users. From Iceberg point of view (which interacts with the
>>> engines), the storage table is exposed. Note that this does not
>>> necessarily mean that the storage table is registered in the catalog with
>>> its own independent name (e.g., where we can drop the view but keep the
>>> storage table and access it from the catalog). Addressing the storage table
>>> using a virtual namespace like "db.view.storageTable" sounds like a good
>>> middle ground. Anyways, end users should not need to directly access the
>>> storage table in most cases.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jack,
>>>>
>>>>
>>>>> In my mind, the first key point we all need to agree upon to move this
>>>>> design forward is*: Do we really want to go with the MV = view +
>>>>> storage table design approach for Iceberg MV?*
>>>>
>>>>
>>>> I think we want this to the extent that we do not want to redefine the
>>>> same concept with different representations/naming to the greatest degree
>>>> possible.  This is why borrowing the concepts from the view (e.g. multiple
>>>> ways of expressing the same view logic in different dialects) and aspects
>>>> of the materialized data (e.g. partitioning, ordering) feels most natural.
>>>> IIUC your proposal, I think you are saying maybe two modifications to the
>>>> existing proposals in the document:
>>>>
>>>> 1.  No separate storage table link, instead embed most of the metadata
>>>> of the materialized table into the MV document (the exception seems to be
>>>> snapshot history)
>>>> 2.  For snapshot history, have one unified history specific to the MV.
>>>>
>>>> This seems fairly reasonable to me and I think I can solve some
>>>> challenges with the existing proposal in an elegant way.  If this is
>>>> correct (or maybe if it isn't quite correct) perhaps you can make
>>>> suggestions to the document so all of the trade-offs can be discussed in
>>>> one place?
>>>>
>>>> I think the one thing the current draft of the materialized view
>>>> ignores is how to store algebraic summaries (e.g. separate sum and count
>>>> for averages, or other sketches), so that new data can be incrementally
>>>> incorporated.  But representing these structures feels like it potentially
>>>> has value beyond just MVs (e.g. it can be a natural way to express summary
>>>> statistics in table metadata), so I think it deserves at least a try in
>>>> incorporating the concepts in the table specification, so the definitions
>>>> can be shared.  I was imagining this could come as part of the next
>>>> revision of MV specification.
>>>>
>>>> The MV internal structure could evolve in a way that works more
>>>>> efficiently with the reduced scope of functionalities, without relying on
>>>>> table to offer the same capabilities. I can at least say that is true 
>>>>> based
>>>>> on my internal knowledge of how Redshift MVs work.
>>>>
>>>>
>>>> I'm not sure I fully understand this point, but it seems the main
>>>> question here is what would break if it started to evolve in this
>>>> direction.  Is it purely additive or do we suspect some elements would need
>>>> to be removed?  My gut feeling here is the main concerns here are  getting
>>>> the cardinatities correct (i.e. 1 MV should probably have 0, 1 or more
>>>> materialized storage tables associated with it, to support more advanced
>>>> algebraic structures listed above, and perhaps a second without them, and
>>>> additional metadata to distinguish between these two different modes).
>>>>
>>>> If after the evaluation, we are confident that the MV = view + storage
>>>>> table approach is the right way to go, then we can debate the other 
>>>>> issues,
>>>>> and I think the next issue to reach consensus should be "Should the 
>>>>> storage
>>>>> table be registered in the catalog?".
>>>>
>>>>
>>>> I actually think there are actually more fundamental questions posed:
>>>> 1.  Should be considering how items should be modelled in the REST API
>>>> concurrently with the Iceberg spec, as that potentially impacts design
>>>> decision (I think the answer is yes, and we should update the doc with
>>>> sketches on new endpoints and operations on the endpoints to ensure things
>>>> align).
>>>> 2.  Going forward should new aspects of Iceberg artifacts rely on the
>>>> fact that a catalog is present and we can rely on a naming convention for
>>>> looking up other artifacts in a catalog as pointers (I lean yes on this,
>>>> but I'm a little bit more ambivalent).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Mon, Feb 19, 2024 at 12:52 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> I suggest we need a step-by-step process to make incremental
>>>>> consensus, otherwise we are constantly talking about many different 
>>>>> debates
>>>>> at the same time.
>>>>>
>>>>> In my mind, the first key point we all need to agree upon to move this
>>>>> design forward is*: Do we really want to go with the MV = view +
>>>>> storage table design approach for Iceberg MV?*
>>>>>
>>>>> I think we (at least me) started with this assumption, mostly because
>>>>> this is how Trino implements MV, and how Hive tables store MV information
>>>>> today. But does it mean we should design it that way in Iceberg?
>>>>>
>>>>> Now I look back at how we did the view spec design, we could also say
>>>>> that we just add a representation field in the table spec to store view,
>>>>> and an Iceberg view is just a table with no data but with representations
>>>>> defined. But we did not do that. So it feels now quite inconsistent to say
>>>>> we want to just add a few fields in the table and view spec to call it an
>>>>> Iceberg MV.
>>>>>
>>>>> If we look into most of the other database systems (e.g. Redshift,
>>>>> BigQuery, Snowflake), they never expose such implementation details like
>>>>> storage table. Apart from being close-sourced systems, I think it is also
>>>>> for good technical reasons. There are many more things that a table needs
>>>>> to support, but does not really apply to MV. The MV internal structure
>>>>> could evolve in a way that works more efficiently with the reduced scope 
>>>>> of
>>>>> functionalities, without relying on table to offer the same capabilities. 
>>>>> I
>>>>> can at least say that is true based on my internal knowledge of how
>>>>> Redshift MVs work.
>>>>>
>>>>> I think we should fully evaluate both directions, and commit to one
>>>>> first before debating more things.
>>>>>
>>>>> If we have a new and independent Iceberg MV spec, then an Iceberg MV
>>>>> is under-the-hood a single object containing all MV information. It has 
>>>>> its
>>>>> own name, snapshots, view representation, etc. I don't believe we will be
>>>>> blocked by Trino due to its MV SPIs currently requiring the existence of a
>>>>> storage table, as it will just be a different implementation from the
>>>>> existing one in Trino-Iceberg. In this direction, I don't think we need to
>>>>> have any further debate about pointers, metadata locations, storage table,
>>>>> etc. because everything will be new.
>>>>>
>>>>> If after the evaluation, we are confident that the MV = view + storage
>>>>> table approach is the right way to go, then we can debate the other 
>>>>> issues,
>>>>> and I think the next issue to reach consensus should be "Should the 
>>>>> storage
>>>>> table be registered in the catalog?".
>>>>>
>>>>> What do we think?
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <dwe...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Jack,
>>>>>>
>>>>>> I think we should consider either allowing the storage table to be
>>>>>> fully exposed/addressable via the catalog or allow access via namespacing
>>>>>> like with metadata tables.  E.g. <catalog>.<database>.<table>.<storage>,
>>>>>> which would allow for full access to the underlying table.
>>>>>>
>>>>>> For other engines to interact with the storage table (e.g. to execute
>>>>>> the query to materialize the table), it may be necessary that the table 
>>>>>> is
>>>>>> fully addressable.  Whether the storage table is returned as part of list
>>>>>> operations may be something we leave up to the catalog implementation.
>>>>>>
>>>>>> I don't think the table should reference a physical location (only a
>>>>>> logical reference) since things will be changing behind the view 
>>>>>> definition
>>>>>> and I'm not confident we want to have to update the view representation
>>>>>> everytime the storage table is updated.
>>>>>>
>>>>>> I think there's still some exploration as to whether we need to model
>>>>>> this as separate from view endpoints, but there may be enough overlap 
>>>>>> that
>>>>>> it's not necessary to have yet another set of endpoints for materialized
>>>>>> views (maybe filter params if you need to distinguish?).
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu <liurenjie2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, Jack:
>>>>>>>
>>>>>>> Thanks for raising this.
>>>>>>>
>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>> independent objects, at least at API level. It is very rare for a 
>>>>>>>> system to
>>>>>>>> support operations like "materializing a logical view" or "upgrading a
>>>>>>>> logical view to MV", because view and MV are very different in almost 
>>>>>>>> every
>>>>>>>> aspect of user experience. Extending the existing view or table spec to
>>>>>>>> accommodate MV might give us a MV implementation similar to the current
>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, but it
>>>>>>>> binds us to a very specific design of MV, which we might regret in the
>>>>>>>> future.
>>>>>>>
>>>>>>>
>>>>>>> When I reviewed the doc, I thought we were discussing the spec of
>>>>>>> materialized view, just like the spec of table metadata, but didn't not 
>>>>>>> the
>>>>>>> user facing api. I would definitely agree that we should consider MV as
>>>>>>> another kind of database object in user facing api, even though it's
>>>>>>> internally modelled as a view + storage table pointer.
>>>>>>>
>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>> should at least consider directly describing the full metadata of the
>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file.
>>>>>>>
>>>>>>>
>>>>>>> Do you mean we need to add components like
>>>>>>> `LoadMaterializedViewResponse`, if so, I would +1 for this.
>>>>>>>
>>>>>>> *Q2: what REST APIs do we expect to use for interactions with MVs?*
>>>>>>>
>>>>>>>
>>>>>>> As I have mentioned above,  I think we should consider MV as another
>>>>>>> database object, so I think we should add a set of apis specifically
>>>>>>> designed for MV, such as `loadMV`, `freshMV`.
>>>>>>>
>>>>>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <yezhao...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> As we are discussing the spec change for materialized view, there
>>>>>>>> has been a missing aspect that is technically also related, and might
>>>>>>>> affect the MV spec design: *how do we want to add MV support to
>>>>>>>> the REST spec?*
>>>>>>>>
>>>>>>>> I would like to discuss this in a new thread to collect people's
>>>>>>>> thoughts. This topic expands to the following 2 sub-questions:
>>>>>>>>
>>>>>>>> *Q1: how would the MV spec change affect the REST spec?*
>>>>>>>> In the current proposal, it looks like we are using a design where
>>>>>>>> a MV is modeled as an Iceberg view linking to an Iceberg storage 
>>>>>>>> table. At
>>>>>>>> the same time, we do not want to expose this storage table in the 
>>>>>>>> catalog,
>>>>>>>> thus the Iceberg view has a pointer to only a metadata JSON file of the
>>>>>>>> Iceberg storage table. Each MV refresh updates the pointer to a new
>>>>>>>> metadata JSON file.
>>>>>>>>
>>>>>>>> I feel this does not play very well with the direction that REST is
>>>>>>>> going. The REST catalog is trying to remove the dependency to the 
>>>>>>>> metadata
>>>>>>>> JSON file. For example, in LoadTableResponse the only required field 
>>>>>>>> is the
>>>>>>>> metadata, and metadata-location is actually optional.
>>>>>>>>
>>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>> should at least consider directly describing the full metadata of the
>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file.
>>>>>>>>
>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with MVs?*
>>>>>>>> So far we have been thinking about amending the view spec to
>>>>>>>> accommodate MV. This entails likely having MVs also being handled 
>>>>>>>> through
>>>>>>>> the view APIs in REST spec.
>>>>>>>>
>>>>>>>> We need to agree with that first in the community, because this has
>>>>>>>> various implications, and I am not really sure at this point if it is 
>>>>>>>> the
>>>>>>>> best way to go.
>>>>>>>>
>>>>>>>> If MV interactions are through the view APIs, the view APIs need to
>>>>>>>> be updated to accommodate MV constructs that are not really related to
>>>>>>>> logical views. In fact, most actions performed on MVs are more similar 
>>>>>>>> to
>>>>>>>> actions performed on table rather than view, which involve configuring 
>>>>>>>> data
>>>>>>>> layout, read and write constructs. For example, users might run 
>>>>>>>> something
>>>>>>>> like:
>>>>>>>>
>>>>>>>> CREATE MATERIALIZED VIEW mv
>>>>>>>> PARTITION BY col1
>>>>>>>> CLUSTER BY col2
>>>>>>>> AS ( // some sql )
>>>>>>>>
>>>>>>>> then the CreateView API needs to accept partition spec and sort
>>>>>>>> order that are completely not relevant for logical views.
>>>>>>>>
>>>>>>>> When reading a MV, we might even want to have a
>>>>>>>> PlanMaterializedView API similar to the PlanTable API we are adding.
>>>>>>>>
>>>>>>>> *My personal take*
>>>>>>>> It feels like we need to reconsider the question of what is the
>>>>>>>> best way to model MV in Iceberg. Should it be (1) a view linked to a
>>>>>>>> storage table, or (2) a table with a view SQL associated with it, or 
>>>>>>>> (3)
>>>>>>>> it's a completely independent thing. This topic was discussed in the 
>>>>>>>> past in
>>>>>>>> this doc
>>>>>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
>>>>>>>> but at that time we did not have much perspective about aspects like 
>>>>>>>> REST
>>>>>>>> spec, and the view integration was also not fully completed yet. With 
>>>>>>>> the
>>>>>>>> new knowledge, currently I am actually leaning a bit more towards (3).
>>>>>>>>
>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>> independent objects, at least at API level. It is very rare for a 
>>>>>>>> system to
>>>>>>>> support operations like "materializing a logical view" or "upgrading a
>>>>>>>> logical view to MV", because view and MV are very different in almost 
>>>>>>>> every
>>>>>>>> aspect of user experience. Extending the existing view or table spec to
>>>>>>>> accommodate MV might give us a MV implementation similar to the current
>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, but it
>>>>>>>> binds us to a very specific design of MV, which we might regret in the
>>>>>>>> future.
>>>>>>>>
>>>>>>>> If we make a new MV spec, it can be made up of fields that already
>>>>>>>> exist in the table and view specs, but it is a whole new spec. In this 
>>>>>>>> way,
>>>>>>>> the spec can evolve independently to accommodate MV specific features, 
>>>>>>>> and
>>>>>>>> we can also create MV-related REST endpoints that will evolve 
>>>>>>>> independently
>>>>>>>> from table and view REST APIs.
>>>>>>>>
>>>>>>>> But on the other side it is definitely associated with more work to
>>>>>>>> maintain a new spec, and potentially big refactoring in the codebase to
>>>>>>>> make sure operations today that work on table or view can now support 
>>>>>>>> MV as
>>>>>>>> a different object. And it definitely has other problems that I have
>>>>>>>> overlooked. I would greatly appreciate any thoughts about this!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>

Re: Materialized view integration with REST spec

Reply via email to