Thanks Igor. The PR has a suggestion for exactly what you suggested. I called it a "*warm*" state which is a state where stale materialization can still be used. https://github.com/apache/iceberg/pull/11041/files#r2474661166
I think if we continue with the assumption that MVs can only reference iceberg tables and views, then it makes sense for the max-staleness grace period to be dynamic based on snapshot history. This is what Trino does: https://trino.io/docs/current/connector/iceberg.html?utm_source=chatgpt.com#materialized-views If there are non-Iceberg tables in the view SQL, then the grace period will have to be based on last refresh which is also what Trino describes here: https://trino.io/docs/current/sql/create-materialized-view.html#mv-grace-period Should we call out both scenarios in the MV spec? I think this is worth being explicit here. Thanks On Tue, Nov 18, 2025 at 11:03 AM Igor Belianski <[email protected]> wrote: > Re: max-stalenss-ms interpretation > proposal: > A Materialized View(MNV) considered fresh if and only if the results > stored are equivalent to the those that would have been obtained by running > MV's defining query at some point in time within interval : > [CurrentTime-max-staleness-ms, Current_time] > > Note: this definition allows for optimization proposed by option 2 > (implementing which is definitely a great idea) , but doesn't mandate it. > One can also imagine some other optimization that would be possible given > definition above , and would be left up to the engines toi implement. > > > > > > On Tue, Nov 18, 2025 at 10:54 AM Steven Wu <[email protected]> wrote: > >> A reminder for tomorrow's community sync for the MV spec. >> https://calendar.app.google/T4zSk6qKWoy1vV6P7 >> >> We have one open question from the last meeting on how >> `max-stalenesss-ms` should be interpreted. You can find more details in the >> meeting notes. >> >> https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o >> >> Please also bring other topics that we should discuss. >> >> On Sat, Nov 1, 2025 at 10:14 PM Steven Wu <[email protected]> wrote: >> >>> Sorry for the delay. Here are the recording and meeting notes for the MV >>> sync meeting on Wednesday, Oct 29. >>> >>> https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o >>> >>> We have started to collect them in the above google doc. >>> >>> On Mon, Oct 27, 2025 at 8:58 AM Péter Váry <[email protected]> >>> wrote: >>> >>>> If we have materialized views (MVs) and support for incremental change >>>> scans, then by introducing a Java-based representation of the view, we can >>>> expose a scan API that always returns up-to-date results for the MV. >>>> >>>> The scan could include multiple tasks: >>>> >>>> - A task for reading the current version of the MV. >>>> - An incremental change log scan covering the range between the >>>> snapshot ID of the source table at the time the MV was last refreshed >>>> and >>>> its current snapshot ID. Applying the Java representation of the view >>>> when >>>> transformations are required. >>>> >>>> This approach allows us to build an always up-to-date index >>>> table/single source MV, using existing components. >>>> >>>> Benny Chow <[email protected]> ezt írta (időpont: 2025. okt. 24., P, >>>> 7:44): >>>> >>>>> Hi Peter >>>>> >>>>> I think the current proposal would support your example. In most >>>>> situations, replace table operations after a view is materialized wouldn’t >>>>> invalidate the materialization. However, if the view includes metadata >>>>> columns, then the replace operations should invalidate the >>>>> materialization. >>>>> >>>>> >>>>> This also brings up another important point that engines will differ >>>>> on what views can be materialized or not. For example, maybe metadata >>>>> columns are not allowed similar to non deterministic functions like >>>>> random. But some engines like Dremio may allow views that use current >>>>> date >>>>> functions. It should be possible for one engine to materialize a view and >>>>> another engine to look at the query tree and decide it’s not a view it >>>>> supports materializations on and choose not to use that materialization. >>>>> >>>>> Thanks >>>>> Benny >>>>> >>>>> >>>>> >>>>> On Oct 23, 2025, at 8:44 AM, Péter Váry <[email protected]> >>>>> wrote: >>>>> >>>>> >>>>> Hi All, >>>>> >>>>> I’ve been catching up on the discussion and wanted to share an >>>>> observation. One aspect that stands out to me in the proposed staleness >>>>> evaluation logic is that snapshots which don’t modify data can still >>>>> affect >>>>> the view’s contents if the view includes metadata columns. >>>>> >>>>> I was considering using a materialized view as an index for a given >>>>> table to accelerate the conversion of equality deletes to position >>>>> deletes. >>>>> For example, the query might look like: >>>>> >>>>> *SELECT _POS, _FILE, id FROM target_table* >>>>> >>>>> >>>>> During compaction, the materialized view would need to be refreshed to >>>>> ensure it reflects the correct data. >>>>> >>>>> Does this seem like a valid use case? Or should we explicitly exclude >>>>> scenarios like this? >>>>> >>>>> Thanks, >>>>> Peter >>>>> >>>>> Steven Wu <[email protected]> ezt írta (időpont: 2025. okt. 20., >>>>> H, 17:30): >>>>> >>>>>> Walaa, >>>>>> >>>>>> > while Option 2 is described in your summary as "giving engines >>>>>> *flexibility* to determine freshness recursively beyond a source >>>>>> MV", that *isn’t achievable* under the MV evaluation model itself. >>>>>> Because each MV treats upstream MVs as physical tables, recursion >>>>>> stops at the first materialized boundary; *deeper staleness cannot >>>>>> be discovered without switching to a logical-view evaluation model, i.e., >>>>>> stepping outside the MV model altogether (note that in Option 3 we can >>>>>> determine recursive staleness while still inside the MV model).* >>>>>> >>>>>> In option 2, when determining the freshness of mv_3, engines can >>>>>> choose to recursively evaluate the freshness of mv_1 and mv_2 since they >>>>>> are also MVs. But engines can also choose not to. >>>>>> >>>>>> > This means that there seems to be an implicit “Option 3”. This >>>>>> option treats MVs as logical views, i.e., storing only view versions + >>>>>> base >>>>>> table snapshot IDs (no MV storage snapshot IDs, no per-path lineage). >>>>>> >>>>>> In the new option 3 you described, how could the engine update mv3's >>>>>> refresh state for base table_a and table_b? unless all connected MVs are >>>>>> refreshed and committed in one single transaction, one entry per base >>>>>> table >>>>>> doesn't seem feasible. That's the main reason for option 1 to require the >>>>>> lineage path information in refresh state for base tables. >>>>>> >>>>>> It also seems that option 3 can only interpret freshness recursively, >>>>>> while today there are engines that support MVs without recursively >>>>>> evaluating source MVs. >>>>>> >>>>>> Thanks, >>>>>> Steven >>>>>> >>>>>> >>>>>> On Mon, Oct 20, 2025 at 1:44 AM Walaa Eldin Moustafa < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hi Steven, >>>>>>> >>>>>>> Thanks for organizing the series and summarizing the outcome. >>>>>>> >>>>>>> After re-reading the Option 1/2 proposal, initially I interpreted >>>>>>> Option 1 as simply expanding MVs like regular logical views. On closer >>>>>>> look, it is actually more complex. It also preserves per-path lineage >>>>>>> state >>>>>>> (e.g., multiple entries for the same base table via different parents), >>>>>>> which increases expressiveness but significantly increases metadata >>>>>>> complexity. So I agree it is not a practical option. >>>>>>> >>>>>>> This means that there seems to be an implicit “Option 3”. This >>>>>>> option treats MVs as logical views, i.e., storing only view versions + >>>>>>> base >>>>>>> table snapshot IDs (no MV storage snapshot IDs, no per-path lineage). >>>>>>> Under >>>>>>> this model, mv_3’s metadata might look like: >>>>>>> >>>>>>> Type Name Tracked State >>>>>>> ----- ------- ----------------------- >>>>>>> view mv_1 view_version_id >>>>>>> view mv_2 view_version_id >>>>>>> table table_a table_snapshot_id >>>>>>> table table_b table_snapshot_id >>>>>>> >>>>>>> This preserves logical semantics and aligns MV behavior with pure >>>>>>> views. >>>>>>> >>>>>>> *If we choose Option 2 (treat source MV as a materialized table), we >>>>>>> may have to be consider those constraints:* >>>>>>> >>>>>>> * Staleness only degrades up the chain. mv_1 and mv_2 may already be >>>>>>> stale relative to the base tables, but if mv_3 is refreshed using their >>>>>>> storage snapshots, then mv_3 will be marked as fresh under Option 2, >>>>>>> even >>>>>>> though all three MVs are stale relative to the base tables. >>>>>>> >>>>>>> * Engines can no longer discover staleness beyond mv_1. Once mv_3 >>>>>>> sees mv_1 (or mv_2) as fresh based only on their storage snapshots, it >>>>>>> will >>>>>>> not expand into mv_1 or mv_2 to check whether they are stale relative to >>>>>>> the base tables. >>>>>>> >>>>>>> * If mv_2 and mv_3 were purely logical views instead of MVs, they >>>>>>> would evaluate directly against base tables and return newer data. Under >>>>>>> Option 2, the same definitions but materialized upstream produce >>>>>>> different >>>>>>> data, not just different metadata. >>>>>>> >>>>>>> Therefore, while Option 2 is described in your summary as "giving >>>>>>> engines *flexibility* to determine freshness recursively beyond a >>>>>>> source MV", that *isn’t achievable* under the MV evaluation model >>>>>>> itself. >>>>>>> Because each MV treats upstream MVs as physical tables, recursion >>>>>>> stops at the first materialized boundary; *deeper staleness cannot >>>>>>> be discovered without switching to a logical-view evaluation model, >>>>>>> i.e., >>>>>>> stepping outside the MV model altogether (note that in Option 3 we can >>>>>>> determine recursive staleness while still inside the MV model).* >>>>>>> >>>>>>> Let me know your thoughts. I slightly prefer Option 3. I’m also fine >>>>>>> with Option 2, but I don’t think the flexibility to recursively >>>>>>> determine >>>>>>> freshness actually exists under its evaluation model. Not sure if this >>>>>>> changes anyone’s view, but I wanted to clarify how I’m reading it. >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> >>>>>>> On Wed, Oct 8, 2025 at 11:11 PM Benny Chow <[email protected]> wrote: >>>>>>> >>>>>>>> I just listened to the recording. I'm the tech lead for MVs at >>>>>>>> Dremio and responsible for both refresh management and query rewrites >>>>>>>> with >>>>>>>> MVs. >>>>>>>> >>>>>>>> It's great that we seem to agree that Iceberg MV spec won't require >>>>>>>> that MVs always be up to date in order to be usable for query rewrites. >>>>>>>> There can be many data consistency issues (as Dan pointed out) but >>>>>>>> that is >>>>>>>> the state of affairs today. >>>>>>>> >>>>>>>> It sounds like we are converging on the following scenarios for an >>>>>>>> engine to validate the MV freshness: >>>>>>>> >>>>>>>> 1. Use storage table without any validation. This might be the >>>>>>>> extreme "async MV" example. >>>>>>>> 2. Ignore storage table even if one exists because SQL command or >>>>>>>> use case requires that. >>>>>>>> 3. Use storage table only if data is not more than x hours old. >>>>>>>> This can be achieved with the proposed refresh-start-timestamp-ms which >>>>>>>> is currently in the proposed spec. For this to work with MVs built on >>>>>>>> MVs, >>>>>>>> we should probably state in the spec that if a MV is built on another >>>>>>>> MV, >>>>>>>> then it needs to inherit the refresh-start-timestamp-ms of the child >>>>>>>> MV. >>>>>>>> In Steven's example, when building mv3, refresh-start-timestamp-ms >>>>>>>> needs to >>>>>>>> be set to the minimum of mv1 or mv2's refresh-start-timestamp-ms. If >>>>>>>> this >>>>>>>> property name is confusing, we can rename it to >>>>>>>> "refresh-earliest-table-timestamp-ms". I originally proposed this >>>>>>>> property >>>>>>>> and also listed out other benefits here: >>>>>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1779797796 >>>>>>>> Also, at the time, MVs built on MVs weren't being considered. Now >>>>>>>> that it >>>>>>>> is, I would recommend we have both "refresh-start-timestamp-ms" (when >>>>>>>> the >>>>>>>> refresh was started on the storage table) and >>>>>>>> "refresh-earliest-table-timestamp-ms" (used for freshness validation). >>>>>>>> 4. Don't use the storage table if it is older than X hours. This >>>>>>>> is what I had originally proposed for the >>>>>>>> *materialization.max-stalessness-ms* view property here: >>>>>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1744837644 >>>>>>>> It wasn't meant to validate the freshness but more to prevent use of a >>>>>>>> materialization after some criteria. >>>>>>>> 5. Use storage table if recursive validation passes... i.e. >>>>>>>> refresh-state matches the current expanded query tree state. This is >>>>>>>> what >>>>>>>> I think Steven is calling the "synchronous MV". >>>>>>>> >>>>>>>> For scenario 1-4, it would support the nice use case of an Iceberg >>>>>>>> client using a view's data through the storage table without needing to >>>>>>>> know how to parse/validate/expand any view SQLs. >>>>>>>> >>>>>>>> In Dremio's planner, we primarily use scenario 1 and 4 together to >>>>>>>> determine MV validity for query rewrite. Scenario 2 and 5 also apply >>>>>>>> in >>>>>>>> certain situations. For scenario 3, Dremio only exposes the >>>>>>>> "refresh-earliest-table-timestamp-ms" as an fyi to the user but it >>>>>>>> would be >>>>>>>> interesting to allow the user to set this time so that they could run >>>>>>>> queries and be 100% certain that they were not seeing data older than x >>>>>>>> hours. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Benny >>>>>>>> >>>>>>>> On Wed, Oct 8, 2025 at 3:37 PM Steven Wu <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> correction for a typo. >>>>>>>>> >>>>>>>>> Prashanth brought up another scenario of compaction/rewrite where >>>>>>>>> a new snapshot was added *with* actual data change >>>>>>>>> --> >>>>>>>>> Prashanth brought up another scenario of compaction/rewrite where >>>>>>>>> a new snapshot was added *without* actual data change >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Oct 8, 2025 at 2:12 PM Steven Wu <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> Thanks everyone for joining the MV discussion meeting. We will >>>>>>>>>> continue to have the recurring sync meeting on Wednesday 9 am >>>>>>>>>> (Pacific) >>>>>>>>>> every 3 weeks until we get to the finish line where Jan's MV spec PR >>>>>>>>>> [1] is >>>>>>>>>> merged. I have scheduled our next meeting on Oct 29 in the Iceberg >>>>>>>>>> dev >>>>>>>>>> events calendar. >>>>>>>>>> >>>>>>>>>> Here is the video recording for today's meeting. >>>>>>>>>> >>>>>>>>>> https://drive.google.com/file/d/1-nfhBPDWLoAFDu5cKP0rwLd_30HB6byR/view?usp=sharing >>>>>>>>>> >>>>>>>>>> We mostly discussed freshness evaluation. Here is the meeting >>>>>>>>>> summary. >>>>>>>>>> >>>>>>>>>> 1. For tracking the refresh state for the source MV [2], the >>>>>>>>>> consensus is option 2 (treating source MV as a materialized >>>>>>>>>> table) which >>>>>>>>>> would give engines the flexibility on freshness determination >>>>>>>>>> (recursive >>>>>>>>>> beyond source MV or not). >>>>>>>>>> 2. Earlier design doc [3] discussed max staleness config. But >>>>>>>>>> it wasn't reflected in the spec PR. The general opinion is to add >>>>>>>>>> the >>>>>>>>>> config to the spec PR. The open question is whether the ` >>>>>>>>>> materialization.max-staleness-ms` config should be added to >>>>>>>>>> the view metadata or the storage table metadata. Either can work. >>>>>>>>>> We just >>>>>>>>>> need to decide which makes a little better fit. >>>>>>>>>> 3. Prashanth brought up schema change with default value and >>>>>>>>>> how it may affect the MV refresh state (for SQL representation >>>>>>>>>> with select >>>>>>>>>> *). Jan mentioned that snapshot contains schema id when the >>>>>>>>>> snapshot was >>>>>>>>>> created. Engine can compare the snapshot schema id to the source >>>>>>>>>> table >>>>>>>>>> schema id during freshness evaluation. There is no need for >>>>>>>>>> additional >>>>>>>>>> schema info in refresh-state tracking in the storage table. >>>>>>>>>> 4. Prashanth brought up another scenario of >>>>>>>>>> compaction/rewrite where a new snapshot was added with actual >>>>>>>>>> data change. >>>>>>>>>> The general take is that the engine can optimize and decide that >>>>>>>>>> MV is >>>>>>>>>> fresh as the new snapshot doesn't have any data change. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> We can add some clarifications in the spec PR for freshness >>>>>>>>>> evaluation based on the above discussions. >>>>>>>>>> >>>>>>>>>> [1] https://github.com/apache/iceberg/pull/11041 >>>>>>>>>> [2] >>>>>>>>>> https://docs.google.com/document/d/1_StBW5hCQhumhIvgbdsHjyW0ED3dWMkjtNzyPp9Sfr8/edit?tab=t.0 >>>>>>>>>> [3] >>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?tab=t.0#heading=h.3wigecex0zls >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Sep 25, 2025 at 9:27 AM Steven Wu <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi all, >>>>>>>>>>> >>>>>>>>>>> Iceberg materialized view has been discussed in the community >>>>>>>>>>> for a long time. Thanks Jan Kaul for driving the discussion and the >>>>>>>>>>> spec >>>>>>>>>>> PR. It has been stalled for a long time due to lack of consensus on >>>>>>>>>>> 1 or 2 >>>>>>>>>>> topics. In Wed's Iceberg community sync meeting, Talat brought up >>>>>>>>>>> the >>>>>>>>>>> question on how to move forward and if we can have a dedicated >>>>>>>>>>> meeting for >>>>>>>>>>> MV. >>>>>>>>>>> >>>>>>>>>>> I have set up a meeting on *Oct 8 (9-10 am Pacific)*. If you >>>>>>>>>>> subscribe to the "Iceberg Dev Events" calendar, you should be >>>>>>>>>>> able to see it. If not, here is the link: >>>>>>>>>>> https://meet.google.com/nfe-guyq-pqf >>>>>>>>>>> >>>>>>>>>>> We are going to discuss >>>>>>>>>>> * remaining open questions >>>>>>>>>>> * unresolved concerns >>>>>>>>>>> * the next step and hopefully some consensus on moving forward >>>>>>>>>>> >>>>>>>>>>> MV spec PR is up to date. Jan has incorporated recent feedback. >>>>>>>>>>> This should be the base of the discussion. >>>>>>>>>>> https://github.com/apache/iceberg/pull/11041 >>>>>>>>>>> <https://www.google.com/url?q=https://github.com/apache/iceberg/pull/11041&sa=D&source=calendar&usd=2&usg=AOvVaw3w0TjRpwbC17AGzmxZmElM> >>>>>>>>>>> >>>>>>>>>>> Dev discussion thread (a long-running thread started by Jan). >>>>>>>>>>> https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf >>>>>>>>>>> <https://www.google.com/url?q=https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf&sa=D&source=calendar&usd=2&usg=AOvVaw0fotlsrnRBOb820mA5JRyB> >>>>>>>>>>> >>>>>>>>>>> The mail archive has broken lineage and doesn't show all >>>>>>>>>>> replies. Email subject is "*[DISCUSS] Iceberg >>>>>>>>>>> Materialzied Views*". >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Steven >>>>>>>>>>> >>>>>>>>>>>
