>> Benny: Are you suggesting that the source-table-states should only capture the leaf table nodes in the MV dependency pipeline? Yes. But to be clear with an example, suppose you have a MV like:
CREATE MV MV1 as SELECT * FROM T1 UNION ALL SELECT * FROM V1 And suppose V1 was defined as CREATE MV V1 as SELECT * FROM T1 --- Yes, T1 again to make this example interesting. Then, I'm saying that the source-table-state for MV1 is going to somehow combine the first T1 with the source table state from V1. On Fri, Dec 5, 2025 at 1:25 PM Igor Belianski <[email protected]> wrote: > Hi Benny and Walaa, > > Could you please clarify the following statement from Benny's last email: > > "We should avoid the need for consumers to expand nested MVs. I think the > producer should be combining the refresh states of all the nested MVs it > uses into two flat lists of source views and source tables. These source > tables can't contain storage tables." > > Benny: Are you suggesting that the source-table-states should only capture > the leaf table nodes in the MV dependency pipeline? > > Walaa: If we completely enumerate all source tables recursively, we must > consider the scenario of shared upstream tables (the "diamond pattern"). > Specifically: > > 1. Do we allow duplicate table entries in the list (e.g., if the same > table was used at different snapshots across different refresh path > traversals)? > 2. Would we need to include the path in the refresh state entry to make > this data interpretable? > > If we pursue the option of listing everything in the tree, we should > choose between: > > - A) Permissive: Allow duplicate table entries, treating the list as a > client hint for tables to check. This leaves it up to engines to > disambiguate or skip entries, and the list may not be strictly exhaustive. > - B) Prescriptive: Establish an exactly defined meaning for each entry, > mandating clear rules for aggregation. > > I am highly hesitant to mandate option B ( it would obviously be too > prescriptive for most engines). > > Thanks, > Igor > > On Thu, Dec 4, 2025 at 9:28 PM Benny Chow <[email protected]> wrote: > >> I agree with Walaa. In the last sync, we talked about smart producers >> and dumb consumers. We should avoid the need for consumers to expand >> nested MVs. I think the producer should be combining the refresh states of >> all the nested MVs it uses into two flat lists of source views and source >> tables. These source tables can't contain storage tables. When planning >> the refresh job, the producer can choose to use the nested MV's storage >> table or not and the refresh state needs to reflect this decision >> accordingly. >> >> There's also a somewhat corner case to consider. It is completely >> possible for a source table to show up in a materialization at different >> snapshots. In this scenario, it's up to the producer to decide whether to >> allow this or not or maybe just record the earliest snapshot. These >> scenarios are inevitable when you get MVs built on MVs built on MVs such as >> in ETL scenarios. >> >> For max-staleness, I think it should strictly apply to only source tables >> and not storage tables. For the ETL pipeline use case, the consumer is >> probably not going to care about this property. >> >> Thanks >> >> On Thu, Dec 4, 2025 at 5:37 PM Walaa Eldin Moustafa < >> [email protected]> wrote: >> >>> I think this creates significant friction for engine implementations >>> (and contradicts some of the principles we established earlier): >>> >>> * When the engine sees the tables backing mv_3, the spec provides no >>> built-in way to distinguish MV storage tables from true physical tables. >>> The engine must always perform an external lookup to determine whether a >>> “table” is really an MV. >>> >>> * Freshness evaluation becomes inconsistent: nested logical views >>> require only a one-shot leaf-table comparison, while nested MVs require >>> recursive traversal because their refresh-state does not contain leaf >>> snapshots. >>> >>> * Even if an engine uses the MV definition to detect deeper staleness, >>> it cannot refresh the MV to a consistent base-table state. Option 2 refresh >>> semantics stop at the immediate MV boundary, so recursion for freshness >>> does not align with recursion for refresh. >>> >>> For these reasons, “allowing recursive expansion” is not practically >>> usable. It introduces complexity without providing coherent semantics. >>> >>> In summary, treating MVs either as views or as tables yields a >>> consistent model, but the optionality implied in Option 2 is misleading. >>> The metadata does not support cleanly mixing the two modes. >>> >>> Thanks, >>> Walaa. >>> >>> On Thu, Dec 4, 2025 at 11:44 AM Steven Wu <[email protected]> wrote: >>> >>>> Walaa, >>>> >>>> We are saying the `refres-state` only has a source view state for the >>>> source MV and a source table state for the source MV's storage table. It >>>> would allow both evaluation strategies (recursive or not) >>>> * non-recursive: as long as the MV refresh state is aligned with the >>>> source MV's storage table (with max staleness config), it is fresh. This >>>> semantic matches many ETL pipeline use cases. >>>> * recursive: If an engine wants to enforce stronger freshness >>>> semantics, it can recursively evaluate if source mv_1 and mv_2 themselves >>>> are fresh. The current spec wording mentioned this is allowed: "query >>>> engines may recursively expand the query tree to determine freshness". >>>> >>>> We wants the spec definition to be flexible enough to support both use >>>> cases. >>>> >>>> Thanks, >>>> Steven >>>> >>>> >>>> On Wed, Dec 3, 2025 at 5:30 PM Walaa Eldin Moustafa < >>>> [email protected]> wrote: >>>> >>>>> Hi Steven, >>>>> >>>>> > In option 2, when determining the freshness of mv_3, engines can >>>>> choose to recursively evaluate the freshness of mv_1 and mv_2 since they >>>>> are also MVs. But engines can also choose not to. >>>>> >>>>> Does not "evaluating freshness of mv_1 and mv_2" mean that engines >>>>> consider mv_1 and mv_2 as views? "Tables" do not have freshness. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Tue, Nov 25, 2025 at 11:49 PM Jan Kaul via dev < >>>>> [email protected]> wrote: >>>>> >>>>>> Thank you Steven, >>>>>> >>>>>> I've included the "max-staleness" in the PR. Please have a look and >>>>>> give feedback on the phrasing. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Jan >>>>>> On 11/19/25 22:37, Steven Wu wrote: >>>>>> >>>>>> Thanks everyone for joining today's sync. We had a good discussion on >>>>>> how to interpret the "max staleness" config. >>>>>> >>>>>> You can find the meeting notes here. >>>>>> >>>>>> https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.eho7jgm13usg >>>>>> >>>>>> Recording is also linked in the doc (thanks Kevin). >>>>>> >>>>>> For the next step, maybe we can collaborate on the MV spec PR to >>>>>> flush the exact wording for staleness config and semantic. >>>>>> https://github.com/apache/iceberg/pull/11041/files >>>>>> >>>>>> >>>>>> On Tue, Nov 18, 2025 at 1:05 PM Benny Chow <[email protected]> wrote: >>>>>> >>>>>>> Thanks Igor. The PR has a suggestion for exactly what you >>>>>>> suggested. I called it a "*warm*" state which is a state where >>>>>>> stale materialization can still be used. >>>>>>> https://github.com/apache/iceberg/pull/11041/files#r2474661166 >>>>>>> >>>>>>> I think if we continue with the assumption that MVs can only >>>>>>> reference iceberg tables and views, then it makes sense for the >>>>>>> max-staleness grace period to be dynamic based on snapshot history. >>>>>>> This >>>>>>> is what Trino does: >>>>>>> https://trino.io/docs/current/connector/iceberg.html?utm_source=chatgpt.com#materialized-views >>>>>>> >>>>>>> If there are non-Iceberg tables in the view SQL, then the grace >>>>>>> period will have to be based on last refresh which is also what Trino >>>>>>> describes here: >>>>>>> https://trino.io/docs/current/sql/create-materialized-view.html#mv-grace-period >>>>>>> >>>>>>> Should we call out both scenarios in the MV spec? I think this is >>>>>>> worth being explicit here. >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> >>>>>>> On Tue, Nov 18, 2025 at 11:03 AM Igor Belianski < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Re: max-stalenss-ms interpretation >>>>>>>> proposal: >>>>>>>> A Materialized View(MNV) considered fresh if and only if the >>>>>>>> results stored are equivalent to the those that would have been >>>>>>>> obtained by >>>>>>>> running MV's defining query at some point in time within interval : >>>>>>>> [CurrentTime-max-staleness-ms, Current_time] >>>>>>>> >>>>>>>> Note: this definition allows for optimization proposed by option 2 >>>>>>>> (implementing which is definitely a great idea) , but doesn't mandate >>>>>>>> it. >>>>>>>> One can also imagine some other optimization that would be >>>>>>>> possible given definition above , and would be left up to the engines >>>>>>>> toi >>>>>>>> implement. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Nov 18, 2025 at 10:54 AM Steven Wu <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> A reminder for tomorrow's community sync for the MV spec. >>>>>>>>> https://calendar.app.google/T4zSk6qKWoy1vV6P7 >>>>>>>>> >>>>>>>>> We have one open question from the last meeting on how >>>>>>>>> `max-stalenesss-ms` should be interpreted. You can find more details >>>>>>>>> in the >>>>>>>>> meeting notes. >>>>>>>>> >>>>>>>>> https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o >>>>>>>>> >>>>>>>>> Please also bring other topics that we should discuss. >>>>>>>>> >>>>>>>>> On Sat, Nov 1, 2025 at 10:14 PM Steven Wu <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Sorry for the delay. Here are the recording and meeting notes for >>>>>>>>>> the MV sync meeting on Wednesday, Oct 29. >>>>>>>>>> >>>>>>>>>> https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o >>>>>>>>>> >>>>>>>>>> We have started to collect them in the above google doc. >>>>>>>>>> >>>>>>>>>> On Mon, Oct 27, 2025 at 8:58 AM Péter Váry < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> If we have materialized views (MVs) and support for incremental >>>>>>>>>>> change scans, then by introducing a Java-based representation of >>>>>>>>>>> the view, >>>>>>>>>>> we can expose a scan API that always returns up-to-date results for >>>>>>>>>>> the MV. >>>>>>>>>>> >>>>>>>>>>> The scan could include multiple tasks: >>>>>>>>>>> >>>>>>>>>>> - A task for reading the current version of the MV. >>>>>>>>>>> - An incremental change log scan covering the range between >>>>>>>>>>> the snapshot ID of the source table at the time the MV was last >>>>>>>>>>> refreshed >>>>>>>>>>> and its current snapshot ID. Applying the Java representation of >>>>>>>>>>> the view >>>>>>>>>>> when transformations are required. >>>>>>>>>>> >>>>>>>>>>> This approach allows us to build an always up-to-date index >>>>>>>>>>> table/single source MV, using existing components. >>>>>>>>>>> >>>>>>>>>>> Benny Chow <[email protected]> ezt írta (időpont: 2025. okt. >>>>>>>>>>> 24., P, 7:44): >>>>>>>>>>> >>>>>>>>>>>> Hi Peter >>>>>>>>>>>> >>>>>>>>>>>> I think the current proposal would support your example. In >>>>>>>>>>>> most situations, replace table operations after a view is >>>>>>>>>>>> materialized >>>>>>>>>>>> wouldn’t invalidate the materialization. However, if the view >>>>>>>>>>>> includes >>>>>>>>>>>> metadata columns, then the replace operations should invalidate the >>>>>>>>>>>> materialization. >>>>>>>>>>>> >>>>>>>>>>>> This also brings up another important point that engines will >>>>>>>>>>>> differ on what views can be materialized or not. For example, >>>>>>>>>>>> maybe >>>>>>>>>>>> metadata columns are not allowed similar to non deterministic >>>>>>>>>>>> functions >>>>>>>>>>>> like random. But some engines like Dremio may allow views that >>>>>>>>>>>> use current >>>>>>>>>>>> date functions. It should be possible for one engine to >>>>>>>>>>>> materialize a view >>>>>>>>>>>> and another engine to look at the query tree and decide it’s not a >>>>>>>>>>>> view it >>>>>>>>>>>> supports materializations on and choose not to use that >>>>>>>>>>>> materialization. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Benny >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Oct 23, 2025, at 8:44 AM, Péter Váry < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hi All, >>>>>>>>>>>> >>>>>>>>>>>> I’ve been catching up on the discussion and wanted to share an >>>>>>>>>>>> observation. One aspect that stands out to me in the proposed >>>>>>>>>>>> staleness >>>>>>>>>>>> evaluation logic is that snapshots which don’t modify data can >>>>>>>>>>>> still affect >>>>>>>>>>>> the view’s contents if the view includes metadata columns. >>>>>>>>>>>> >>>>>>>>>>>> I was considering using a materialized view as an index for a >>>>>>>>>>>> given table to accelerate the conversion of equality deletes to >>>>>>>>>>>> position >>>>>>>>>>>> deletes. For example, the query might look like: >>>>>>>>>>>> >>>>>>>>>>>> *SELECT _POS, _FILE, id FROM target_table* >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> During compaction, the materialized view would need to be >>>>>>>>>>>> refreshed to ensure it reflects the correct data. >>>>>>>>>>>> >>>>>>>>>>>> Does this seem like a valid use case? Or should we explicitly >>>>>>>>>>>> exclude scenarios like this? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Peter >>>>>>>>>>>> >>>>>>>>>>>> Steven Wu <[email protected]> ezt írta (időpont: 2025. okt. >>>>>>>>>>>> 20., H, 17:30): >>>>>>>>>>>> >>>>>>>>>>>>> Walaa, >>>>>>>>>>>>> >>>>>>>>>>>>> > while Option 2 is described in your summary as "giving >>>>>>>>>>>>> engines *flexibility* to determine freshness recursively >>>>>>>>>>>>> beyond a source MV", that *isn’t achievable* under the MV >>>>>>>>>>>>> evaluation model itself. >>>>>>>>>>>>> Because each MV treats upstream MVs as physical tables, >>>>>>>>>>>>> recursion stops at the first materialized boundary; *deeper >>>>>>>>>>>>> staleness cannot be discovered without switching to a logical-view >>>>>>>>>>>>> evaluation model, i.e., stepping outside the MV model altogether >>>>>>>>>>>>> (note that >>>>>>>>>>>>> in Option 3 we can determine recursive staleness while still >>>>>>>>>>>>> inside the MV >>>>>>>>>>>>> model).* >>>>>>>>>>>>> >>>>>>>>>>>>> In option 2, when determining the freshness of mv_3, engines >>>>>>>>>>>>> can choose to recursively evaluate the freshness of mv_1 and mv_2 >>>>>>>>>>>>> since >>>>>>>>>>>>> they are also MVs. But engines can also choose not to. >>>>>>>>>>>>> >>>>>>>>>>>>> > This means that there seems to be an implicit “Option 3”. >>>>>>>>>>>>> This option treats MVs as logical views, i.e., storing only view >>>>>>>>>>>>> versions + >>>>>>>>>>>>> base table snapshot IDs (no MV storage snapshot IDs, no per-path >>>>>>>>>>>>> lineage). >>>>>>>>>>>>> >>>>>>>>>>>>> In the new option 3 you described, how could the engine update >>>>>>>>>>>>> mv3's refresh state for base table_a and table_b? unless all >>>>>>>>>>>>> connected MVs >>>>>>>>>>>>> are refreshed and committed in one single transaction, one entry >>>>>>>>>>>>> per base >>>>>>>>>>>>> table doesn't seem feasible. That's the main reason for option 1 >>>>>>>>>>>>> to require >>>>>>>>>>>>> the lineage path information in refresh state for base tables. >>>>>>>>>>>>> >>>>>>>>>>>>> It also seems that option 3 can only interpret freshness >>>>>>>>>>>>> recursively, while today there are engines that support MVs >>>>>>>>>>>>> without >>>>>>>>>>>>> recursively evaluating source MVs. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Steven >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Oct 20, 2025 at 1:44 AM Walaa Eldin Moustafa < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Steven, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for organizing the series and summarizing the outcome. >>>>>>>>>>>>>> >>>>>>>>>>>>>> After re-reading the Option 1/2 proposal, initially I >>>>>>>>>>>>>> interpreted Option 1 as simply expanding MVs like regular >>>>>>>>>>>>>> logical views. On >>>>>>>>>>>>>> closer look, it is actually more complex. It also preserves >>>>>>>>>>>>>> per-path >>>>>>>>>>>>>> lineage state (e.g., multiple entries for the same base table >>>>>>>>>>>>>> via different >>>>>>>>>>>>>> parents), which increases expressiveness but significantly >>>>>>>>>>>>>> increases >>>>>>>>>>>>>> metadata complexity. So I agree it is not a practical option. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This means that there seems to be an implicit “Option 3”. >>>>>>>>>>>>>> This option treats MVs as logical views, i.e., storing only view >>>>>>>>>>>>>> versions + >>>>>>>>>>>>>> base table snapshot IDs (no MV storage snapshot IDs, no per-path >>>>>>>>>>>>>> lineage). >>>>>>>>>>>>>> Under this model, mv_3’s metadata might look like: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Type Name Tracked State >>>>>>>>>>>>>> ----- ------- ----------------------- >>>>>>>>>>>>>> view mv_1 view_version_id >>>>>>>>>>>>>> view mv_2 view_version_id >>>>>>>>>>>>>> table table_a table_snapshot_id >>>>>>>>>>>>>> table table_b table_snapshot_id >>>>>>>>>>>>>> >>>>>>>>>>>>>> This preserves logical semantics and aligns MV behavior with >>>>>>>>>>>>>> pure views. >>>>>>>>>>>>>> >>>>>>>>>>>>>> *If we choose Option 2 (treat source MV as a materialized >>>>>>>>>>>>>> table), we may have to be consider those constraints:* >>>>>>>>>>>>>> >>>>>>>>>>>>>> * Staleness only degrades up the chain. mv_1 and mv_2 may >>>>>>>>>>>>>> already be stale relative to the base tables, but if mv_3 is >>>>>>>>>>>>>> refreshed >>>>>>>>>>>>>> using their storage snapshots, then mv_3 will be marked as fresh >>>>>>>>>>>>>> under >>>>>>>>>>>>>> Option 2, even though all three MVs are stale relative to the >>>>>>>>>>>>>> base tables. >>>>>>>>>>>>>> >>>>>>>>>>>>>> * Engines can no longer discover staleness beyond mv_1. Once >>>>>>>>>>>>>> mv_3 sees mv_1 (or mv_2) as fresh based only on their storage >>>>>>>>>>>>>> snapshots, it >>>>>>>>>>>>>> will not expand into mv_1 or mv_2 to check whether they are >>>>>>>>>>>>>> stale relative >>>>>>>>>>>>>> to the base tables. >>>>>>>>>>>>>> >>>>>>>>>>>>>> * If mv_2 and mv_3 were purely logical views instead of MVs, >>>>>>>>>>>>>> they would evaluate directly against base tables and return >>>>>>>>>>>>>> newer data. >>>>>>>>>>>>>> Under Option 2, the same definitions but materialized upstream >>>>>>>>>>>>>> produce >>>>>>>>>>>>>> different data, not just different metadata. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Therefore, while Option 2 is described in your summary as >>>>>>>>>>>>>> "giving engines *flexibility* to determine freshness >>>>>>>>>>>>>> recursively beyond a source MV", that *isn’t achievable* >>>>>>>>>>>>>> under the MV evaluation model itself. >>>>>>>>>>>>>> Because each MV treats upstream MVs as physical tables, >>>>>>>>>>>>>> recursion stops at the first materialized boundary; *deeper >>>>>>>>>>>>>> staleness cannot be discovered without switching to a >>>>>>>>>>>>>> logical-view >>>>>>>>>>>>>> evaluation model, i.e., stepping outside the MV model altogether >>>>>>>>>>>>>> (note that >>>>>>>>>>>>>> in Option 3 we can determine recursive staleness while still >>>>>>>>>>>>>> inside the MV >>>>>>>>>>>>>> model).* >>>>>>>>>>>>>> >>>>>>>>>>>>>> Let me know your thoughts. I slightly prefer Option 3. I’m >>>>>>>>>>>>>> also fine with Option 2, but I don’t think the flexibility to >>>>>>>>>>>>>> recursively >>>>>>>>>>>>>> determine freshness actually exists under its evaluation model. >>>>>>>>>>>>>> Not sure if >>>>>>>>>>>>>> this changes anyone’s view, but I wanted to clarify how I’m >>>>>>>>>>>>>> reading it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Oct 8, 2025 at 11:11 PM Benny Chow <[email protected]> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I just listened to the recording. I'm the tech lead for MVs >>>>>>>>>>>>>>> at Dremio and responsible for both refresh management and query >>>>>>>>>>>>>>> rewrites >>>>>>>>>>>>>>> with MVs. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It's great that we seem to agree that Iceberg MV spec won't >>>>>>>>>>>>>>> require that MVs always be up to date in order to be usable for >>>>>>>>>>>>>>> query >>>>>>>>>>>>>>> rewrites. There can be many data consistency issues (as Dan >>>>>>>>>>>>>>> pointed out) >>>>>>>>>>>>>>> but that is the state of affairs today. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It sounds like we are converging on the following scenarios >>>>>>>>>>>>>>> for an engine to validate the MV freshness: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. Use storage table without any validation. This might be >>>>>>>>>>>>>>> the extreme "async MV" example. >>>>>>>>>>>>>>> 2. Ignore storage table even if one exists because SQL >>>>>>>>>>>>>>> command or use case requires that. >>>>>>>>>>>>>>> 3. Use storage table only if data is not more than x hours >>>>>>>>>>>>>>> old. This can be achieved with the proposed >>>>>>>>>>>>>>> refresh-start-timestamp-ms >>>>>>>>>>>>>>> which is currently in the proposed spec. For this to work >>>>>>>>>>>>>>> with MVs built on MVs, we should probably state in the spec >>>>>>>>>>>>>>> that if a MV is >>>>>>>>>>>>>>> built on another MV, then it needs to inherit the >>>>>>>>>>>>>>> refresh-start-timestamp-ms of the child MV. In Steven's >>>>>>>>>>>>>>> example, when >>>>>>>>>>>>>>> building mv3, refresh-start-timestamp-ms needs to be set to the >>>>>>>>>>>>>>> minimum of >>>>>>>>>>>>>>> mv1 or mv2's refresh-start-timestamp-ms. If this property name >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>> confusing, we can rename it to >>>>>>>>>>>>>>> "refresh-earliest-table-timestamp-ms". I >>>>>>>>>>>>>>> originally proposed this property and also listed out other >>>>>>>>>>>>>>> benefits here: >>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1779797796 >>>>>>>>>>>>>>> Also, at the time, MVs built on MVs weren't being considered. >>>>>>>>>>>>>>> Now that it >>>>>>>>>>>>>>> is, I would recommend we have both "refresh-start-timestamp-ms" >>>>>>>>>>>>>>> (when the >>>>>>>>>>>>>>> refresh was started on the storage table) and >>>>>>>>>>>>>>> "refresh-earliest-table-timestamp-ms" (used for freshness >>>>>>>>>>>>>>> validation). >>>>>>>>>>>>>>> 4. Don't use the storage table if it is older than X >>>>>>>>>>>>>>> hours. This is what I had originally proposed for the >>>>>>>>>>>>>>> *materialization.max-stalessness-ms* view property here: >>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1744837644 >>>>>>>>>>>>>>> It wasn't meant to validate the freshness but more to prevent >>>>>>>>>>>>>>> use of a >>>>>>>>>>>>>>> materialization after some criteria. >>>>>>>>>>>>>>> 5. Use storage table if recursive validation passes... i.e. >>>>>>>>>>>>>>> refresh-state matches the current expanded query tree state. >>>>>>>>>>>>>>> This is what >>>>>>>>>>>>>>> I think Steven is calling the "synchronous MV". >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For scenario 1-4, it would support the nice use case of an >>>>>>>>>>>>>>> Iceberg client using a view's data through the storage table >>>>>>>>>>>>>>> without >>>>>>>>>>>>>>> needing to know how to parse/validate/expand any view SQLs. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> In Dremio's planner, we primarily use scenario 1 and 4 >>>>>>>>>>>>>>> together to determine MV validity for query rewrite. Scenario >>>>>>>>>>>>>>> 2 and 5 also >>>>>>>>>>>>>>> apply in certain situations. For scenario 3, Dremio only >>>>>>>>>>>>>>> exposes the >>>>>>>>>>>>>>> "refresh-earliest-table-timestamp-ms" as an fyi to the user but >>>>>>>>>>>>>>> it would be >>>>>>>>>>>>>>> interesting to allow the user to set this time so that they >>>>>>>>>>>>>>> could run >>>>>>>>>>>>>>> queries and be 100% certain that they were not seeing data >>>>>>>>>>>>>>> older than x >>>>>>>>>>>>>>> hours. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Benny >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Oct 8, 2025 at 3:37 PM Steven Wu < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> correction for a typo. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Prashanth brought up another scenario of compaction/rewrite >>>>>>>>>>>>>>>> where a new snapshot was added *with* actual data change >>>>>>>>>>>>>>>> --> >>>>>>>>>>>>>>>> Prashanth brought up another scenario of compaction/rewrite >>>>>>>>>>>>>>>> where a new snapshot was added *without* actual data change >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Oct 8, 2025 at 2:12 PM Steven Wu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks everyone for joining the MV discussion meeting. We >>>>>>>>>>>>>>>>> will continue to have the recurring sync meeting on Wednesday >>>>>>>>>>>>>>>>> 9 am >>>>>>>>>>>>>>>>> (Pacific) every 3 weeks until we get to the finish line where >>>>>>>>>>>>>>>>> Jan's MV spec >>>>>>>>>>>>>>>>> PR [1] is merged. I have scheduled our next meeting on Oct 29 >>>>>>>>>>>>>>>>> in the >>>>>>>>>>>>>>>>> Iceberg dev events calendar. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Here is the video recording for today's meeting. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1-nfhBPDWLoAFDu5cKP0rwLd_30HB6byR/view?usp=sharing >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We mostly discussed freshness evaluation. Here is the >>>>>>>>>>>>>>>>> meeting summary. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. For tracking the refresh state for the source MV >>>>>>>>>>>>>>>>> [2], the consensus is option 2 (treating source MV as a >>>>>>>>>>>>>>>>> materialized table) >>>>>>>>>>>>>>>>> which would give engines the flexibility on freshness >>>>>>>>>>>>>>>>> determination >>>>>>>>>>>>>>>>> (recursive beyond source MV or not). >>>>>>>>>>>>>>>>> 2. Earlier design doc [3] discussed max staleness >>>>>>>>>>>>>>>>> config. But it wasn't reflected in the spec PR. The >>>>>>>>>>>>>>>>> general opinion is to >>>>>>>>>>>>>>>>> add the config to the spec PR. The open question is >>>>>>>>>>>>>>>>> whether the ` >>>>>>>>>>>>>>>>> materialization.max-staleness-ms` config should be >>>>>>>>>>>>>>>>> added to the view metadata or the storage table metadata. >>>>>>>>>>>>>>>>> Either can work. >>>>>>>>>>>>>>>>> We just need to decide which makes a little better fit. >>>>>>>>>>>>>>>>> 3. Prashanth brought up schema change with default >>>>>>>>>>>>>>>>> value and how it may affect the MV refresh state (for SQL >>>>>>>>>>>>>>>>> representation >>>>>>>>>>>>>>>>> with select *). Jan mentioned that snapshot contains >>>>>>>>>>>>>>>>> schema id when the >>>>>>>>>>>>>>>>> snapshot was created. Engine can compare the snapshot >>>>>>>>>>>>>>>>> schema id to the >>>>>>>>>>>>>>>>> source table schema id during freshness evaluation. There >>>>>>>>>>>>>>>>> is no need for >>>>>>>>>>>>>>>>> additional schema info in refresh-state tracking in the >>>>>>>>>>>>>>>>> storage table. >>>>>>>>>>>>>>>>> 4. Prashanth brought up another scenario of >>>>>>>>>>>>>>>>> compaction/rewrite where a new snapshot was added with >>>>>>>>>>>>>>>>> actual data change. >>>>>>>>>>>>>>>>> The general take is that the engine can optimize and >>>>>>>>>>>>>>>>> decide that MV is >>>>>>>>>>>>>>>>> fresh as the new snapshot doesn't have any data change. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We can add some clarifications in the spec PR for >>>>>>>>>>>>>>>>> freshness evaluation based on the above discussions. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/11041 >>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1_StBW5hCQhumhIvgbdsHjyW0ED3dWMkjtNzyPp9Sfr8/edit?tab=t.0 >>>>>>>>>>>>>>>>> [3] >>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?tab=t.0#heading=h.3wigecex0zls >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Sep 25, 2025 at 9:27 AM Steven Wu < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Iceberg materialized view has been discussed in the >>>>>>>>>>>>>>>>>> community for a long time. Thanks Jan Kaul for driving the >>>>>>>>>>>>>>>>>> discussion and >>>>>>>>>>>>>>>>>> the spec PR. It has been stalled for a long time due to lack >>>>>>>>>>>>>>>>>> of consensus >>>>>>>>>>>>>>>>>> on 1 or 2 topics. In Wed's Iceberg community sync meeting, >>>>>>>>>>>>>>>>>> Talat brought up >>>>>>>>>>>>>>>>>> the question on how to move forward and if we can have a >>>>>>>>>>>>>>>>>> dedicated meeting >>>>>>>>>>>>>>>>>> for MV. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I have set up a meeting on *Oct 8 (9-10 am Pacific)*. If >>>>>>>>>>>>>>>>>> you subscribe to the "Iceberg Dev Events" calendar, you >>>>>>>>>>>>>>>>>> should be able to see it. If not, here is the link: >>>>>>>>>>>>>>>>>> https://meet.google.com/nfe-guyq-pqf >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We are going to discuss >>>>>>>>>>>>>>>>>> * remaining open questions >>>>>>>>>>>>>>>>>> * unresolved concerns >>>>>>>>>>>>>>>>>> * the next step and hopefully some consensus on moving >>>>>>>>>>>>>>>>>> forward >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> MV spec PR is up to date. Jan has incorporated recent >>>>>>>>>>>>>>>>>> feedback. This should be the base of the discussion. >>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/11041 >>>>>>>>>>>>>>>>>> <https://www.google.com/url?q=https://github.com/apache/iceberg/pull/11041&sa=D&source=calendar&usd=2&usg=AOvVaw3w0TjRpwbC17AGzmxZmElM> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Dev discussion thread (a long-running thread started by >>>>>>>>>>>>>>>>>> Jan). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf >>>>>>>>>>>>>>>>>> <https://www.google.com/url?q=https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf&sa=D&source=calendar&usd=2&usg=AOvVaw0fotlsrnRBOb820mA5JRyB> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The mail archive has broken lineage and doesn't show all >>>>>>>>>>>>>>>>>> replies. Email subject is "*[DISCUSS] Iceberg >>>>>>>>>>>>>>>>>> Materialzied Views*". >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> Steven >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>
