Sorry for the delay. Here are the recording and meeting notes for the MV sync meeting on Wednesday, Oct 29. https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o
We have started to collect them in the above google doc. On Mon, Oct 27, 2025 at 8:58 AM Péter Váry <[email protected]> wrote: > If we have materialized views (MVs) and support for incremental change > scans, then by introducing a Java-based representation of the view, we can > expose a scan API that always returns up-to-date results for the MV. > > The scan could include multiple tasks: > > - A task for reading the current version of the MV. > - An incremental change log scan covering the range between the > snapshot ID of the source table at the time the MV was last refreshed and > its current snapshot ID. Applying the Java representation of the view when > transformations are required. > > This approach allows us to build an always up-to-date index table/single > source MV, using existing components. > > Benny Chow <[email protected]> ezt írta (időpont: 2025. okt. 24., P, 7:44): > >> Hi Peter >> >> I think the current proposal would support your example. In most >> situations, replace table operations after a view is materialized wouldn’t >> invalidate the materialization. However, if the view includes metadata >> columns, then the replace operations should invalidate the materialization. >> >> >> This also brings up another important point that engines will differ on >> what views can be materialized or not. For example, maybe metadata columns >> are not allowed similar to non deterministic functions like random. But >> some engines like Dremio may allow views that use current date functions. >> It should be possible for one engine to materialize a view and another >> engine to look at the query tree and decide it’s not a view it supports >> materializations on and choose not to use that materialization. >> >> Thanks >> Benny >> >> >> >> On Oct 23, 2025, at 8:44 AM, Péter Váry <[email protected]> >> wrote: >> >> >> Hi All, >> >> I’ve been catching up on the discussion and wanted to share an >> observation. One aspect that stands out to me in the proposed staleness >> evaluation logic is that snapshots which don’t modify data can still affect >> the view’s contents if the view includes metadata columns. >> >> I was considering using a materialized view as an index for a given table >> to accelerate the conversion of equality deletes to position deletes. For >> example, the query might look like: >> >> *SELECT _POS, _FILE, id FROM target_table* >> >> >> During compaction, the materialized view would need to be refreshed to >> ensure it reflects the correct data. >> >> Does this seem like a valid use case? Or should we explicitly exclude >> scenarios like this? >> >> Thanks, >> Peter >> >> Steven Wu <[email protected]> ezt írta (időpont: 2025. okt. 20., H, >> 17:30): >> >>> Walaa, >>> >>> > while Option 2 is described in your summary as "giving engines >>> *flexibility* to determine freshness recursively beyond a source MV", >>> that *isn’t achievable* under the MV evaluation model itself. >>> Because each MV treats upstream MVs as physical tables, recursion stops >>> at the first materialized boundary; *deeper staleness cannot be >>> discovered without switching to a logical-view evaluation model, i.e., >>> stepping outside the MV model altogether (note that in Option 3 we can >>> determine recursive staleness while still inside the MV model).* >>> >>> In option 2, when determining the freshness of mv_3, engines can choose >>> to recursively evaluate the freshness of mv_1 and mv_2 since they are also >>> MVs. But engines can also choose not to. >>> >>> > This means that there seems to be an implicit “Option 3”. This option >>> treats MVs as logical views, i.e., storing only view versions + base table >>> snapshot IDs (no MV storage snapshot IDs, no per-path lineage). >>> >>> In the new option 3 you described, how could the engine update mv3's >>> refresh state for base table_a and table_b? unless all connected MVs are >>> refreshed and committed in one single transaction, one entry per base table >>> doesn't seem feasible. That's the main reason for option 1 to require the >>> lineage path information in refresh state for base tables. >>> >>> It also seems that option 3 can only interpret freshness recursively, >>> while today there are engines that support MVs without recursively >>> evaluating source MVs. >>> >>> Thanks, >>> Steven >>> >>> >>> On Mon, Oct 20, 2025 at 1:44 AM Walaa Eldin Moustafa < >>> [email protected]> wrote: >>> >>>> Hi Steven, >>>> >>>> Thanks for organizing the series and summarizing the outcome. >>>> >>>> After re-reading the Option 1/2 proposal, initially I interpreted >>>> Option 1 as simply expanding MVs like regular logical views. On closer >>>> look, it is actually more complex. It also preserves per-path lineage state >>>> (e.g., multiple entries for the same base table via different parents), >>>> which increases expressiveness but significantly increases metadata >>>> complexity. So I agree it is not a practical option. >>>> >>>> This means that there seems to be an implicit “Option 3”. This option >>>> treats MVs as logical views, i.e., storing only view versions + base table >>>> snapshot IDs (no MV storage snapshot IDs, no per-path lineage). Under this >>>> model, mv_3’s metadata might look like: >>>> >>>> Type Name Tracked State >>>> ----- ------- ----------------------- >>>> view mv_1 view_version_id >>>> view mv_2 view_version_id >>>> table table_a table_snapshot_id >>>> table table_b table_snapshot_id >>>> >>>> This preserves logical semantics and aligns MV behavior with pure views. >>>> >>>> *If we choose Option 2 (treat source MV as a materialized table), we >>>> may have to be consider those constraints:* >>>> >>>> * Staleness only degrades up the chain. mv_1 and mv_2 may already be >>>> stale relative to the base tables, but if mv_3 is refreshed using their >>>> storage snapshots, then mv_3 will be marked as fresh under Option 2, even >>>> though all three MVs are stale relative to the base tables. >>>> >>>> * Engines can no longer discover staleness beyond mv_1. Once mv_3 sees >>>> mv_1 (or mv_2) as fresh based only on their storage snapshots, it will not >>>> expand into mv_1 or mv_2 to check whether they are stale relative to the >>>> base tables. >>>> >>>> * If mv_2 and mv_3 were purely logical views instead of MVs, they would >>>> evaluate directly against base tables and return newer data. Under Option >>>> 2, the same definitions but materialized upstream produce different data, >>>> not just different metadata. >>>> >>>> Therefore, while Option 2 is described in your summary as "giving >>>> engines *flexibility* to determine freshness recursively beyond a >>>> source MV", that *isn’t achievable* under the MV evaluation model >>>> itself. >>>> Because each MV treats upstream MVs as physical tables, recursion stops >>>> at the first materialized boundary; *deeper staleness cannot be >>>> discovered without switching to a logical-view evaluation model, i.e., >>>> stepping outside the MV model altogether (note that in Option 3 we can >>>> determine recursive staleness while still inside the MV model).* >>>> >>>> Let me know your thoughts. I slightly prefer Option 3. I’m also fine >>>> with Option 2, but I don’t think the flexibility to recursively determine >>>> freshness actually exists under its evaluation model. Not sure if this >>>> changes anyone’s view, but I wanted to clarify how I’m reading it. >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> >>>> On Wed, Oct 8, 2025 at 11:11 PM Benny Chow <[email protected]> wrote: >>>> >>>>> I just listened to the recording. I'm the tech lead for MVs at Dremio >>>>> and responsible for both refresh management and query rewrites with MVs. >>>>> >>>>> It's great that we seem to agree that Iceberg MV spec won't require >>>>> that MVs always be up to date in order to be usable for query rewrites. >>>>> There can be many data consistency issues (as Dan pointed out) but that is >>>>> the state of affairs today. >>>>> >>>>> It sounds like we are converging on the following scenarios for an >>>>> engine to validate the MV freshness: >>>>> >>>>> 1. Use storage table without any validation. This might be the >>>>> extreme "async MV" example. >>>>> 2. Ignore storage table even if one exists because SQL command or use >>>>> case requires that. >>>>> 3. Use storage table only if data is not more than x hours old. This >>>>> can be achieved with the proposed refresh-start-timestamp-ms which is >>>>> currently in the proposed spec. For this to work with MVs built on MVs, >>>>> we >>>>> should probably state in the spec that if a MV is built on another MV, >>>>> then >>>>> it needs to inherit the refresh-start-timestamp-ms of the child MV. In >>>>> Steven's example, when building mv3, refresh-start-timestamp-ms needs to >>>>> be >>>>> set to the minimum of mv1 or mv2's refresh-start-timestamp-ms. If this >>>>> property name is confusing, we can rename it to >>>>> "refresh-earliest-table-timestamp-ms". I originally proposed this >>>>> property >>>>> and also listed out other benefits here: >>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1779797796 >>>>> Also, at the time, MVs built on MVs weren't being considered. Now that it >>>>> is, I would recommend we have both "refresh-start-timestamp-ms" (when the >>>>> refresh was started on the storage table) and >>>>> "refresh-earliest-table-timestamp-ms" (used for freshness validation). >>>>> 4. Don't use the storage table if it is older than X hours. This is >>>>> what I had originally proposed for the >>>>> *materialization.max-stalessness-ms* view property here: >>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1744837644 >>>>> It wasn't meant to validate the freshness but more to prevent use of a >>>>> materialization after some criteria. >>>>> 5. Use storage table if recursive validation passes... i.e. >>>>> refresh-state matches the current expanded query tree state. This is what >>>>> I think Steven is calling the "synchronous MV". >>>>> >>>>> For scenario 1-4, it would support the nice use case of an Iceberg >>>>> client using a view's data through the storage table without needing to >>>>> know how to parse/validate/expand any view SQLs. >>>>> >>>>> In Dremio's planner, we primarily use scenario 1 and 4 together to >>>>> determine MV validity for query rewrite. Scenario 2 and 5 also apply in >>>>> certain situations. For scenario 3, Dremio only exposes the >>>>> "refresh-earliest-table-timestamp-ms" as an fyi to the user but it would >>>>> be >>>>> interesting to allow the user to set this time so that they could run >>>>> queries and be 100% certain that they were not seeing data older than x >>>>> hours. >>>>> >>>>> Thanks >>>>> Benny >>>>> >>>>> On Wed, Oct 8, 2025 at 3:37 PM Steven Wu <[email protected]> wrote: >>>>> >>>>>> correction for a typo. >>>>>> >>>>>> Prashanth brought up another scenario of compaction/rewrite where a >>>>>> new snapshot was added *with* actual data change >>>>>> --> >>>>>> Prashanth brought up another scenario of compaction/rewrite where a >>>>>> new snapshot was added *without* actual data change >>>>>> >>>>>> >>>>>> On Wed, Oct 8, 2025 at 2:12 PM Steven Wu <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Thanks everyone for joining the MV discussion meeting. We will >>>>>>> continue to have the recurring sync meeting on Wednesday 9 am (Pacific) >>>>>>> every 3 weeks until we get to the finish line where Jan's MV spec PR >>>>>>> [1] is >>>>>>> merged. I have scheduled our next meeting on Oct 29 in the Iceberg dev >>>>>>> events calendar. >>>>>>> >>>>>>> Here is the video recording for today's meeting. >>>>>>> >>>>>>> https://drive.google.com/file/d/1-nfhBPDWLoAFDu5cKP0rwLd_30HB6byR/view?usp=sharing >>>>>>> >>>>>>> We mostly discussed freshness evaluation. Here is the meeting >>>>>>> summary. >>>>>>> >>>>>>> 1. For tracking the refresh state for the source MV [2], the >>>>>>> consensus is option 2 (treating source MV as a materialized table) >>>>>>> which >>>>>>> would give engines the flexibility on freshness determination >>>>>>> (recursive >>>>>>> beyond source MV or not). >>>>>>> 2. Earlier design doc [3] discussed max staleness config. But it >>>>>>> wasn't reflected in the spec PR. The general opinion is to add the >>>>>>> config >>>>>>> to the spec PR. The open question is whether the ` >>>>>>> materialization.max-staleness-ms` config should be added to the >>>>>>> view metadata or the storage table metadata. Either can work. We >>>>>>> just need >>>>>>> to decide which makes a little better fit. >>>>>>> 3. Prashanth brought up schema change with default value and how >>>>>>> it may affect the MV refresh state (for SQL representation with >>>>>>> select *). >>>>>>> Jan mentioned that snapshot contains schema id when the snapshot was >>>>>>> created. Engine can compare the snapshot schema id to the source >>>>>>> table >>>>>>> schema id during freshness evaluation. There is no need for >>>>>>> additional >>>>>>> schema info in refresh-state tracking in the storage table. >>>>>>> 4. Prashanth brought up another scenario of compaction/rewrite >>>>>>> where a new snapshot was added with actual data change. The general >>>>>>> take is >>>>>>> that the engine can optimize and decide that MV is fresh as the new >>>>>>> snapshot doesn't have any data change. >>>>>>> >>>>>>> >>>>>>> We can add some clarifications in the spec PR for freshness >>>>>>> evaluation based on the above discussions. >>>>>>> >>>>>>> [1] https://github.com/apache/iceberg/pull/11041 >>>>>>> [2] >>>>>>> https://docs.google.com/document/d/1_StBW5hCQhumhIvgbdsHjyW0ED3dWMkjtNzyPp9Sfr8/edit?tab=t.0 >>>>>>> [3] >>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?tab=t.0#heading=h.3wigecex0zls >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Sep 25, 2025 at 9:27 AM Steven Wu <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> Iceberg materialized view has been discussed in the community for a >>>>>>>> long time. Thanks Jan Kaul for driving the discussion and the spec PR. >>>>>>>> It >>>>>>>> has been stalled for a long time due to lack of consensus on 1 or 2 >>>>>>>> topics. >>>>>>>> In Wed's Iceberg community sync meeting, Talat brought up the question >>>>>>>> on >>>>>>>> how to move forward and if we can have a dedicated meeting for MV. >>>>>>>> >>>>>>>> I have set up a meeting on *Oct 8 (9-10 am Pacific)*. If you >>>>>>>> subscribe to the "Iceberg Dev Events" calendar, you should be able >>>>>>>> to see it. If not, here is the link: >>>>>>>> https://meet.google.com/nfe-guyq-pqf >>>>>>>> >>>>>>>> We are going to discuss >>>>>>>> * remaining open questions >>>>>>>> * unresolved concerns >>>>>>>> * the next step and hopefully some consensus on moving forward >>>>>>>> >>>>>>>> MV spec PR is up to date. Jan has incorporated recent feedback. >>>>>>>> This should be the base of the discussion. >>>>>>>> https://github.com/apache/iceberg/pull/11041 >>>>>>>> <https://www.google.com/url?q=https://github.com/apache/iceberg/pull/11041&sa=D&source=calendar&usd=2&usg=AOvVaw3w0TjRpwbC17AGzmxZmElM> >>>>>>>> >>>>>>>> Dev discussion thread (a long-running thread started by Jan). >>>>>>>> https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf >>>>>>>> <https://www.google.com/url?q=https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf&sa=D&source=calendar&usd=2&usg=AOvVaw0fotlsrnRBOb820mA5JRyB> >>>>>>>> >>>>>>>> The mail archive has broken lineage and doesn't show all replies. >>>>>>>> Email subject is "*[DISCUSS] Iceberg Materialzied Views*". >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Steven >>>>>>>> >>>>>>>>
