Re: [PR] Materialized View Spec [iceberg]

via GitHub Fri, 03 Apr 2026 07:40:43 -0700


wmoustafa commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3033123038



##########
format/view-spec.md:
##########
@@ -160,7 +178,121 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
+Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+
+**Consumer behavior:**
+
+When evaluating freshness, consumers:
+
+- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
+- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
+- May parse the view definition to implement more sophisticated policies.
+- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
+- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+
+**Producer behavior:**
+
+Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
+Different producers may have different freshness interpretations, based on how 
much of the refresh state's dependency graph should be evaluated.
+Some producers expect the entire dependency graph to be evaluated and 
therefore include source MV dependencies. Other producers may only expect 
dependencies in the MV's SQL to be evaluated and therefore do not include 
dependencies of source MVs.
+
+When writing the refresh state, producers:
+
+- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's intent. If the producers intent 
is such that it doesn't rely on the source-states to determine freshness, it 
may provide an empty list.
+- If the source state cannot be determined for all objects (for example, for 
non-Iceberg tables) may leave the source states list empty.
+- If a stored object is reachable through multiple paths in the dependency 
graph (diamond dependency pattern), all distinct source states have to be 
included in the list.

Review Comment:
   I'd like to propose a simplification by decomposing this into _three 
independent aspects_:
   * What constitutes sufficient state (logically)
   * What flexibility producers have
   * What options consumers have
   
   Finally, I would connect the above with how child table identifier 
resolution works across the view/table boundary.
   
   ### 1. What is logically a sufficient refresh state
   
   The definition of "sufficient" depends on how intermediate materialized 
views are treated:
   
   **Case A: No intermediate MVs (all sources are base tables, optionally 
through intermediate views)**
   Sufficient state = snapshot IDs of all leaf tables referenced by the view's 
query, in addition to intermediate view version IDs. This is the 
straightforward case.
   
   **Case B: Intermediate MVs treated as views (transparent expansion)**
   The engine expands intermediate MVs into their underlying queries. This 
reduces to Case A — sufficient state is the snapshot IDs of all deep leaf 
tables, plus version IDs of all intermediate views traversed during expansion.
   
   **Case C: Intermediate MVs treated as tables (opaque boundaries)**
   The engine treats intermediate MVs as materialized data sources. Sufficient 
state = snapshot IDs of the intermediate MVs' storage tables. No expansion 
beyond the MV boundary. Freshness of the intermediate MV is that MV's own 
concern.
   
   Without guidelines on what constitutes sufficiency, consumers have no way to 
distinguish "the producer intentionally recorded partial state" from "the 
producer recorded complete state." A consumer seeing two source entries has no 
signal whether that covers all sources or just the ones the producer chose to 
track. These definitions can be added to the spec for reference.
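   
   To make the three cases concrete, here is a rough Python sketch of how a 
sufficient state could be collected. Everything in it is hypothetical and not 
part of the proposed spec: the `CatalogObject` model, the `sufficient_state` 
helper, the example catalog, and the choice to record an expanded MV's version 
ID the same way as a view's.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class CatalogObject:
    """Hypothetical in-memory stand-in for a catalog entry."""
    kind: str                                  # "table", "view", or "mv"
    state_id: int                              # snapshot ID (table) or version ID (view/MV)
    deps: Tuple[str, ...] = ()                 # objects referenced by a view/MV query
    storage_snapshot_id: Optional[int] = None  # storage table snapshot, MVs only


def sufficient_state(
    root_deps: Iterable[str],
    catalog: Dict[str, CatalogObject],
    expand_mvs: bool,
) -> Dict[str, Tuple[str, Optional[int]]]:
    """Collect source states for the objects referenced by an MV's query.

    expand_mvs=True  -> Case B: intermediate MVs are expanded like views.
    expand_mvs=False -> Case C: intermediate MVs are opaque tables.
    With no intermediate MVs both settings give the Case A result.
    """
    state: Dict[str, Tuple[str, Optional[int]]] = {}
    stack = list(root_deps)
    while stack:
        name = stack.pop()
        if name in state:
            continue  # already visited (e.g. diamond dependency)
        obj = catalog[name]
        if obj.kind == "table":
            state[name] = ("snapshot-id", obj.state_id)
        elif obj.kind == "view" or (obj.kind == "mv" and expand_mvs):
            # Record the version ID and keep walking into the query's sources.
            state[name] = ("version-id", obj.state_id)
            stack.extend(obj.deps)
        else:
            # Opaque MV: record its storage table snapshot and stop here.
            state[name] = ("snapshot-id", obj.storage_snapshot_id)
    return state


# Illustrative catalog; names and IDs are made up.
catalog = {
    "orders": CatalogObject("table", 10),
    "customers": CatalogObject("table", 20),
    "v_orders": CatalogObject("view", 3, deps=("orders",)),
    "mv_daily": CatalogObject(
        "mv", 2, deps=("v_orders", "customers"), storage_snapshot_id=77
    ),
}

case_b = sufficient_state(["mv_daily", "customers"], catalog, expand_mvs=True)
case_c = sufficient_state(["mv_daily", "customers"], catalog, expand_mvs=False)
```

   In this sketch `case_b` contains entries for `mv_daily`, `v_orders`, 
`orders`, and `customers`, while `case_c` contains only `mv_daily` (by storage 
table snapshot) and `customers`. The two lists differ in length for the same 
view, which is exactly the ambiguity a consumer cannot resolve without knowing 
which case the producer followed.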
   
   ### 2. Producer options
   
   Producers decide what state to record at refresh time. This can be partial — 
a producer may only track sources from the same catalog, may skip non-Iceberg 
sources, or may not expand intermediate MVs.
   
   
   ### 3. Consumer options
   
   Consumers have two strategies:
   
   - **Trust the recorded state.** Simple to implement, but only correct if the 
producer recorded complete state. No way to verify this from the refresh state 
alone.
   - **Parse the view query independently.** The consumer re-analyzes the SQL, 
identifies all source tables, and checks whether the recorded state covers 
them. Correct but expensive, since it requires SQL parsing and cross-engine 
dialect translation.
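   
   The two strategies can be contrasted with a small Python sketch. The 
`evaluate` function, its return labels, and the plain name-to-ID dictionaries 
are assumptions made for illustration; the `required` set stands in for 
whatever the consumer's own SQL analysis would produce, which is not shown 
here.

```python
from typing import Dict, Optional, Set


def evaluate(
    recorded: Dict[str, int],
    current: Dict[str, int],
    required: Optional[Set[str]] = None,
) -> str:
    """Classify a materialized view from its recorded source states.

    recorded: source name -> state ID written by the producer at refresh time
    current:  source name -> state ID loaded from the catalog now
    required: source names the consumer derived by parsing the view query,
              or None when the consumer simply trusts the recorded state
    """
    # Both strategies: any recorded source that has moved makes the MV stale.
    for name, state_id in recorded.items():
        if current.get(name) != state_id:
            return "stale"

    if required is None:
        # Strategy 1: nothing recorded has changed, but completeness of the
        # recorded list cannot be verified from the refresh state alone.
        return "fresh-if-trusted"

    # Strategy 2: the recorded list must also cover every parsed source.
    missing = required - set(recorded)
    return "fresh" if not missing else "unverifiable"


recorded = {"orders": 10, "customers": 20}
current = {"orders": 10, "customers": 20, "returns": 5}

trusted = evaluate(recorded, current)
verified = evaluate(recorded, current, required={"orders", "customers"})
partial = evaluate(recorded, current, required={"orders", "customers", "returns"})
moved = evaluate(recorded, {"orders": 11, "customers": 20})
```

   Here `partial` comes out as `"unverifiable"` even though nothing the 
producer recorded has changed: the consumer's own analysis found a source 
(`returns`) that the recorded state does not mention.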
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

