Re: [PR] Materialized View Spec [iceberg]

via GitHub Fri, 03 Apr 2026 09:42:55 -0700


wmoustafa commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3033212803



##########
format/view-spec.md:
##########
@@ -160,7 +178,121 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
+Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+
+**Consumer behavior:**
+
+When evaluating freshness, consumers:
+
+- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
+- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
+- May parse the view definition to implement more sophisticated policies.
+- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
+- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+
+**Producer behavior:**
+
+Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
+Different producers may have different freshness interpretations, based on how 
much of the refresh state's dependency graph should be evaluated.
+Some producers expect the entire dependency graph to be evaluated and 
therefore include source MV dependencies. Other producers may only expect 
dependencies in the MV's SQL to be evaluated and therefore do not include 
dependencies of source MVs.
+
+When writing the refresh state, producers:
+
+- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's intent. If the producers intent 
is such that it doesn't rely on the source-states to determine freshness, it 
may provide an empty list.
+- If the source state cannot be determined for all objects (for example, for 
non-Iceberg tables) may leave the source states list empty.
+- If a stored object is reachable through multiple paths in the dependency 
graph (diamond dependency pattern), all distinct source states have to be 
included in the list.
+
+#### Refresh state
+
+The refresh state record captures the dependencies in the materialized view's 
dependency graph.
+These dependencies include source Iceberg tables, views, and materialized 
views.
+
+The refresh state has the following fields:
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `view-version-id`         | The `version-id` of the 
materialized view when the refresh operation was performed  |
+| _required_  | `source-states`        | A list of [source 
states](#source-state) records |
+| _required_  | `refresh-start-timestamp-ms` | A timestamp of when the refresh 
operation was started |
+
+#### Source state
+
+Source state records capture the state of objects referenced by a materialized 
view including objects referenced by source materialized views.
+Each record has a `type` field that determines its form:
+
+| Type    | Description |
+|---------|-------------|
+| `table` | An Iceberg table, including storage tables of source materialized 
views |
+| `view`  | An Iceberg view, including source materialized views |
+
+Source materialized views are represented by two source state entries: one for 
the view itself and one for its storage table.
+
+#### Source table state
+
+A source table record captures the state of a source table (including source 
MV's storage table) at the time of the last refresh operation.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `type`         | A string that must be set to `table` |
+| _required_  | `name`         | A string specifying the name of the source 
table |
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _optional_  | `catalog`      | An optional name of the catalog. If not set, 
the catalog is the same as the materialized views' |

Review Comment:
   There is some resolution problem when storing the child table identifiers on 
the storage table side. This would be addressed if fully-resolved source table 
identifiers are recorded on the view version itself, as part of the view 
metadata.
   
   #### Example illustrating the problem
   
   Consider a materialized view with this query:
   
   ```sql
   SELECT a.id, b.value FROM N.A JOIN N.B ON a.id = b.id
   ```
   
   Created with `default-catalog = "C"` and `default-namespace = ["default"]`.
   
   The refresh state in the storage table's snapshot summary records:
   
   ```json
   {
     "source-states": [
       {"type": "table", "name": "A", "namespace": ["N"]},
       {"type": "table", "name": "B", "namespace": ["N"]}
     ]
   }
   ```
   
   These entries have `namespace + name` but no catalog. A consumer reading the 
storage table's snapshot cannot resolve `A` and `B` to actual tables without 
knowing the catalog. One might assume the catalog can be obtained from the 
view's `default-catalog`, but this doesn't hold in the general case: if the 
MV's query references an intermediate view, and that view's SQL references 
tables under a different `default-catalog`, the source entries from the 
expanded view have no link back to the intermediate view whose 
`default-catalog` resolved them. 
   
   In short: source identifiers in the refresh state are incomplete, and the 
missing part (catalog) is not recoverable.
   
   #### Proposed fix
   
   The producer has the full resolution context at refresh time: it knows the 
catalog (from `default-catalog`, namespace, and name of every source table 
after SQL analysis. 
   ```json
   {
     "version-id": 1,
     "default-catalog": "C",
     "default-namespace": ["default"],
     "storage-table": {"namespace": ["default"], "name": "my_mv__storage"},
     "source-tables": [
       {"namespace": ["N"], "name": "A"},
       {"namespace": ["N"], "name": "B"}
     ]
   }
   ```
   
   The partial identifiers `N.A` and `N.B` now co-exist with the 
`default-catalog` in the same metadata object (view), and can be jointly 
resolved.
   
   #### Added benefit: state completeness without parsing
   
   With source identifiers on the view version, a consumer can compare the 
view's declared `source-tables` against the refresh state's `source-states` to 
determine whether the producer recorded complete state without parsing the SQL. 
If the view declares sources `[A, B]` and the refresh state only contains 
`[A]`, the consumer knows the state is partial and can treat the MV as stale if 
it prefers.
   
   This cleanly separates concerns:
   
   - **View version** records the structural dependency: what tables the MV 
depends on (fully-resolved identifiers).
   - **Refresh state** (in the storage table snapshot summary) records the 
temporal state: what snapshot each source was at when the refresh ran.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Materialized View Spec [iceberg]

Reply via email to