JanKaul commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3037219155


##########
format/view-spec.md:
##########
@@ -160,7 +178,121 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
+Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+
+**Consumer behavior:**
+
+When evaluating freshness, consumers:
+
+- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
+- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
+- May parse the view definition to implement more sophisticated policies.
+- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
+- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+
+**Producer behavior:**
+
+Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
+Different producers may have different freshness interpretations, based on how 
much of the refresh state's dependency graph should be evaluated.
+Some producers expect the entire dependency graph to be evaluated and 
therefore include source MV dependencies. Other producers may only expect 
dependencies in the MV's SQL to be evaluated and therefore do not include 
dependencies of source MVs.
+
+When writing the refresh state, producers:
+
+- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's intent. If the producers intent 
is such that it doesn't rely on the source-states to determine freshness, it 
may provide an empty list.
+- If the source state cannot be determined for all objects (for example, for 
non-Iceberg tables) may leave the source states list empty.
+- If a stored object is reachable through multiple paths in the dependency 
graph (diamond dependency pattern), all distinct source states have to be 
included in the list.

Review Comment:
   I agree with Walaa and Benny that the consumer has no signal at all as to 
whether the list is expanded or whether non-Iceberg sources are involved. So if 
a consumer wants to provide certain freshness guarantees, it is actually forced 
to parse the SQL, for example to ensure that it doesn't contain a non-Iceberg 
source. Just by looking at the list, there is no way of telling that only 
Iceberg tables were used and that the state is sufficient.
   
   I was talking with Walaa and we identified that there are actually at least 
two different dimensions entangled in the current design of the source-states 
list. These are:
   1. Iceberg vs non-Iceberg tables
   2. Deep vs shallow nesting
   I think it would be valuable if we could give the consumer enough 
information to distinguish between them, and maybe even provide more 
information.
   
   **Potential solutions**:
   
   1. Iceberg vs non-Iceberg
   We could include source states with `type: external` that only contain the 
identifier of the source table. This way references to non-Iceberg tables are 
at least recorded in the refresh-state and the consumer knows about them. If no 
entry with `type: external` is present, the consumer knows that all sources are 
Iceberg tables and that it is able to determine the freshness.
   
   2. Deep vs shallow nesting
   I think including a single boolean flag, "expanded-source-mvs", provides 
value to the consumer by telling it about the producer's freshness 
interpretation. The consumer can then decide whether the list is enough or 
whether it would like to parse the SQL and expand the list. Without the flag, 
it would need to parse the SQL just to figure out whether the producer provided 
a deep or a shallow list.
   Flags can come across as tech debt, or as a sign that parts haven't been 
defined precisely. But I think this is not the case here, as there really are 
two ways to go about the nesting. And I think it makes sense for the producer 
to store that information and communicate it to the consumer.
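   
   To make the two signals concrete, here is a rough sketch of how a consumer 
might use them. It is only an illustration under assumptions: the entry layout, 
the `type: table` value, the `snapshot-id` field, the treatment of a missing 
flag as "shallow", and all function and class names are hypothetical. Only 
`source-states`, `type: external` and `expanded-source-mvs` come from the 
discussion above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceListSignals:
    """What a consumer can learn from the refresh state without parsing SQL."""
    has_external_sources: bool  # at least one entry with type "external"
    expanded_source_mvs: bool   # producer says the list is deep (expanded)


def read_signals(refresh_state_json: str) -> SourceListSignals:
    """Extract both signals from a JSON-encoded refresh state string."""
    state = json.loads(refresh_state_json)
    sources = state.get("source-states", [])
    return SourceListSignals(
        has_external_sources=any(s.get("type") == "external" for s in sources),
        # Assumption: a missing flag is treated conservatively as "shallow".
        expanded_source_mvs=bool(state.get("expanded-source-mvs", False)),
    )


def list_is_sufficient(signals: SourceListSignals, require_deep: bool) -> bool:
    """Return True if freshness can be decided from the list alone."""
    if signals.has_external_sources:
        return False  # the state of non-Iceberg sources is unknown
    if require_deep and not signals.expanded_source_mvs:
        return False  # the consumer would have to expand the list itself
    return True


# Hypothetical refresh state: one Iceberg table and one external source.
refresh_state = json.dumps({
    "expanded-source-mvs": False,
    "source-states": [
        {"type": "table", "namespace": ["db"], "name": "orders", "snapshot-id": 1},
        {"type": "external", "namespace": ["legacy"], "name": "customers"},
    ],
})

signals = read_signals(refresh_state)
print(signals)
print(list_is_sufficient(signals, require_deep=False))
```

   In this sketch the consumer falls back to parsing the SQL (or to treating 
the view as stale) whenever `list_is_sufficient` returns false. Neither check 
requires reading the view definition.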



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

