szehon-ho commented on issue #6420:
URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1912878826
Hi @JanKaul. Thanks for putting this together. I went through the
detailed discussion, and it looks like the general consensus on the "Open Questions" in
the design doc is:
1. The pointer to the storage table should be stored as an optional field
in the view metadata (option 1)
2. Lineage information should be stored as additional fields in the summary
of the storage table (option 2)
3. Only the view (and not storage table) should be registered in catalog
(option 2)
Is that correct? Given that, I have summarized the additions we are
making to the current metadata spec below.
# View Metadata
| v1 | v2 | Field Name | Description |
|---|---|---|---|
| optional | materialization | An optional `materialization` struct. If the value is null the entity is a common view, otherwise it is a materialized view |
Materialization Struct
| v1 | Field Name | Description |
| -- | -- | -- |
| required | format-version | An integer version number for the materialized view format. Currently, this must be 1. Implementations must throw an exception if the materialized view's version is higher than the supported version. |
| required | storage-table | Table metadata location |
# Snapshot
| v1 | Field Name | Description |
| -- | -- | -- |
| optional | refresh-version-id | Version id of the materialized view when the refresh operation was performed. |
| optional | source-tables | A list of `source-table` records. |
Source Table Struct
| v1 | Field Name | Description |
| -- | -- | -- |
| required | identifier | Identifier of the table as defined in the SQL expression. |
| required | snapshot-id | Snapshot id of the source table when the last refresh operation was performed. |
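
To make the shape concrete, here is a rough sketch of these additions as plain Python dicts. Only the field names come from the tables above; the location, ids, table name, and the two helper functions are made up for illustration and are not part of any existing Iceberg API.

```python
# Hypothetical values throughout; only the field names follow the tables above.
SUPPORTED_MV_FORMAT_VERSION = 1

# Addition to the view metadata: an optional `materialization` struct.
view_metadata = {
    "materialization": {
        "format-version": 1,
        # Assumed example location of the storage table's metadata file.
        "storage-table": "s3://bucket/warehouse/mv_storage/metadata/v1.metadata.json",
    },
}

# Additions to a snapshot of the storage table.
storage_table_snapshot = {
    "refresh-version-id": 3,
    "source-tables": [
        {"identifier": "db.orders", "snapshot-id": 1234567890},
    ],
}


def is_materialized_view(metadata: dict) -> bool:
    """A null or missing `materialization` means a common view."""
    return metadata.get("materialization") is not None


def check_format_version(metadata: dict) -> None:
    """Raise if the materialized view format is newer than what we support."""
    materialization = metadata.get("materialization")
    if materialization is None:
        return  # common view, nothing to check
    version = materialization["format-version"]
    if version > SUPPORTED_MV_FORMAT_VERSION:
        raise ValueError(f"Unsupported materialized view format version: {version}")
```

With this shape, a reader first checks `is_materialized_view(view_metadata)` and only then follows `storage-table` to the storage table.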
Let me know if that looks right.
My 2c on this are:
1. The materialization struct having its own format version seems overkill
to me. Maybe we can just flatten it and make the materialization field directly
the storage-table pointer itself?
2. Similar to @jackye1995 in the comment above:
https://github.com/apache/iceberg/issues/6420#issuecomment-1398572156, I feel
having the list of source-tables is a bit difficult. Can we proceed without
this in the first cut? I feel the engines, if they wanted, could parse the
source tables, look them up, and get the snapshot-ids directly. They must
be able to parse the view SQL anyway, so they should be able to extract the source tables from it.
3. How about the 'refresh-strategy'? (Didn't see it in the Google doc.) I
feel it can go in the current 'properties' field of the view metadata. IIUC, @rdblue
also suggested putting this in the table properties of the storage table,
along with other fields like materialized_view_format_version and
view_identifier, which sounds fine too.
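
As a rough sketch of what I mean in point 2, an engine could resolve the current snapshot ids itself. This assumes the engine already has the source table identifiers from its own SQL parser; `current_source_snapshots`, the lookup callable, and the fake catalog are all hypothetical stand-ins, not an existing API.

```python
from typing import Callable, Dict, Iterable


def current_source_snapshots(
    identifiers: Iterable[str],
    load_snapshot_id: Callable[[str], int],
) -> Dict[str, int]:
    """Map each source table identifier to its current snapshot id.

    `identifiers` is assumed to come from the engine parsing the view SQL;
    `load_snapshot_id` stands in for a catalog lookup.
    """
    return {identifier: load_snapshot_id(identifier) for identifier in identifiers}


# Stand-in for a real catalog: identifier -> current snapshot id (made-up values).
fake_catalog = {"db.orders": 101, "db.customers": 202}

snapshots = current_source_snapshots(
    ["db.orders", "db.customers"],
    fake_catalog.__getitem__,
)
```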
If there is general consensus on the direction, it'd be great to move on to
the actual spec PR and discuss specifics there, as it seems this
proposal has been sitting for a while. I can also help with that, if needed.
Thanks.