wmoustafa opened a new pull request, #9830:
URL: https://github.com/apache/iceberg/pull/9830
## Spec
This patch adds support for materialized views in Iceberg and integrates the
implementation with Spark SQL. It reuses the current spec of Iceberg views and
tables by leveraging table properties to capture materialized view metadata.
Those properties can be added to the Iceberg spec to formalize materialized
view support.
Below is a summary of the metadata properties introduced or used by this
patch, grouped by whether they apply to a view or a table, along with their
purposes:
### Properties on a View:
1. **`iceberg.materialized.view`**:
- **Type**: View property
   - **Purpose**: Marks a view as a materialized view. When set to `true`,
the view is treated as materialized. This distinguishes materialized views
from virtual views within the catalog and triggers the handling and
validation logic specific to materialized views.
2. **`iceberg.materialized.view.storage.location`**:
- **Type**: View property
   - **Purpose**: Specifies the location of the storage table associated
with the materialized view. This links the materialized view to its storage
table, so that data management and query execution can take the stored
data's freshness into account.
### Properties on a Table:
1. **`base.snapshot.[UUID]`**:
- **Type**: Table property
- **Purpose**: These properties store the snapshot IDs of the base
tables at the time the materialized view's data was last updated. Each property
is prefixed with `base.snapshot.` followed by the UUID of the base table. They
are used to track whether the materialized view's data is up to date with the
base tables by comparing these snapshot IDs with the current snapshot IDs of
the base tables. If all the base tables' current snapshot IDs match the ones
stored in these properties, the materialized view's data is considered fresh.
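The freshness comparison described above can be sketched in plain Java. This is an illustrative sketch: the class name `FreshnessCheck`, the method `isFresh`, and the shape of the inputs are hypothetical, not the actual code in this patch.

```java
import java.util.Map;

// Hypothetical sketch of the freshness check: a materialized view is fresh
// when every base-table snapshot ID recorded in the storage table's
// properties matches that base table's current snapshot ID.
class FreshnessCheck {

  static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

  static boolean isFresh(Map<String, String> storageTableProps,
                         Map<String, Long> currentSnapshotIds) {
    for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
      if (!entry.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
        continue; // skip unrelated table properties
      }
      String baseTableUuid = entry.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
      Long current = currentSnapshotIds.get(baseTableUuid);
      if (current == null || current != Long.parseLong(entry.getValue())) {
        return false; // base table advanced (or is missing) since the data was written
      }
    }
    return true;
  }
}
```

Any base table whose current snapshot differs from the recorded one makes the view stale; properties that do not carry the `base.snapshot.` prefix are ignored.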
## Spark SQL
This patch introduces materialized view support in the Spark module by
adding the Spark SQL `CREATE MATERIALIZED VIEW` command and materialized
view handling for the `DROP VIEW` DDL command. When a `CREATE MATERIALIZED
VIEW` command is executed, the patch creates a new materialized view: it
registers the view's metadata (including marking it as a materialized view
with the appropriate properties), sets up a corresponding storage table to
hold the materialized data, and records the base tables' current snapshot
IDs at creation time. Conversely, when a `DROP VIEW` command is issued for a
materialized view, the patch ensures that both the materialized view's
metadata and its associated storage table are removed from the catalog.
Support for `REFRESH MATERIALIZED VIEW` is left as a future enhancement.
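The metadata written at creation time can be sketched as two small property maps, one for the view and one for its storage table. The class and method names below are hypothetical helpers for illustration; only the property keys come from this patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the metadata written at CREATE MATERIALIZED VIEW time.
class MaterializedViewMetadata {

  // View properties: mark the view as materialized and link its storage table.
  static Map<String, String> viewProperties(String storageLocation) {
    Map<String, String> props = new HashMap<>();
    props.put("iceberg.materialized.view", "true");
    props.put("iceberg.materialized.view.storage.location", storageLocation);
    return props;
  }

  // Storage-table properties: pin each base table's snapshot ID at creation time.
  static Map<String, String> baseSnapshotProperties(Map<String, Long> baseTableSnapshots) {
    Map<String, String> props = new HashMap<>();
    baseTableSnapshots.forEach((uuid, snapshotId) ->
        props.put("base.snapshot." + uuid, String.valueOf(snapshotId)));
    return props;
  }
}
```

The recorded `base.snapshot.[UUID]` values are what the later freshness check compares against the base tables' current snapshots.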
## Spark Catalog
This patch enhances the `SparkCatalog` to decide whether to return the view
text metadata for a materialized view or the data from its associated
storage table, based on the freshness of the materialized view.
Within the `loadTable` method, the patch first checks if the requested table
corresponds to a materialized view by loading the view from the Iceberg
catalog. If the identified view is marked as a materialized view (using the
`iceberg.materialized.view` property), the patch then assesses its freshness.
If it is fresh, the `loadTable` method proceeds to load and return the storage
table associated with the materialized view, allowing users to query the
pre-computed data directly. However, if the materialized view is stale, the
method simply returns to allow `SparkCatalog`'s `loadView` to run. In turn,
`loadView` returns the metadata for the virtual view itself, triggering the
usual Spark view logic that computes the result set based on the current state
of the base tables.
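The `loadTable` decision flow described above amounts to a small dispatch: not a materialized view, fresh, or stale. This is an illustrative sketch with hypothetical names, not the actual `SparkCatalog` code.

```java
// Hypothetical sketch of the loadTable decision flow for materialized views.
class LoadTableDecision {

  enum Resolution {
    NOT_A_MATERIALIZED_VIEW,      // regular table lookup proceeds as usual
    STORAGE_TABLE,                // fresh: return the storage table's data
    FALL_THROUGH_TO_LOAD_VIEW     // stale: let loadView return the view text
  }

  static Resolution resolve(boolean viewExists, boolean isMaterialized, boolean isFresh) {
    if (!viewExists || !isMaterialized) {
      return Resolution.NOT_A_MATERIALIZED_VIEW;
    }
    return isFresh ? Resolution.STORAGE_TABLE : Resolution.FALL_THROUGH_TO_LOAD_VIEW;
  }
}
```

Keeping the stale path as a fall-through means no special query-rewrite logic is needed: Spark's ordinary view resolution recomputes the result from the base tables.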
## Storage Table API
This patch uses the `HadoopCatalog` to manage the storage table associated
with each materialized view, referencing the table directly by its location.
This hides the storage table from direct access or manipulation through the
Spark SQL APIs, keeping it an internal component of the materialized view
implementation and preserving the abstraction between the user-facing view
definitions (expressed in SQL) and the underlying catalog implementation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]