wmoustafa opened a new pull request, #9830:
URL: https://github.com/apache/iceberg/pull/9830
## Spec
This patch adds support for materialized views in Iceberg and integrates the
implementation with Spark SQL. It reuses the current spec of Iceberg views and
tables by leveraging table properties to capture materialized view metadata.
Those properties can be added to the Iceberg spec to formalize materialized
view support.
Below is a summary of the metadata properties introduced or used by this
patch, grouped by whether they apply to a view or a table, along with their
purposes:
### Properties on a View:
1. **`iceberg.materialized.view`**:
- **Type**: View property
   - **Purpose**: Marks a view as a materialized view. When set to `true`,
the view is treated as materialized. This distinguishes materialized views
from virtual views within the catalog and triggers the handling and
validation logic specific to materialized views.
2. **`iceberg.materialized.view.storage.location`**:
- **Type**: View property
   - **Purpose**: Specifies the location of the storage table associated
with the materialized view. This links the materialized view to its storage
table, so that data management and query execution can take the stored
data's freshness into account.
### Properties on a Table:
1. **`base.snapshot.[UUID]`**:
- **Type**: Table property
- **Purpose**: These properties store the snapshot IDs of the base
tables at the time the materialized view's data was last updated. Each property
is prefixed with `base.snapshot.` followed by the UUID of the base table. They
are used to track whether the materialized view's data is up to date with the
base tables by comparing these snapshot IDs with the current snapshot IDs of
the base tables. If all the base tables' current snapshot IDs match the ones
stored in these properties, the materialized view's data is considered fresh.
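The freshness comparison described above can be sketched in plain Java. This is an illustrative sketch: the class name `FreshnessCheck`, the method `isFresh`, and the shape of the inputs are hypothetical, not the actual code in this patch.

```java
import java.util.Map;

// Hypothetical sketch of the freshness check: a materialized view is fresh
// when every base-table snapshot ID recorded in the storage table's
// properties matches that base table's current snapshot ID.
class FreshnessCheck {

  static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

  static boolean isFresh(Map<String, String> storageTableProps,
                         Map<String, Long> currentSnapshotIds) {
    for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
      if (!entry.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
        continue; // skip unrelated table properties
      }
      String baseTableUuid = entry.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
      Long current = currentSnapshotIds.get(baseTableUuid);
      if (current == null || current != Long.parseLong(entry.getValue())) {
        return false; // base table advanced (or is missing) since the data was written
      }
    }
    return true;
  }
}
```

Any base table whose current snapshot differs from the recorded one makes the view stale; properties that do not carry the `base.snapshot.` prefix are ignored.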
## Spark SQL
This patch introduces materialized view support in the Spark module by
adding the Spark SQL `CREATE MATERIALIZED VIEW` command and materialized
view handling for the `DROP VIEW` DDL command. When a `CREATE MATERIALIZED
VIEW` command is executed, the patch creates a new materialized view: it
registers the view's metadata (including marking it as a materialized view
with the appropriate properties), sets up a corresponding storage table to
hold the materialized data, and records the base tables' current snapshot
IDs at creation time. Conversely, when a `DROP VIEW` command is issued for a
materialized view, the patch ensures that both the materialized view's
metadata and its associated storage table are removed from the catalog.
Support for `REFRESH MATERIALIZED VIEW` is left as a future enhancement.
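The metadata written at creation time can be sketched as two small property maps, one for the view and one for its storage table. The class and method names below are hypothetical helpers for illustration; only the property keys come from this patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the metadata written at CREATE MATERIALIZED VIEW time.
class MaterializedViewMetadata {

  // View properties: mark the view as materialized and link its storage table.
  static Map<String, String> viewProperties(String storageLocation) {
    Map<String, String> props = new HashMap<>();
    props.put("iceberg.materialized.view", "true");
    props.put("iceberg.materialized.view.storage.location", storageLocation);
    return props;
  }

  // Storage-table properties: pin each base table's snapshot ID at creation time.
  static Map<String, String> baseSnapshotProperties(Map<String, Long> baseTableSnapshots) {
    Map<String, String> props = new HashMap<>();
    baseTableSnapshots.forEach((uuid, snapshotId) ->
        props.put("base.snapshot." + uuid, String.valueOf(snapshotId)));
    return props;
  }
}
```

The recorded `base.snapshot.[UUID]` values are what the later freshness check compares against the base tables' current snapshots.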
## Spark Catalog
This patch enhances the `SparkCatalog` to decide whether to return the view
text metadata for a materialized view or the data from its associated
storage table, based on the freshness of the materialized view.
Within the `loadTable` method, the patch first checks if the requested table
corresponds to a materialized view by loading the view from the Iceberg
catalog. If the identified view is marked as a materialized view (using the
`iceberg.materialized.view` property), the patch then assesses its freshness.
If it is fresh, the `loadTable` method proceeds to load and return the storage
table associated with the materialized view, allowing users to query the
pre-computed data directly. However, if the materialized view is stale, the
method simply returns to allow `SparkCatalog`'s `loadView` to run. In turn,
`loadView` returns the metadata for the virtual view itself, triggering the
usual Spark view logic that computes the result set based on the current state
of the base tables.
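The `loadTable` decision flow described above amounts to a small dispatch: not a materialized view, fresh, or stale. This is an illustrative sketch with hypothetical names, not the actual `SparkCatalog` code.

```java
// Hypothetical sketch of the loadTable decision flow for materialized views.
class LoadTableDecision {

  enum Resolution {
    NOT_A_MATERIALIZED_VIEW,      // regular table lookup proceeds as usual
    STORAGE_TABLE,                // fresh: return the storage table's data
    FALL_THROUGH_TO_LOAD_VIEW     // stale: let loadView return the view text
  }

  static Resolution resolve(boolean viewExists, boolean isMaterialized, boolean isFresh) {
    if (!viewExists || !isMaterialized) {
      return Resolution.NOT_A_MATERIALIZED_VIEW;
    }
    return isFresh ? Resolution.STORAGE_TABLE : Resolution.FALL_THROUGH_TO_LOAD_VIEW;
  }
}
```

Keeping the stale path as a fall-through means no special query-rewrite logic is needed: Spark's ordinary view resolution recomputes the result from the base tables.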
## Storage Table API
This patch uses the `HadoopCatalog` to manage the storage table associated
with each materialized view, referencing the table directly by its location.
This hides the storage table from direct access or manipulation through the
Spark SQL APIs, keeping it an internal component of the materialized view
implementation and preserving the abstraction between the user-facing view
definitions (expressed in SQL) and the underlying catalog implementation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]