Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive

Butao Zhang Mon, 30 Mar 2026 09:40:17 -0700

The suggestion proposed by Zhihua is: "add a plugin service between the 
Metastore client and the engine, this plugin will translate the Hive metadata 
to anything the engine wants."


This essentially refers to implementing engine-specific read/write plugins for 
the various catalogs in HMS. These plugins rely on engine-specific capabilities 
and must follow the interface specifications exposed by each engine. For 
example, Gravitino's Spark connector 
(https://github.com/apache/gravitino/tree/main/spark-connector) builds on 
Spark's DataSource V2 interface to provide read/write support for the various 
catalogs in Gravitino.
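
As a rough illustration of the plugin idea, here is a minimal Python sketch of a 
translation layer that turns an HMS table record into an engine-specific 
descriptor. All names in it (HiveTableMetadata, MetadataTranslator, 
ExampleEngineTranslator) are hypothetical and are not part of Hive, HMS, or any 
engine API. Reading the format from a `table_type` parameter is also an 
assumption made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HiveTableMetadata:
    """Hypothetical, simplified stand-in for a table record fetched from HMS."""
    catalog: str
    database: str
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)


class MetadataTranslator(ABC):
    """Hypothetical plugin contract; each engine would supply its own subclass."""

    @abstractmethod
    def translate(self, table: HiveTableMetadata) -> Dict[str, str]:
        """Return a descriptor in whatever shape the target engine expects."""


class ExampleEngineTranslator(MetadataTranslator):
    """Illustrative translator for an imaginary engine."""

    def translate(self, table: HiveTableMetadata) -> Dict[str, str]:
        # Assumption: the table format is recorded under a `table_type`
        # parameter; tables without it are treated as plain Hive tables.
        table_format = table.parameters.get("table_type", "HIVE").lower()
        return {
            "identifier": f"{table.catalog}.{table.database}.{table.name}",
            "format": table_format,
        }


orders = HiveTableMetadata("hive", "sales", "orders", {"table_type": "ICEBERG"})
descriptor = ExampleEngineTranslator().translate(orders)
```

A real plugin would instead implement the target engine's own extension 
interface (for Spark, the DataSource V2 catalog interfaces), which is where 
most of the development effort lies.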

From my perspective, implementing the engine side may be a different story. 
This is because such implementation requires deep understanding of the open 
interfaces of specific engines (such as Spark and Trino), and involves 
significant development effort. However, if we enhance HMS's multi-catalog 
capabilities, we can encourage more community developers to get involved in 
the future and implement read/write plugins for HMS catalogs across different 
engines.


Thanks,
Butao Zhang


On 2026/03/24 04:54:46 Zhihua Deng wrote:
> +1 for engine-agnostic, unified metadata and discovery, multi-tenancy
> and granular ACLs catalog federation.
> 
> It's important to consider how the engine will consume the metadata before we 
> start. As the catalog is engine-agnostic, I would like to add a plugin 
> service between the Metastore client and the engine, this plugin will 
> translate the Hive metadata to anything the engine wants.
> 
> On 2026/03/23 19:45:44 Sai Hemanth Gantasala wrote:
> > +1 to Denys's and Butao's suggestions.
> > 
> > Lisoda,
> > 1) I agree that relying on external permission systems for basic table
> > visibility can be complex and error-prone. However, introducing capability
> > filtering, even based on format type, still moves HMS away from its core
> > role as an engine-agnostic metadata service. We need a solution that
> > addresses the operational complexity without compromising HMS neutrality.
> > 2) I see your point on operational complexity, but the need for external
> > permissions goes beyond format support: it is essential for multi-tenancy
> > and granular security. We must be able to hide a sensitive Iceberg table
> > from a user, even if their engine is capable of reading Iceberg. Separating
> > the security policy (ACLs) from the metadata definition (HMS) remains the
> > correct architectural approach IMO.
> > 
> > Thanks,
> > Sai
> > 
> > On Mon, Mar 23, 2026 at 3:18 AM Butao Zhang <[email protected]> wrote:
> > 
> > > I mostly agree with Denys's viewpoint. That is, when querying Iceberg and
> > > Hudi tables in HMS, engines need to implement and configure their own
> > > connectors. These connectors are specific to each engine and have nothing
> > > to do with HMS itself. HMS serves as a neutral, unified metadata management
> > > service, responsible only for managing the lifecycle of catalogs (such as
> > > creation and deletion) and providing unified metadata authorization
> > > services.
> > >
> > >
> > > Some extra information in response to lisoda:
> > >
> > > 1) Q1: HMS may store various types of tables (e.g., Iceberg, Hudi), and
> > > some engines may not be able to query certain types of tables stored in HMS.
> > > First, this issue seems unrelated to the multi-catalog or federated
> > > catalog approach I proposed. This is essentially a problem where multiple
> > > table formats (Iceberg, Hudi, etc.) are mixed within a single HMS catalog.
> > > When a compute engine is configured with this HMS catalog, it may be able
> > > to see all tables via `SHOW TABLES`, but it may only be able to query a
> > > subset of them. This issue should be handled at the compute engine level.
> > > For example, the engine can determine whether a table should be visible or
> > > whether it can be queried based on table attributes like `table_type`.
> > > For instance, StarRocks provides a catalog/connector called the Unified
> > > Catalog (
> > > https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/),
> > > which can query multiple table formats (such as Iceberg and Hudi) stored in
> > > the same HMS.
> > >
> > > If users only want to query a specific type of table stored in the same
> > > HMS, such as Iceberg tables, they can create a dedicated catalog/connector,
> > > like the Iceberg Catalog (
> > > https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/).
> > > This catalog/connector allows users to see only Iceberg tables when running
> > > `SHOW TABLES`, and any other table formats will be invisible.
> > >
> > > Additionally, based on my tests, when using
> > > `org.apache.iceberg.spark.SparkSessionCatalog`, Spark should be able to
> > > query both Hive tables and Iceberg tables through the HMS catalog.
> > >
> > > 2) Q2: Regarding circular catalogs, I believe this issue does not exist.
> > > When a compute engine is configured with an HMS catalog, that HMS catalog
> > > can only see its own catalog namespace (databases and tables). The engine
> > > cannot see information from other catalogs through this HMS catalog.
> > >
> > >
> > > Thanks,
> > > Butao Zhang
> > > ---- Replied Message ----
> > > From lisoda<[email protected]> <[email protected]>
> > > Date 3/20/2026 22:53
> > > To dev<[email protected]> <[email protected]>
> > > Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> > > I understand your concern, but I may not have expressed myself clearly—I
> > > don't intend to tightly couple the catalog with specific engine runtime
> > > configurations either. What I'm suggesting is a lightweight convention
> > > mechanism, not deep integration.
> > > My idea is actually quite simple: engines could report just a few boolean
> > > flags upon connection (e.g., supports_iceberg: true/false), or we could
> > > push the filtering logic down to the engine side via an SDK. This is less
> > > about "coupling" and more about a declarative contract.
> > > From an engineering perspective, convention over configuration is
> > > generally the better path:
> > >
> > > Convention (auto-reporting/filtering): The engine declares its
> > > capabilities → HMS or the SDK automatically masks incompatible metadata.
> > > This maintains a single source of truth—the physical properties of the
> > > table (format, location) directly determine its visibility.
> > >
> > > Configuration (manual access control): Administrators manually maintain a
> > > separate set of ACL rules outside of HMS to hide certain tables. This
> > > essentially creates a duplicate definition: the metadata layer already
> > > defines "this is an Iceberg table," and then the permission layer has to define
> > > "this engine shouldn't see this Iceberg table." As the number of tables or
> > > engines scales, this manual synchronization overhead becomes unmanageable.
> > > In other words, I'm not asking HMS to understand "what connectors Spark
> > > 3.4 has installed." I'm simply suggesting that the physical properties of
> > > the metadata (the format type) should automatically determine its
> > > distribution scope. If HMS remains completely agnostic and relies on
> > > external permission systems to retroactively hide visibility, doesn't that
> > > actually increase operational complexity?
> > >
> > >
> > > ---- Replied Message ----
> > > From Denys Kuzmenko<[email protected]> <[email protected]>
> > > Date 03/20/2026 19:12
> > > To [email protected]
> > > Cc
> > > Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> > > I don’t think tying catalog behavior to engine capabilities is a good
> > > direction. A catalog should remain engine-agnostic and focus purely on
> > > metadata management and discovery, not on the execution capabilities of
> > > specific query engines.
> > >
> > > Hive Metastore is intentionally designed as a neutral metadata service. It
> > > exposes table definitions, while each engine (e.g., Apache Spark, Trino,
> > > etc.) decides whether it can actually process those tables based on its
> > > configured connectors or format support. Introducing capability negotiation
> > > would effectively couple the catalog to specific engines and their runtime
> > > configuration, which breaks that separation of concerns and makes the
> > > catalog responsible for execution-layer logic.
> > >
> > > If a particular engine does not support a given format or catalog (for
> > > example, it does not have the appropriate client/connector installed), the
> > > cleaner solution is access control, not metadata filtering. In practice,
> > > permissions can simply be removed for users of that engine on catalogs or
> > > tables they are not expected to query.
> > >
> > > Keeping the catalog engine-agnostic preserves interoperability and avoids
> > > embedding engine-specific behavior into the metadata layer.
> > >
> > 
>
