Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive

Zhihua Deng Mon, 23 Mar 2026 21:54:59 -0700

+1 for an engine-agnostic catalog federation with unified metadata and
discovery, multi-tenancy, and granular ACLs.


It's important to consider how the engine will consume the metadata before we
start. Since the catalog is engine-agnostic, I would like to add a plugin
service between the Metastore client and the engine. This plugin would
translate the Hive metadata into whatever representation the engine needs.
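
As a rough illustration of the plugin idea, here is a minimal Python sketch.
Everything in it is an assumption made for illustration: `HiveTable` is a
heavily simplified stand-in for the real HMS table object, and
`MetadataTranslator`, `ExampleEngineTranslator`, `register_translator` and
`translate_for_engine` are hypothetical names, not an existing Hive API.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


@dataclass(frozen=True)
class HiveTable:
    """Hypothetical, simplified stand-in for an HMS table object."""
    catalog: str
    database: str
    name: str
    columns: Dict[str, str]  # column name -> Hive type string
    parameters: Dict[str, str] = field(default_factory=dict)


class MetadataTranslator(Protocol):
    """Hypothetical plugin contract: one implementation per engine."""

    def translate(self, table: HiveTable) -> dict:
        ...


class ExampleEngineTranslator:
    """Translator for a made-up engine with its own type names."""

    _TYPE_MAP = {"string": "VARCHAR", "bigint": "BIGINT", "double": "DOUBLE"}

    def translate(self, table: HiveTable) -> dict:
        return {
            "identifier": f"{table.catalog}.{table.database}.{table.name}",
            # Assumption: a missing `table_type` parameter means a plain Hive table.
            "format": table.parameters.get("table_type", "HIVE").upper(),
            "schema": [
                {"name": col, "type": self._TYPE_MAP.get(typ.lower(), typ.upper())}
                for col, typ in table.columns.items()
            ],
        }


# Registry that sits between the Metastore client and the engines.
_TRANSLATORS: Dict[str, MetadataTranslator] = {}


def register_translator(engine: str, translator: MetadataTranslator) -> None:
    _TRANSLATORS[engine] = translator


def translate_for_engine(engine: str, table: HiveTable) -> dict:
    """Look up the engine's plugin and translate; HMS itself stays untouched."""
    try:
        translator = _TRANSLATORS[engine]
    except KeyError:
        raise LookupError(f"no translator registered for engine {engine!r}")
    return translator.translate(table)
```

The point of the sketch is only that the translation lives outside the
catalog: the Metastore keeps serving the same neutral metadata, and each
engine plugs in its own translator.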

On 2026/03/23 19:45:44 Sai Hemanth Gantasala wrote:
> +1 to Denys's and Butao's suggestions.
> 
> Lisoda,
> 1) I agree that relying on external permission systems for basic table
> visibility can be complex and error-prone. However, introducing capability
> filtering, even based on format type, still moves HMS away from its core
> role as an engine-agnostic metadata service. We need a solution that
> addresses the operational complexity without compromising HMS neutrality.
> 2) I see your point on operational complexity, but the need for external
> permissions goes beyond format support; it is essential for multi-tenancy
> and granular security. We must be able to hide a sensitive Iceberg table
> from a user, even if their engine is capable of reading Iceberg. Separating
> the security policy (ACLs) from the metadata definition (HMS) remains the
> correct architectural approach IMO.
> 
> Thanks,
> Sai
> 
> On Mon, Mar 23, 2026 at 3:18 AM Butao Zhang <[email protected]> wrote:
> 
> > I mostly agree with Denys's viewpoint. That is, when querying Iceberg and
> > Hudi tables in HMS, engines need to implement and configure their own
> > connectors. These connectors are specific to each engine and have nothing
> > to do with HMS itself. HMS serves as a neutral, unified metadata management
> > service, responsible only for managing the lifecycle of catalogs (such as
> > creation and deletion) and providing unified metadata authorization
> > services.
> >
> >
> > Some extra information in response to lisoda:
> >
> > 1) Q1: HMS may store various types of tables (e.g., Iceberg, Hudi), and
> > some engines may not be able to query certain types of tables stored in HMS.
> > First, this issue seems unrelated to the multi-catalog or federated
> > catalog approach I proposed. This is essentially a problem where multiple
> > table formats (Iceberg, Hudi, etc.) are mixed within a single HMS catalog.
> > When a compute engine is configured with this HMS catalog, it may be able
> > to see all tables via `SHOW TABLES`, but it may only be able to query a
> > subset of them. This issue should be handled at the compute engine level.
> > For example, the engine can determine whether a table should be visible or
> > whether it can be queried based on table attributes like `table_type`.
> > For instance, StarRocks provides a catalog/connector called the Unified
> > Catalog (
> > https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/),
> > which can query multiple table formats (such as Iceberg and Hudi) stored in
> > the same HMS.
> >
> > If users only want to query a specific type of table stored in the same
> > HMS, such as Iceberg tables, they can create a dedicated catalog/connector,
> > like the Iceberg Catalog (
> > https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/).
> > This catalog/connector allows users to see only Iceberg tables when running
> > `SHOW TABLES`, and any other table formats will be invisible.
> >
> > Additionally, based on my tests, when using
> > `org.apache.iceberg.spark.SparkSessionCatalog`, Spark should be able to
> > query both Hive tables and Iceberg tables through the HMS catalog.
> >
> > 2) Q2: Regarding the issue of circular catalogs, I believe this does not
> > exist. When a compute engine is configured with an HMS catalog, that HMS
> > catalog can only see its own catalog namespace (databases and tables). The
> > engine cannot see information from other catalogs through this HMS catalog.
> >
> >
> > Thanks,
> > Butao Zhang
> > ---- Replied Message ----
> > From lisoda<[email protected]>
> > Date 3/20/2026 22:53
> > To dev<[email protected]>
> > Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> > I understand your concern, but I may not have expressed myself clearly—I
> > don't intend to tightly couple the catalog with specific engine runtime
> > configurations either. What I'm suggesting is a lightweight convention
> > mechanism, not deep integration.
> > My idea is actually quite simple: engines could report just a few boolean
> > flags upon connection (e.g., `supports_iceberg: true/false`), or we could
> > push the filtering logic down to the engine side via an SDK. This is less
> > about "coupling" and more about a declarative contract.
> > From an engineering perspective, convention over configuration is
> > generally the better path:
> >
> > Convention (auto-reporting/filtering): The engine declares its
> > capabilities → HMS or the SDK automatically masks incompatible metadata.
> > This maintains a single source of truth—the physical properties of the
> > table (format, location) directly determine its visibility.
> >
> > Configuration (manual access control): Administrators manually maintain a
> > separate set of ACL rules outside of HMS to hide certain tables. This
> > essentially creates a duplicate definition: the metadata layer already defines
> > "this is an Iceberg table," and then the permission layer has to define
> > "this engine shouldn't see this Iceberg table." As the number of tables or
> > engines scales, this manual synchronization overhead becomes unmanageable.
> >
> > In other words, I'm not asking HMS to understand "what connectors Spark
> > 3.4 has installed." I'm simply suggesting that the physical properties of
> > the metadata (the format type) should automatically determine its
> > distribution scope. If HMS remains completely agnostic and relies on
> > external permission systems to retroactively hide visibility, doesn't that
> > actually increase operational complexity?
> >
> >
> > ---- Replied Message ----
> > From Denys Kuzmenko<[email protected]>
> > Date 03/20/2026 19:12
> > To [email protected]
> > Cc
> > Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> > I don’t think tying catalog behavior to engine capabilities is a good
> > direction. A catalog should remain engine-agnostic and focus purely on
> > metadata management and discovery, not on the execution capabilities of
> > specific query engines.
> >
> > Hive Metastore is intentionally designed as a neutral metadata service. It
> > exposes table definitions, while each engine (e.g., Apache Spark, Trino,
> > etc.) decides whether it can actually process those tables based on its
> > configured connectors or format support. Introducing capability negotiation
> > would effectively couple the catalog to specific engines and their runtime
> > configuration, which breaks that separation of concerns and makes the
> > catalog responsible for execution-layer logic.
> >
> > If a particular engine does not support a given format or catalog (for
> > example, it does not have the appropriate client/connector installed), the
> > cleaner solution is access control, not metadata filtering. In practice,
> > permissions can simply be removed for users of that engine on catalogs or
> > tables they are not expected to query.
> >
> > Keeping the catalog engine-agnostic preserves interoperability and avoids
> > embedding engine-specific behavior into the metadata layer.
> >
>
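
To make the layering argued for in the quoted thread concrete, here is a
small Python sketch in which the security policy (ACLs) and the format
support of an engine are two independent filters, and neither of them lives
in the catalog itself. All names (`TableEntry`, `acl_filter`,
`engine_filter`, `show_tables`) are hypothetical, the deny-list is a toy
replacement for a real authorization service, and treating a missing
`table_type` parameter as a plain Hive table is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class TableEntry:
    """Hypothetical, simplified view of a table as listed by the catalog."""
    database: str
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)

    @property
    def table_format(self) -> str:
        # Assumption: no `table_type` parameter means a plain Hive table.
        return self.parameters.get("table_type", "HIVE").upper()


def acl_filter(
    tables: Iterable[TableEntry],
    user: str,
    denied: Set[Tuple[str, str, str]],
) -> List[TableEntry]:
    """Security layer: drop tables this user must not see.

    `denied` holds (user, database, table) triples; a toy stand-in for an
    external permission system.
    """
    return [t for t in tables if (user, t.database, t.name) not in denied]


def engine_filter(
    tables: Iterable[TableEntry],
    supported_formats: Set[str],
) -> List[TableEntry]:
    """Engine layer: keep only formats this engine's connector can read."""
    return [t for t in tables if t.table_format in supported_formats]


def show_tables(
    tables: Iterable[TableEntry],
    user: str,
    denied: Set[Tuple[str, str, str]],
    supported_formats: Set[str],
) -> List[str]:
    """What a user would see: ACLs first, then the engine's own format check."""
    visible = engine_filter(acl_filter(tables, user, denied), supported_formats)
    return sorted(f"{t.database}.{t.name}" for t in visible)
```

With this split, a sensitive Iceberg table can be hidden from a user whose
engine reads Iceberg perfectly well, and an engine without a Hudi connector
can skip Hudi tables without the catalog knowing anything about that engine.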
