Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive

Sai Hemanth Gantasala Mon, 23 Mar 2026 12:47:26 -0700

+1 to Denys's and Butao's suggestions.

Lisoda,
1) I agree that relying on external permission systems for basic table
visibility can be complex and error-prone. However, introducing capability
filtering, even based on format type, still moves HMS away from its core
role as an engine-agnostic metadata service. We need a solution that
addresses the operational complexity without compromising HMS neutrality.
2) I see your point on operational complexity, but the need for external
permissions goes beyond format support: it is essential for multi-tenancy
and granular security. We must be able to hide a sensitive Iceberg table
from a user, even if their engine is capable of reading Iceberg. Separating
the security policy (ACLs) from the metadata definition (HMS) remains the
correct architectural approach IMO.
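
To make the separation concrete, here is a minimal sketch. Everything in it
is hypothetical (TableMeta, AclPolicy, visible_tables and the sample data
are illustrative names, not HMS or Ranger APIs). It only shows the policy
being applied on top of unmodified metadata:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class TableMeta:
    """Engine-agnostic metadata, as a neutral catalog might return it."""
    name: str
    table_type: str  # e.g. "ICEBERG", "HUDI", "HIVE"


@dataclass
class AclPolicy:
    """Hypothetical external policy store: user -> tables the user may see."""
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def can_see(self, user: str, table: str) -> bool:
        return table in self.grants.get(user, set())


def visible_tables(tables: List[TableMeta], user: str,
                   policy: AclPolicy) -> List[TableMeta]:
    """Filter by security policy only; table format plays no role here."""
    return [t for t in tables if policy.can_see(user, t.name)]


# Assumed sample data: two Iceberg tables, one of them sensitive.
catalog = [
    TableMeta("sales", "ICEBERG"),
    TableMeta("salaries", "ICEBERG"),
    TableMeta("logs", "HIVE"),
]
policy = AclPolicy({
    "analyst": {"sales", "logs"},
    "hr": {"sales", "salaries", "logs"},
})
```

With this data, "analyst" never sees "salaries" even though it has the same
format as "sales", which a format-based capability filter could not express.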


Thanks,
Sai

On Mon, Mar 23, 2026 at 3:18 AM Butao Zhang <[email protected]> wrote:

> I mostly agree with Denys's viewpoint. That is, when querying Iceberg and
> Hudi tables in HMS, engines need to implement and configure their own
> connectors. These connectors are specific to each engine and have nothing
> to do with HMS itself. HMS serves as a neutral, unified metadata management
> service, responsible only for managing the lifecycle of catalogs (such as
> creation and deletion) and providing unified metadata authorization
> services.
>
>
> Some extra information in response to lisoda:
>
> 1) Q1: HMS may store various types of tables (e.g., Iceberg, Hudi), and
> some engines may not be able to query certain types of tables stored in HMS.
> First, this issue seems unrelated to the multi-catalog or federated
> catalog approach I proposed. This is essentially a problem where multiple
> table formats (Iceberg, Hudi, etc.) are mixed within a single HMS catalog.
> When a compute engine is configured with this HMS catalog, it may be able
> to see all tables via `SHOW TABLES`, but it may only be able to query a
> subset of them. This issue should be handled at the compute engine level.
> For example, the engine can determine whether a table should be visible or
> whether it can be queried based on table attributes like `table_type`.
> For instance, StarRocks provides a catalog/connector called the Unified
> Catalog (
> https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/),
> which can query multiple table formats (such as Iceberg and Hudi) stored in
> the same HMS.
>
> If users only want to query a specific type of table stored in the same
> HMS, such as Iceberg tables, they can create a dedicated catalog/connector,
> like the Iceberg Catalog (
> https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/).
> This catalog/connector allows users to see only Iceberg tables when running
> `SHOW TABLES`, and any other table formats will be invisible.
>
> Additionally, based on my tests, when using
> `org.apache.iceberg.spark.SparkSessionCatalog`, Spark should be able to
> query both Hive tables and Iceberg tables through the HMS catalog.
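
For illustration, the engine-side handling described above could look
roughly like the sketch below. It is only a sketch under assumptions: tables
are plain dicts rather than real HMS client objects, split_by_support is a
made-up helper, and treating a missing `table_type` as "HIVE" is my
assumption, not documented HMS behavior.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_support(tables: Iterable[Dict[str, str]],
                     supported_formats: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition table names into (queryable, unsupported) for one engine."""
    supported = {fmt.upper() for fmt in supported_formats}
    queryable: List[str] = []
    unsupported: List[str] = []
    for table in tables:
        # Assumption: a table without an explicit table_type is a plain Hive table.
        fmt = table.get("table_type", "HIVE").upper()
        if fmt in supported:
            queryable.append(table["name"])
        else:
            unsupported.append(table["name"])
    return queryable, unsupported


# Assumed sample listing from a single HMS catalog with mixed formats.
hms_tables = [
    {"name": "orders", "table_type": "ICEBERG"},
    {"name": "clicks", "table_type": "HUDI"},
    {"name": "legacy"},
]
```

An engine could hide the second list from `SHOW TABLES`, or list it and
reject queries with a clear error; either way the decision stays in the
engine and HMS returns the same metadata to everyone.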
>
> 2) Q2: Regarding circular catalogs, I believe this issue does not
> exist. When a compute engine is configured with an HMS catalog, that HMS
> catalog can only see its own catalog namespace (databases and tables). The
> engine cannot see information from other catalogs through this HMS catalog.
>
>
> Thanks,
> Butao Zhang
> ---- Replied Message ----
> From lisoda <[email protected]>
> Date 3/20/2026 22:53
> To dev <[email protected]>
> Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> I understand your concern, but I may not have expressed myself clearly—I
> don't intend to tightly couple the catalog with specific engine runtime
> configurations either. What I'm suggesting is a lightweight convention
> mechanism, not deep integration.
> My idea is actually quite simple: engines could report just a few boolean
> flags upon connection (e.g., `supports_iceberg: true/false`), or we could
> push the filtering logic down to the engine side via an SDK. This is less
> about "coupling" and more about a declarative contract.
> From an engineering perspective, convention over configuration is
> generally the better path:
>
> Convention (auto-reporting/filtering): The engine declares its
> capabilities → HMS or the SDK automatically masks incompatible metadata.
> This maintains a single source of truth—the physical properties of the
> table (format, location) directly determine its visibility.
>
> Configuration (manual access control): Administrators manually maintain a
> separate set of ACL rules outside of HMS to hide certain tables. This
> essentially creates a duplicate definition: the metadata layer already
> defines "this is an Iceberg table," and then the permission layer has to
> define "this engine shouldn't see this Iceberg table." As the number of
> tables or engines scales, this manual synchronization overhead becomes
> unmanageable.
>
> In other words, I'm not asking HMS to understand "what connectors Spark
> 3.4 has installed." I'm simply suggesting that the physical properties of
> the metadata (the format type) should automatically determine its
> distribution scope. If HMS remains completely agnostic and relies on
> external permission systems to retroactively hide visibility, doesn't that
> actually increase operational complexity?
>
>
> ---- Replied Message ----
> From Denys Kuzmenko <[email protected]>
> Date 03/20/2026 19:12
> To [email protected]
> Cc
> Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> I don’t think tying catalog behavior to engine capabilities is a good
> direction. A catalog should remain engine-agnostic and focus purely on
> metadata management and discovery, not on the execution capabilities of
> specific query engines.
>
> Hive Metastore is intentionally designed as a neutral metadata service. It
> exposes table definitions, while each engine (e.g., Apache Spark, Trino,
> etc.) decides whether it can actually process those tables based on its
> configured connectors or format support. Introducing capability negotiation
> would effectively couple the catalog to specific engines and their runtime
> configuration, which breaks that separation of concerns and makes the
> catalog responsible for execution-layer logic.
>
> If a particular engine does not support a given format or catalog (for
> example, it does not have the appropriate client/connector installed), the
> cleaner solution is access control, not metadata filtering. In practice,
> permissions can simply be removed for users of that engine on catalogs or
> tables they are not expected to query.
>
> Keeping the catalog engine-agnostic preserves interoperability and avoids
> embedding engine-specific behavior into the metadata layer.
>
