Hi, thanks for summarizing all the challenges around the Federated Catalog!
I also think it is something we should work on.

On the topic of filtering unsupported tables, I also think we can
spawn another thread. Hive has a feature for declaring and checking a
processor's capabilities[1]. We could apply similar logic, but I
haven't found a perfect solution yet. For example, even if HMS
filters out those tables, CREATE TABLE with the same table name must
still fail. I personally think the client should instead return a
clear error message on read.
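
To make the read-time idea concrete, here is a minimal sketch of what
such a client-side check could look like. It assumes the table format
is exposed through a `table_type` table parameter (as mentioned later
in this thread) and that a missing parameter means a plain Hive table.
The class, method, and exception names are hypothetical and are not
part of Hive or HMS.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical client-side guard; not an existing Hive API.
final class UnsupportedTableCheck {

    // Hypothetical exception type carrying a readable explanation.
    static final class UnsupportedTableFormatException extends RuntimeException {
        UnsupportedTableFormatException(String message) {
            super(message);
        }
    }

    // Upper-case format names this client is able to read.
    private final Set<String> supportedFormats;

    UnsupportedTableCheck(Set<String> supportedFormats) {
        this.supportedFormats = supportedFormats;
    }

    // Called on read. The table stays visible in the catalog, so
    // CREATE TABLE with the same name still fails as expected.
    void checkReadable(String tableName, Map<String, String> tableParams) {
        // Assumption: a missing table_type parameter means a plain Hive table.
        String format = tableParams
                .getOrDefault("table_type", "HIVE")
                .toUpperCase(Locale.ROOT);
        if (!supportedFormats.contains(format)) {
            throw new UnsupportedTableFormatException(
                    "Table " + tableName + " uses format " + format
                            + ", which this client cannot read."
                            + " Supported formats: " + supportedFormats);
        }
    }

    public static void main(String[] args) {
        UnsupportedTableCheck check = new UnsupportedTableCheck(Set.of("HIVE"));
        check.checkReadable("db.plain_table", Map.of()); // passes silently
        try {
            check.checkReadable("db.events", Map.of("table_type", "ICEBERG"));
        } catch (UnsupportedTableFormatException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this shape, nothing is hidden from listings; only the read path
fails, and it fails with a message that names the unsupported format.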

Regards,
Okumin

- [1] 
https://github.com/apache/hive/blob/rel/release-4.2.0/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetastoreDefaultTransformer.java

On Tue, Mar 31, 2026 at 1:40 AM Butao Zhang <[email protected]> wrote:
>
> The suggestion proposed by Zhihua is: "add a plugin service between the 
> Metastore client and the engine, this plugin will translate the Hive metadata 
> to anything the engine wants."
>
> This essentially refers to implementing engine-specific read/write plugins 
> for various catalogs in HMS. These read/write plugins rely on engine-specific 
> capabilities and must be implemented according to the interface 
> specifications exposed by the engine. For example, in Gravitino, based on 
> Spark's DataSource V2 interface 
> (https://github.com/apache/gravitino/tree/main/spark-connector), read/write 
> support for various catalogs in Gravitino has been implemented.
>
> From my perspective, implementing the engine side may be a different story. 
> This is because such implementation requires deep understanding of the open 
> interfaces of specific engines (such as Spark and Trino), and involves 
> significant development effort. However, if we enhance HMS's multi-catalog 
> capabilities, we can encourage more community developers to get involved in 
> the future and implement read/write plugins for HMS catalogs across different 
> engines.
>
>
> Thanks,
> Butao Zhang
>
>
> On 2026/03/24 04:54:46 Zhihua Deng wrote:
> > +1 for engine-agnostic, unified metadata and discovery, multi-tenancy
> > and granular ACLs catalog federation.
> >
> > It's important to consider how the engine will consume the metadata before 
> > we start. As the catalog is engine-agnostic, I would like to add a plugin 
> > service between the Metastore client and the engine, this plugin will 
> > translate the Hive metadata to anything the engine wants.
> >
> > On 2026/03/23 19:45:44 Sai Hemanth Gantasala wrote:
> > > > +1 to Denys's and Butao's suggestions.
> > >
> > > Lisoda,
> > > 1) I agree that relying on external permission systems for basic table
> > > visibility can be complex and error-prone. However, introducing capability
> > > filtering, even based on format type, still moves HMS away from its core
> > > role as an engine-agnostic metadata service. We need a solution that
> > > addresses the operational complexity without compromising HMS neutrality.
> > > > 2) I see your point on operational complexity, but the need for external
> > > > permissions goes beyond format support; it is essential for multi-tenancy
> > > > and granular security. We must be able to hide a sensitive Iceberg table
> > > > from a user, even if their engine is capable of reading Iceberg.
> > > > Separating the security policy (ACLs) from the metadata definition (HMS)
> > > > remains the correct architectural approach IMO.
> > >
> > > Thanks,
> > > Sai
> > >
> > > On Mon, Mar 23, 2026 at 3:18 AM Butao Zhang <[email protected]> wrote:
> > >
> > > > > I mostly agree with Denys's viewpoint. That is, when querying Iceberg
> > > > > and Hudi tables in HMS, engines need to implement and configure their
> > > > > own connectors. These connectors are specific to each engine and have
> > > > > nothing to do with HMS itself. HMS serves as a neutral, unified
> > > > > metadata management service, responsible only for managing the
> > > > > lifecycle of catalogs (such as creation and deletion) and providing
> > > > > unified metadata authorization services.
> > > >
> > > >
> > > > > Some additional information in response to lisoda:
> > > >
> > > > > 1) Q1: HMS may store various types of tables (e.g., Iceberg, Hudi), and
> > > > > some engines may not be able to query certain types of tables stored
> > > > > in HMS.
> > > > > First, this issue seems unrelated to the multi-catalog or federated
> > > > > catalog approach I proposed. This is essentially a problem where
> > > > > multiple table formats (Iceberg, Hudi, etc.) are mixed within a single
> > > > > HMS catalog. When a compute engine is configured with this HMS
> > > > > catalog, it may be able to see all tables via `SHOW TABLES`, but it
> > > > > may only be able to query a subset of them. This issue should be
> > > > > handled at the compute engine level. For example, the engine can
> > > > > determine whether a table should be visible or whether it can be
> > > > > queried based on table attributes like `table_type`.
> > > > > For instance, StarRocks provides a catalog/connector called the Unified
> > > > > Catalog (
> > > > > https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/),
> > > > > which can query multiple table formats (such as Iceberg and Hudi)
> > > > > stored in the same HMS.
> > > >
> > > > > If users only want to query a specific type of table stored in the same
> > > > > HMS, such as Iceberg tables, they can create a dedicated
> > > > > catalog/connector, like the Iceberg Catalog (
> > > > > https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/).
> > > > > This catalog/connector allows users to see only Iceberg tables when
> > > > > running `SHOW TABLES`, and any other table formats will be invisible.
> > > >
> > > > Additionally, based on my tests, when using
> > > > `org.apache.iceberg.spark.SparkSessionCatalog`, Spark should be able to
> > > > query both Hive tables and Iceberg tables through the HMS catalog.
> > > >
> > > > > 2) Q2: Regarding the issue of circular catalogs, I believe this problem
> > > > > does not exist. When a compute engine is configured with an HMS
> > > > > catalog, that HMS catalog can only see its own catalog namespace
> > > > > (databases and tables). The engine cannot see information from other
> > > > > catalogs through this HMS catalog.
> > > >
> > > >
> > > > Thanks,
> > > > Butao Zhang
> > > > ---- Replied Message ----
> > > > From lisoda<[email protected]> <[email protected]>
> > > > Date 3/20/2026 22:53
> > > > To dev<[email protected]> <[email protected]>
> > > > > Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> > > > I understand your concern, but I may not have expressed myself clearly—I
> > > > don't intend to tightly couple the catalog with specific engine runtime
> > > > configurations either. What I'm suggesting is a lightweight convention
> > > > mechanism, not deep integration.
> > > > > My idea is actually quite simple: engines could report just a few
> > > > > boolean flags upon connection (e.g., supports_iceberg: true/false), or
> > > > > we could push the filtering logic down to the engine side via an SDK.
> > > > > This is less about "coupling" and more about a declarative contract.
> > > > From an engineering perspective, convention over configuration is
> > > > generally the better path:
> > > >
> > > > Convention (auto-reporting/filtering): The engine declares its
> > > > capabilities → HMS or the SDK automatically masks incompatible metadata.
> > > > This maintains a single source of truth—the physical properties of the
> > > > table (format, location) directly determine its visibility.
> > > >
> > > > > Configuration (manual access control): Administrators manually maintain
> > > > > a separate set of ACL rules outside of HMS to hide certain tables. This
> > > > > essentially creates a duplicate definition: the metadata layer already
> > > > > defines "this is an Iceberg table," and then the permission layer has
> > > > > to define "this engine shouldn't see this Iceberg table." As the
> > > > > number of tables or engines scales, this manual synchronization
> > > > > overhead becomes unmanageable.
> > > > > In other words, I'm not asking HMS to understand "what connectors Spark
> > > > > 3.4 has installed." I'm simply suggesting that the physical properties
> > > > > of the metadata (the format type) should automatically determine its
> > > > > distribution scope. If HMS remains completely agnostic and relies on
> > > > > external permission systems to retroactively hide visibility, doesn't
> > > > > that actually increase operational complexity?
> > > >
> > > >
> > > > ---- Replied Message ----
> > > > From Denys Kuzmenko<[email protected]> <[email protected]>
> > > > Date 03/20/2026 19:12
> > > > To [email protected]
> > > > Cc
> > > > > Subject Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
> > > > I don’t think tying catalog behavior to engine capabilities is a good
> > > > direction. A catalog should remain engine-agnostic and focus purely on
> > > > metadata management and discovery, not on the execution capabilities of
> > > > specific query engines.
> > > >
> > > > > Hive Metastore is intentionally designed as a neutral metadata service.
> > > > > It exposes table definitions, while each engine (e.g., Apache Spark,
> > > > > Trino, etc.) decides whether it can actually process those tables
> > > > > based on its configured connectors or format support. Introducing
> > > > > capability negotiation would effectively couple the catalog to
> > > > > specific engines and their runtime configuration, which breaks that
> > > > > separation of concerns and makes the catalog responsible for
> > > > > execution-layer logic.
> > > >
> > > > > If a particular engine does not support a given format or catalog (for
> > > > > example, it does not have the appropriate client/connector installed),
> > > > > the cleaner solution is access control, not metadata filtering. In
> > > > > practice, permissions can simply be removed for users of that engine
> > > > > on catalogs or tables they are not expected to query.
> > > >
> > > > > Keeping the catalog engine-agnostic preserves interoperability and
> > > > > avoids embedding engine-specific behavior into the metadata layer.
> > > >
> > >
> >
