Re: [DISCUSS] Table and Column Label Metadata in Iceberg REST Catalog

Andrei Tserakhau via dev Tue, 24 Mar 2026 13:37:27 -0700

Thanks Ryan!

Your point about avoiding first-class metadata requirements is exactly the
design principle here. Labels let each catalog surface what it knows
without the spec dictating what catalogs must track.


To build on this, I put together a POC showing the approach works across
the ecosystem.

Key design principles that held up in practice:

- No new requirements on catalogs. Labels are optional in the response. A
catalog that doesn't serve labels returns the same response as today.

- Catalog-scoped, not table state. Every catalog we tried already has
internal metadata separate from Iceberg properties — Polaris has
internalProperties, UC has uc_properties, Lakekeeper has namespace
properties in PostgreSQL. Labels just give this existing metadata a
standard way through the protocol.

- No property overriding. Labels are explicitly separate from table
properties. Properties configure behavior, labels describe context. Engines
know which is which.
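
A minimal client-side sketch of the first and third points, assuming labels
arrive as a flat string map under a top-level `labels` key next to `metadata`.
The label keys (`owner`, `sensitivity`) and the helper name `split_response`
are hypothetical, chosen for illustration; the exact shape is defined in the
spec PR, not here.

```python
import json


def split_response(raw: str):
    """Return (properties, labels) from a LoadTableResponse-like JSON body.

    Assumption: `labels` is an optional top-level map of string pairs.
    A missing or null field is treated as "no labels".
    """
    body = json.loads(raw)
    properties = body.get("metadata", {}).get("properties", {})
    labels = body.get("labels") or {}
    return properties, labels


# Response from a catalog that does not serve labels (same as today).
legacy = json.dumps(
    {"metadata": {"properties": {"write.format.default": "parquet"}}}
)

# Response from a catalog that does; the label keys are hypothetical.
labelled = json.dumps(
    {
        "metadata": {"properties": {"write.format.default": "parquet"}},
        "labels": {"owner": "data-platform", "sensitivity": "pii"},
    }
)

legacy_props, legacy_labels = split_response(legacy)
new_props, new_labels = split_response(labelled)
```

Table properties come back identical in both cases; only the separate labels
map differs, so a client that ignores labels behaves exactly as before.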

What I built:

- Spec change: https://github.com/apache/iceberg/pull/15750
- PyIceberg client: https://github.com/apache/iceberg-python/pull/3191

Catalog implementations:
- Polaris: https://github.com/apache/polaris/pull/4048 (labels from
internalProperties)
- Unity Catalog OSS: https://github.com/unitycatalog/unitycatalog/pull/1417
(labels from uc_properties)
- Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676 (labels
from namespace properties)

Full demo: https://github.com/laskoviymishka/irc-labels

Three catalogs, two languages (Java + Rust), 40-95 lines each. The pattern
is the same everywhere: each catalog already has internal metadata that
doesn't belong in table properties, and labels give it a standard way out
through the protocol.

The Polaris implementation also addresses
https://github.com/apache/polaris/issues/3222 - the community has been
asking for a way to surface business metadata alongside table loads. Labels
solve this without adding any requirements beyond an optional field.

Beyond ownership and classification, the demo also shows labels enabling AI
agent table selection (agents reason about tables using semantic labels
instead of guessing from column names) and governance via a trusted engine
(ClickHouse reading sensitivity labels to auto-generate masking policies).
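
As a rough sketch of the governance case, an engine could decide which columns
to mask from labels alone. This assumes column-level labels arrive as a mapping
from column name to a string map; that shape, the `sensitivity` key, and its
values are hypothetical, and this is not the code used in the demo.

```python
def columns_to_mask(column_labels, sensitive=frozenset({"pii", "secret"})):
    """Return sorted names of columns whose 'sensitivity' label is sensitive.

    Columns with no labels, or no 'sensitivity' label, are left unmasked.
    """
    return sorted(
        name
        for name, labels in column_labels.items()
        if labels.get("sensitivity") in sensitive
    )


# Hypothetical column labels as an engine might receive them.
column_labels = {
    "email": {"sensitivity": "pii"},
    "country": {"sensitivity": "public"},
    "order_id": {},
}

masked = columns_to_mask(column_labels)
```

The engine stays free to ignore labels entirely; a policy like this is opt-in
on the engine side and adds nothing to what the catalog must track.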

Happy to discuss the spec design or any of the implementation details.

Andrei

On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:

> I think that this is a reasonable way to solve some persistent issues that
> we've seen.
>
> Many catalogs track additional metadata that is not part of the table spec
> (or others) like "owner", and right now there is no way to exchange or
> share that information. I'm also hesitant to start including it as
> first-class metadata because that puts additional requirements on catalogs
> that may not align. For instance, Tabular had no concept of a table "owner"
> and instead used default grants at the schema level. I like that this
> solution allows catalogs to provide information in a generic way that
> doesn't add requirements in the REST spec. And it is an alternative to
> overriding table properties with catalog-managed information, which I think
> is an anti-pattern.
>
> Thanks, Andrei! I think this is a good idea.
>
> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <
> [email protected]> wrote:
>
>> Hi all,
>>
>> `LoadTableResponse` returns table metadata — schema, snapshots, file
>> locations — but catalogs have operational context about tables that has no
>> standard place to go: cost attribution, ownership, governance hints,
>> semantic metadata. Right now catalogs have two options:
>>
>> 1. Properties — durable, commit-versioned table state. Good for
>> persistent metadata; wrong for ephemeral catalog context.
>> 2. Custom fields — catalog-specific extensions with no interoperability.
>> Each catalog invents its own structure; engines have no basis to read them.
>>
>> The community has already identified this gap. Polaris opened an issue
>> [1] requesting a standard extension point in the IRC protocol for
>> catalog-managed metadata. Two earlier threads [2][3] explored column-level
>> metadata, though in the context of table format changes.
>>
>> We propose adding an optional `labels` field to `LoadTableResponse` for
>> catalog-managed metadata. Labels are string key-value pairs generated
>> per-request from the catalog's internal systems; nothing is written to
>> table files. Engines may use or ignore them entirely. Labels give catalog
>> providers a standard channel to surface context to any client without
>> bilateral custom integrations for every catalog-engine pair.
>>
>> Details:
>> - GitHub Issue: apache/iceberg#15521
>> - Design Document: [4]
>>
>> Please review the proposal and share your feedback.
>>
>> Thanks,
>> Andrei
>>
>> [1]: https://github.com/apache/polaris/issues/3222
>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>> [4]:
>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
>>
>
