Subject: Re: [DISCUSS] Core: Add OpenTelemetry MetricsReporter

Hi Ryan, Vaquar, and all,

Thanks for the discussion. Quick update on the points raised.


*Ryan: how OTel API is expected to be packaged*
OpenTelemetry's Java API is closer to the SLF4J model than to a
multi-back-end logging library. OTel graduated from the CNCF on 2026-05-21
[1] and the Java API has been on a stable 1.x line since 2021 (current
1.61.0). In practice the API ends up on the host classpath through one of
three paths - the OpenTelemetry Java agent, programmatic
OpenTelemetrySdk setup,
or engine-side integrations (Trino, vendor APM agents) - each of which
already brings `opentelemetry-api` with it. The case where the API is not
present is also the case where the user has not set
metrics-reporter-impl=...OtelMetricsReporter, so the reporter is never
loaded.

[1]
https://www.cncf.io/announcements/2026/05/21/cloud-native-computing-foundation-announces-opentelemetrys-graduation-solidifying-status-as-the-de-facto-observability-standard/

On surprise failures: users who don't opt in see zero OTel classes on the
runtime classpath. Users who opt in by setting the catalog property have by
definition adopted OTel and are expected to bring the API/SDK/exporter, the
same way SLF4J users bring a binding. A misconfiguration here is a local
fix, not something we ship to other users.

The compileOnly declaration adds only opentelemetry-api (~440 KB) at build
time; nothing requires "bundling the whole world." If the project would
rather match the SLF4J pattern more closely we can switch to implementation
opentelemetry-api instead - that's a one-line change. Happy to take
direction here; my own lean is compileOnly because the reporter is opt-in
and OTel users already have the API.


*Vaquar: snapshot.id <http://snapshot.id> cardinality*
You were right - iceberg.snapshot.id as a metric attribute was a
cardinality footgun. Commit 4461830a9 drops it from the attribute set; the
bounded set is now {table.name, schema.id} on scans and {table.name,
operation} on commits. Per-snapshot detail is still available through
ScanReport / CommitReport, the catalog's snapshot history, and (when
scan/commit tracing is added by the host) OTel exemplars. A "Cardinality"
section in the reporter's Javadoc documents this.


*Vaquar: combined-reporter test*
Added testCombinedWithAnotherReporter, exercising
MetricsReporters.combine(otelReporter,
otherReporter) and verifying both reporters receive ScanReport and
CommitReport.


*Q2 (module placement)*
No strong preference has been voiced, so unless someone objects we'll
proceed with iceberg-core + compileOnly opentelemetry-api as in the current
PR.

PR: https://github.com/apache/iceberg/pull/16250 (latest: 4461830a9)

Thanks,
Nori

On Sat, May 23, 2026 at 9:51 PM vaquar khan <[email protected]> wrote:

> Hi Nori,
>
> I strongly support this proposal. Adding a vendor-neutral observability
> pathway via OpenTelemetry into iceberg-core is a significant improvement
> over relying solely on LoggingMetricsReporter or catalog-specific
> implementations like the RESTMetricsReporter .
>
> I reviewed PR #16250 in detail and wanted to share a few observations:
>
> *1. Dependency Isolation * The decision to declare opentelemetry-api as a
> compileOnly dependency is an excellent design choice . By requiring the
> host application (Spark, Flink, Trino, etc.) to provide the SDK at runtime,
> the reporter has zero runtime footprint when OTel isn't on the classpath .
> In that scenario, GlobalOpenTelemetry simply returns a no-op instance and
> metric calls are silently dropped. This cleanly avoids the classloading
> conflicts (such as NoClassDefFoundError or version mismatches in shaded
> jars) that frequently plague large distributed systems with deep dependency
> trees .
>
> *2. Metric Cardinality Controls * I noticed that iceberg.snapshot.id is
> attached as an attribute on every scan and commit metric. Since snapshot
> IDs are monotonically increasing and unique per commit, this creates a new
> time series for every single snapshot. This is an unbounded cardinality
> dimension that can trigger state explosions in backends like Prometheus,
> Datadog, or Grafana Cloud, potentially consuming a large portion of a
> user's observability budget while degrading query performance.
>
> I suggest one of the following mitigations:
>
>    -
>
>    Move snapshot.id to exemplars or trace span attributes rather than
>    metric attributes, keeping it available for drill-down without inflating
>    cardinality .
>    -
>
>    Make it opt-in via a catalog property (e.g.,
>    otel-metrics-include-snapshot-id=false by default).
>    -
>
>    At a minimum, document the cardinality implications clearly and
>    recommend that users leverage OpenTelemetry Collector processors (like the
>    groupbyattrs or cardinality limiter) to enforce limits in production .
>
> Bounded attributes like iceberg.table.name and iceberg.operation are
> absolutely fine, as those have naturally limited domains .
>
> *3. Combined Reporter Testing * The existing unit tests cover ScanReport,
> CommitReport, null safety, accumulation, and no-op fallback well. One gap
> I'd like to see addressed is a test that exercises 
> MetricsReporters.combine(otelReporter,
> otherReporter) to verify that both reporters receive reports in tandem .
>
> There are pre-existing edge cases in the Iceberg metrics framework for
> example, during BaseTransaction initialization, custom reporters have
> occasionally failed to receive a CommitReport, with the framework falling
> back to LoggingMetricsReporter. While this PR doesn't introduce that
> issue, adding a targeted integration test that verifies the OTel reporter
> receives both ScanReport and CommitReport through the combine(..) path
> during transactional commits would strengthen confidence and prevent silent
> regressions as the framework evolves.
>
> Overall, this is an outstanding addition to the project. The
> implementation is clean, the no-op degradation is elegant, and the timing
> is right. Looking forward to seeing this land!
>
> Best regards,
>
> Viquar Khan
>
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> On Fri, 22 May 2026 at 17:10, Talat Uyarer via dev <[email protected]>
> wrote:
>
>> I also support this PR. Iceberg has good metrics and I agree @Yufei Gu
>> <[email protected]> catalog is not right place for that.
>>
>> On Fri, May 22, 2026 at 2:48 PM Ryan Blue <[email protected]> wrote:
>>
>>> I think it is fine to add `opentelemetry-api` as `compileOnly`. We now
>>> have checks in place that should prevent future dependency leaks into our
>>> runtime Jars.
>>>
>>> My only concern is making sure that this is usable. I'm not familiar
>>> with how OpenTelemetry works, but if this is expected to be bundled into
>>> applications that use it then we generally want to meet those assumptions.
>>> For libraries with multiple back-ends (like logging, for example) we would
>>> normally include the API so that we can load classes that use it, but not
>>> the specific back-end (log4j, logback, etc.). And when that API is very
>>> commonly distributed (like slf4j) we usually don't include it. We just want
>>> to avoid surprise failures for end users by meeting their expectations. (As
>>> long as it doesn't require bundling the whole world)
>>>
>>> Ryan
>>>
>>> On Thu, May 21, 2026 at 10:22 AM Yufei Gu <[email protected]> wrote:
>>>
>>>> Hi Noritaka,
>>>>
>>>> Thanks for writing this up. I think this is a good direction overall.
>>>> Using OpenTelemetry as a vendor neutral integration layer makes sense here.
>>>> It gives Iceberg clients a standard way to integrate with modern
>>>> observability platforms without adding per backend integrations into the
>>>> project. Moreover, even for REST catalogs, the catalog itself is usually
>>>> not the right place to serve as a full metrics monitoring platform. In
>>>> practice, metrics still need to flow into systems like Prometheus,
>>>> CloudWatch, Datadog, or Grafana Cloud for storage, dashboards, alerting,
>>>> and correlation. The main extra value a REST catalog could provide is
>>>> enrichment with catalog specific metadata, but that feels like a relatively
>>>> minor benefit compared to having a standard observability integration path
>>>> overall.
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Wed, May 20, 2026 at 6:39 PM Noritaka Sekiyama via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to propose adding an OpenTelemetry-based MetricsReporter to
>>>>> iceberg-core that exports ScanReport and CommitReport to any 
>>>>> OTLP-compatible
>>>>> backend.
>>>>>
>>>>> # Background
>>>>> Iceberg ships three built-in MetricsReporter implementations today:
>>>>> LoggingMetricsReporter, InMemoryMetricsReporter (Spark-internal), and
>>>>> RESTMetricsReporter (REST catalog only).
>>>>> None of them give users an out-of-the-box way to ship scan/commit
>>>>> metrics to an external observability platform.
>>>>> The gap applies to Spark users on non-REST catalogs and to all
>>>>> non-Spark engines (Trino, Flink, etc.).
>>>>>
>>>>> # Motivation
>>>>> OpenTelemetry is the vendor-neutral CNCF standard for telemetry,
>>>>> supported by every major observability backend (Prometheus, CloudWatch,
>>>>> Datadog, Grafana Cloud, etc.).
>>>>> A single OTLP-based MetricsReporter in Iceberg lets users reach all of
>>>>> these without per-vendor integrations in the project.
>>>>> This is complementary to #14360, which adds OTel support to HTTPClient
>>>>> at the REST-catalog HTTP layer; this proposal covers the
>>>>> Iceberg-level ScanReport / CommitReport layer.
>>>>>
>>>>> # Proposal
>>>>> Issue: https://github.com/apache/iceberg/issues/16169
>>>>> PR:    https://github.com/apache/iceberg/pull/16250
>>>>>
>>>>> The reporter follows the same SDK-ownership philosophy as #14360 -
>>>>> the host application (Spark/Flink/Trino/...) registers an OpenTelemetrySdk
>>>>> via GlobalOpenTelemetry, and the reporter just looks up a Meter from it.
>>>>> The reporter has zero Iceberg-specific catalog properties; everything
>>>>> else is owned by the host.
>>>>>
>>>>> The PR has been validated end-to-end against two unrelated OTLP
>>>>> backends (Databricks Zerobus and Amazon CloudWatch) - full procedures and
>>>>> queries are linked from the PR.
>>>>>
>>>>> # On dependencies
>>>>> Given the current sensitivity around new runtime dependencies in 1.11,
>>>>> the PR adds only opentelemetry-api to iceberg-core as compileOnly.
>>>>> The OpenTelemetry SDK and OTLP exporters are not added to the runtime 
>>>>> classpath
>>>>> - they come from the host application.
>>>>> opentelemetry-sdk / -sdk-testing are testImplementation only.
>>>>>
>>>>> # Questions for the community
>>>>>
>>>>> Q1. Any objection to taking the opentelemetry-api compileOnly
>>>>> dependency in iceberg-core?
>>>>> Q2. Module placement: iceberg-core (current PR), or a
>>>>> separate iceberg-opentelemetry module?
>>>>>
>>>>> Thanks,
>>>>> Noritaka Sekiyama, Databricks
>>>>>
>>>>

Reply via email to