This is an automated email from the ASF dual-hosted git repository. mmerli pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/pulsar.git
The following commit(s) were added to refs/heads/master by this push: new 41e515caf24 [improve] PIP 342: Support OpenTelemetry metrics in Pulsar client (#22178) 41e515caf24 is described below commit 41e515caf2474b3641f01a20d02df24468a2d53e Author: Matteo Merli <mme...@apache.org> AuthorDate: Fri Mar 22 15:21:05 2024 +0000 [improve] PIP 342: Support OpenTelemetry metrics in Pulsar client (#22178) --- pip/pip-342 OTel client metrics support.md | 168 +++++++++++++++++++++++++++++ 1 file changed, 168 insertions(+) diff --git a/pip/pip-342 OTel client metrics support.md b/pip/pip-342 OTel client metrics support.md new file mode 100644 index 00000000000..ebbe1e24660 --- /dev/null +++ b/pip/pip-342 OTel client metrics support.md @@ -0,0 +1,168 @@ +# PIP 342: Support OpenTelemetry metrics in Pulsar client + +## Motivation + +Current support for metric instrumentation in Pulsar client is very limited and poses a lot of +issues for integrating the metrics into any telemetry system. + +We have 2 ways that metrics are exposed today: + +1. Printing logs every 1 minute: While this is ok as it comes out of the box, it's very hard for + any application to get the data or use it in any meaningful way. +2. `producer.getStats()` or `consumer.getStats()`: Calling these methods will get access to + the rate of events in the last 1-minute interval. This is problematic because out of the + box the metrics are not collected anywhere. One would have to start its own thread to + periodically check these values and export them to some other system. + +Neither of these mechanism that we have today are sufficient to enable application to easily +export the telemetry data of Pulsar client SDK. + +## Goal + +Provide a good way for applications to retrieve and analyze the usage of Pulsar client operation, +in particular with respect to: + +1. Maximizing compatibility with existing telemetry systems +2. Minimizing the effort required to export these metrics + +## Why OpenTelemetry? + +[OpenTelemetry](https://opentelemetry.io/) is quickly becoming the de-facto standard API for metric and +tracing instrumentation. In fact, as part of [PIP-264](https://github.com/apache/pulsar/blob/master/pip/pip-264.md), +we are already migrating the Pulsar server side metrics to use OpenTelemetry. + +For Pulsar client SDK, we need to provide a similar way for application builder to quickly integrate and +export Pulsar metrics. + +### Why exposing OpenTelemetry directly in Pulsar API + +When deciding how to expose the metrics exporter configuration there are multiple options: + +1. Accept an `OpenTelemetry` object directly in Pulsar API +2. Build a pluggable interface that describe all the Pulsar client SDK events and allow application to + provide an implementation, perhaps providing an OpenTelemetry included option. + +For this proposal, we are following the (1) option. Here are the reasons: + +1. In a way, OpenTelemetry can be compared to [SLF4J](https://www.slf4j.org/), in the sense that it provides an API + on top of which different vendor can build multiple implementations. Therefore, there is no need to create a new + Pulsar-specific interface +2. OpenTelemetry has 2 main artifacts: API and SDK. For the context of Pulsar client, we will only depend on its + API. Applications that are going to use OpenTelemetry, will include the OTel SDK +3. Providing a custom interface has several drawbacks: + 1. Applications need to update their implementations every time a new metric is added in Pulsar SDK + 2. The surface of this plugin API can become quite big when there are several metrics + 3. If we imagine an application that uses multiple libraries, like Pulsar SDK, and each of these has its own + custom way to expose metrics, we can see the level of integration burden that is pushed to application + developers +4. It will always be easy to use OpenTelemetry to collect the metrics and export them using a custom metrics API. There + are several examples of this in OpenTelemetry documentation. + +## Public API changes + +### Enabling OpenTelemetry + +When building a `PulsarClient` instance, it will be possible to pass an `OpenTelemetry` object: + +```java +interface ClientBuilder { + // ... + ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry openTelemetry); +} +``` + +The common usage for an application would be something like: + +```java +// Creates a OpenTelemetry instance using environment variables to configure it +OpenTelemetry otel = AutoConfiguredOpenTelemetrySdk.builder().build() + .getOpenTelemetrySdk(); + +PulsarClient client = PulsarClient.builder() + .serviceUrl("pulsar://localhost:6650") + .openTelemetry(otel) + .build(); + +// .... +``` + +Even without passing the `OpenTelemetry` instance to Pulsar client SDK, an application using the OpenTelemetry +agent, will be able to instrument the Pulsar client automatically, because we default to use `GlobalOpenTelemetry.get()`. + +### Deprecating the old stats methods + +The old way of collecting stats will be deprecated in phases: + 1. Pulsar 3.3 - Old metrics deprecated, still enabled by default + 2. Pulsar 3.4 - Old metrics disabled by default + 3. Pulsar 4.0 - Old metrics removed + +Methods to deprecate: + +```java +interface ClientBuilder { + // ... + @Deprecated + ClientBuilder statsInterval(long statsInterval, TimeUnit unit); +} + +interface Producer { + @Deprecated + ProducerStats getStats(); +} + +interface Consumer { + @Deprecated + ConsumerStats getStats(); +} +``` + +## Initial set of metrics to include + +Based on the experience of Pulsar Go client SDK metrics ( +see: https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go), +this is the proposed initial set of metrics to export. + +Additional metrics could be added later on, though it's better to start with the set of most important metrics +and then evaluate any missing information. + +These metrics names and attributes will be considered "Experimental" for 3.3 release and might be subject to changes. +The plan is to finalize all the namings in 4.0 LTS release. + +Attributes with `[name]` brackets will not be included by default, to avoid high cardinality metrics. + +| OTel metric name | Type | Unit | Attributes | Description | +|-------------------------------------------------|---------------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------| +| `pulsar.client.connection.opened` | Counter | connections | | The number of connections opened | +| `pulsar.client.connection.closed` | Counter | connections | | The number of connections closed | +| `pulsar.client.connection.failed` | Counter | connections | | The number of failed connection attempts | +| `pulsar.client.producer.opened` | Counter | sessions | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`] | The number of producer sessions opened | +| `pulsar.client.producer.closed` | Counter | sessions | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`] | The number of producer sessions closed | +| `pulsar.client.consumer.opened` | Counter | sessions | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of consumer sessions opened | +| `pulsar.client.consumer.closed` | Counter | sessions | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of consumer sessions closed | +| `pulsar.client.consumer.message.received.count` | Counter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of messages explicitly received by the consumer application | +| `pulsar.client.consumer.message.received.size` | Counter | bytes | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of bytes explicitly received by the consumer application | +| `pulsar.client.consumer.receive_queue.count` | UpDownCounter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of messages currently sitting in the consumer receive queue | +| `pulsar.client.consumer.receive_queue.size` | UpDownCounter | bytes | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The total size in bytes of messages currently sitting in the consumer receive queue | +| `pulsar.client.consumer.message.ack` | Counter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of acknowledged messages | +| `pulsar.client.consumer.message.nack` | Counter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of negatively acknowledged messages | +| `pulsar.client.consumer.message.dlq` | Counter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of messages sent to DLQ | +| `pulsar.client.consumer.message.ack.timeout` | Counter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.subscription` | The number of messages that were not acknowledged in the configured timeout period, hence, were requested by the client to be redelivered | +| `pulsar.client.producer.message.send.duration` | Histogram | seconds | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`] | Publish latency experienced by the application, includes client batching time | +| `pulsar.client.producer.rpc.send.duration` | Histogram | seconds | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.response.status="success\|failed"` | Publish RPC latency experienced internally by the client when sending data to receiving an ack | +| `pulsar.client.producer.message.send.size` | Counter | bytes | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`], `pulsar.response.status="success\|failed"` | The number of bytes published | +| `pulsar.client.producer.message.pending.count"` | UpDownCounter | messages | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`] | The number of messages in the producer internal send queue, waiting to be sent | +| `pulsar.client.producer.message.pending.size` | UpDownCounter | bytes | `pulsar.tenant`, `pulsar.namespace`, [`pulsar.topic`], [`pulsar.partition`] | The size of the messages in the producer internal queue, waiting to sent | +| `pulsar.client.lookup.duration` | Histogram | seconds | `pulsar.lookup.transport-type="binary\|http"`, `pulsar.lookup.type="topic\|metadata\|schema\|list-topics"`, `pulsar.response.status="success\|failed"` | Duration of different types of client lookup operations | + +## Metrics cardinality + +The metrics data point will be tagged with these attributes: + + * `pulsar.tenant` + * `pulsar.namespace` + * `pulsar.topic` + * `pulsar.partition` + +By default the metrics will be exported with tenant and namespace attributes set. If an application wants to enable +a finer level, with higher cardinality, it can do so by using OpenTelemetry configuration. +