GitHub user Radiancebobo created a discussion: Should we expose an
IO‑thread‑level publish buffer backpressure metric?
Currently, connection‑level publish backpressure already has a metric:
```
pulsar_broker_throttled_connections
```
This is useful for observing throttling caused by per‑connection limits.
However, the publish buffer protection controlled by
`maxMessagePublishBufferSizeInMB` is applied at the IO‑thread level. When the
pending publish bytes on an IO thread exceed the configured threshold, the
broker pauses connections on that IO thread to protect memory usage. From what
I understand, this state is currently maintained inside the broker, but it is
**not directly visible as an independent metric**.
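To make the mechanism concrete, here is a minimal Java sketch of how I picture it. It is a simplified model, not the broker's actual code: the class `IoThreadPublishBuffer`, its method names, and the resume threshold at half of the pause threshold are all assumptions made for illustration. The value returned by `throttledConnections()` is the kind of current‑state number a dedicated metric could report.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical model of the publish buffer of a single IO thread.
 * Not Pulsar's real implementation; names and thresholds are assumptions.
 * No synchronization: all calls are assumed to come from the owning IO thread.
 */
class IoThreadPublishBuffer {
    private final long pauseThresholdBytes;
    private final long resumeThresholdBytes;
    private final Set<String> connections = new HashSet<>();
    private long pendingBytes;
    private boolean paused;

    IoThreadPublishBuffer(long pauseThresholdBytes) {
        this.pauseThresholdBytes = pauseThresholdBytes;
        // Assumption: resume once pending bytes fall to half of the limit.
        this.resumeThresholdBytes = pauseThresholdBytes / 2;
    }

    /** Registers a connection served by this IO thread. */
    void register(String connectionId) {
        connections.add(connectionId);
    }

    /** Called when a publish request of the given size arrives. */
    void onPublishReceived(long bytes) {
        pendingBytes += bytes;
        if (!paused && pendingBytes > pauseThresholdBytes) {
            paused = true; // every connection on this thread stops reading
        }
    }

    /** Called when a pending publish has been persisted or has failed. */
    void onPublishCompleted(long bytes) {
        pendingBytes -= bytes;
        if (paused && pendingBytes <= resumeThresholdBytes) {
            paused = false; // connections on this thread resume reading
        }
    }

    /** Gauge value: connections currently paused by this buffer. */
    int throttledConnections() {
        return paused ? connections.size() : 0;
    }
}
```

Summing `throttledConnections()` over all IO threads would give a broker‑wide gauge that is non‑zero only while this specific backpressure path is active.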
In production, this makes it harder to tell whether producers are being slowed
down specifically because of IO‑thread publish buffer backpressure. Users may
see only increased publish latency or a drop in throughput, and cannot directly
tell whether the cause is:
- connection‑level throttling,
- topic/broker publish rate limiting,
- IO‑thread publish buffer pressure,
- slow BookKeeper writes,
- or another backpressure path.
I think exposing this as a dedicated metric could be useful. For example:
- `pulsar_broker_publish_buffer_throttled_connections`
- or an OpenTelemetry‑style metric like:
`pulsar.broker.connection.rate_limit.active.count{reason="io_thread_publish_buffer"}`
Such a metric would help users directly observe whether IO‑thread publish
buffer backpressure is active. This would make it easier to decide whether to:
- increase `maxMessagePublishBufferSizeInMB`,
- tune `numIOThreads`,
- investigate BookKeeper add‑entry latency,
- or look for other broker‑side publish bottlenecks.
My personal feeling is that this would fill an observability gap. The existing
paused/resumed rate‑limit counters are useful, but they represent events and
are not specific to the IO‑thread publish buffer reason. A current‑state metric
for this specific backpressure condition may be easier to use in dashboards and
alerts.
I would like to hear the community’s opinion:
1. Do you think this kind of metric is useful?
2. Should it be added as a dedicated publish‑buffer metric, or as a more
generic active rate‑limit metric with a reason label?
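For the second option, a rough Java sketch of a generic active rate‑limit gauge keyed by reason might look like the following. Everything here is hypothetical: the class `ActiveRateLimitGauge`, the `Reason` values other than the publish‑buffer one, and the way the label string is derived. It only illustrates how a current‑state count differs from paused/resumed event counters.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical current-state gauge: how many connections are paused right now,
 * broken down by the reason they were paused. Not an existing Pulsar class.
 */
class ActiveRateLimitGauge {
    /** Illustrative reasons; only the publish-buffer one comes from this proposal. */
    enum Reason {
        CONNECTION_LIMIT, PUBLISH_RATE, IO_THREAD_PUBLISH_BUFFER;

        /** Label value, e.g. "io_thread_publish_buffer". */
        String label() {
            return name().toLowerCase(Locale.ROOT);
        }
    }

    private final Map<Reason, Integer> active = new EnumMap<>(Reason.class);

    /** A connection was paused for the given reason. */
    synchronized void paused(Reason reason) {
        active.merge(reason, 1, Integer::sum);
    }

    /** A connection paused for the given reason was resumed. */
    synchronized void resumed(Reason reason) {
        // Returning null removes the entry, so the count never goes negative.
        active.computeIfPresent(reason, (k, v) -> v > 1 ? v - 1 : null);
    }

    /** Current number of paused connections for the given reason. */
    synchronized int activeCount(Reason reason) {
        return active.getOrDefault(reason, 0);
    }
}
```

A dedicated publish‑buffer metric would be the special case where only `IO_THREAD_PUBLISH_BUFFER` is tracked; the labelled form costs a little more bookkeeping but lets one dashboard panel compare all pause reasons side by side.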
If the community agrees that this is useful, I would be happy to help prepare a
PR.
GitHub link: https://github.com/apache/pulsar/discussions/25904