GitHub user Radiancebobo created a discussion: Should we expose an IO‑thread 
level publish buffer backpressure metric?

Currently, connection‑level publish backpressure already has a metric: 
```
pulsar_broker_throttled_connections
```

This is useful for observing throttling caused by per‑connection limits.

However, the publish buffer protection controlled by 
`maxMessagePublishBufferSizeInMB` is applied at the IO‑thread level. When the 
pending publish bytes on an IO thread exceed the configured threshold, the 
broker pauses connections on that IO thread to protect memory usage. From what 
I understand, this state is currently maintained inside the broker, but it is 
**not directly visible as an independent metric**.
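
As a rough sketch of the state in question (a simplified model, not the broker's actual code): the class name `IoThreadPublishBuffer`, the two thresholds, and the shared gauge are all hypothetical and only illustrate how a per‑IO‑thread "throttled" flag could be mirrored into a current‑state metric.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model, NOT Pulsar code: pending publish bytes for one IO thread. */
final class IoThreadPublishBuffer {
    private final long pauseThresholdBytes;   // assumed: pause when pending bytes exceed this
    private final long resumeThresholdBytes;  // assumed: resume once pending bytes drop to this
    private final AtomicInteger throttledThreadsGauge; // shared by all IO threads, read by metrics
    private long pendingBytes;                // only touched from the owning IO thread
    private boolean throttled;

    IoThreadPublishBuffer(long pauseThresholdBytes, long resumeThresholdBytes,
                          AtomicInteger throttledThreadsGauge) {
        this.pauseThresholdBytes = pauseThresholdBytes;
        this.resumeThresholdBytes = resumeThresholdBytes;
        this.throttledThreadsGauge = throttledThreadsGauge;
    }

    /** Called when a publish request arrives on this IO thread. */
    void onPublishReceived(long bytes) {
        pendingBytes += bytes;
        if (!throttled && pendingBytes > pauseThresholdBytes) {
            throttled = true; // the broker would pause connections on this thread here
            throttledThreadsGauge.incrementAndGet();
        }
    }

    /** Called when a publish completes and its bytes are released. */
    void onPublishCompleted(long bytes) {
        pendingBytes -= bytes;
        if (throttled && pendingBytes <= resumeThresholdBytes) {
            throttled = false; // the broker would resume connections here
            throttledThreadsGauge.decrementAndGet();
        }
    }

    boolean isThrottled() {
        return throttled;
    }
}
```

In this model the gauge only changes on state transitions, so its value at scrape time is the number of IO threads currently under publish buffer backpressure.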

In production, this makes it harder to tell whether producers are being slowed 
down specifically because of IO‑thread publish buffer backpressure. Users may 
only see increased publish latency or throughput drops, but cannot directly 
distinguish whether the cause is:

- connection‑level throttling,
- topic/broker publish rate limiting,
- IO‑thread publish buffer pressure,
- slow BookKeeper writes,
- or another backpressure path.

I think exposing this as a dedicated metric could be useful. For example:

- `pulsar_broker_publish_buffer_throttled_connections`
- or an OpenTelemetry‑style metric like
  `pulsar.broker.connection.rate_limit.active.count{reason="io_thread_publish_buffer"}`
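
To make the second option more concrete, here is a minimal sketch of a generic active rate‑limit gauge with a reason label. Everything in it is an assumption for illustration: the class `ActiveRateLimitGauges`, the `Reason` values, and the Prometheus‑style name `pulsar_broker_connection_rate_limit_active_count` do not exist in Pulsar today.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical registry of "currently throttled" counts, keyed by throttling reason. */
final class ActiveRateLimitGauges {
    /** Illustrative reasons only; the real set would be decided in the PR. */
    enum Reason {
        CONNECTION_LIMIT("connection_limit"),
        IO_THREAD_PUBLISH_BUFFER("io_thread_publish_buffer");

        final String label;

        Reason(String label) {
            this.label = label;
        }
    }

    // Assumed metric name, written in Prometheus naming style.
    private static final String METRIC = "pulsar_broker_connection_rate_limit_active_count";

    private final Map<Reason, AtomicLong> active = new EnumMap<>(Reason.class);

    ActiveRateLimitGauges() {
        // Pre-populate so every reason is always exported, even when its value is 0.
        for (Reason r : Reason.values()) {
            active.put(r, new AtomicLong());
        }
    }

    /** A connection became throttled for the given reason. */
    void markActive(Reason reason) {
        active.get(reason).incrementAndGet();
    }

    /** A connection stopped being throttled for the given reason. */
    void markInactive(Reason reason) {
        active.get(reason).decrementAndGet();
    }

    /** Renders one sample line per reason in Prometheus text exposition format. */
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Reason, AtomicLong> e : active.entrySet()) {
            sb.append(METRIC)
              .append("{reason=\"").append(e.getKey().label).append("\"} ")
              .append(e.getValue().get())
              .append('\n');
        }
        return sb.toString();
    }
}
```

With a shape like this, a dashboard or alert could filter on `reason="io_thread_publish_buffer"` while other backpressure paths reuse the same metric name.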

Such a metric would help users directly observe whether IO‑thread publish 
buffer backpressure is active. This would make it easier to decide whether to:

- increase `maxMessagePublishBufferSizeInMB`,
- tune `numIOThreads`,
- investigate BookKeeper add‑entry latency,
- or look for other broker‑side publish bottlenecks.

My personal feeling is that this would fill an observability gap. The existing 
paused/resumed rate‑limit counters are useful, but they represent events and 
are not specific to the IO‑thread publish buffer reason. A current‑state metric 
for this specific backpressure condition may be easier to use in dashboards and 
alerts.

I would like to hear the community’s opinion:

1. Do you think this kind of metric is useful?
2. Should it be added as a dedicated publish‑buffer metric, or as a more 
generic active rate‑limit metric with a reason label?

If the community agrees that this is useful, I would be happy to help prepare a 
PR.

GitHub link: https://github.com/apache/pulsar/discussions/25904
