shibd commented on code in PR #23945: URL: https://github.com/apache/pulsar/pull/23945#discussion_r1974929601
##########
pip/pip-406.md:
##########
@@ -0,0 +1,152 @@
+# PIP-406: Introduce metrics related to dispatch_throttled_count
+
+# Background knowledge
+
+## Motivation
+
+Currently, users can monitor subscription backlogs using the `pulsar_subscription_back_log_no_delayed` metric.
+However, if [dispatch throttling](https://pulsar.apache.org/docs/next/concepts-throttling/) is configured at the broker/topic/subscription level,
+this metric may not accurately reflect whether the backlog is due to insufficient consumer capacity, as it could be caused by dispatch throttling.
+
+## Goals
+
+Introduce metrics that count `messages/bytes throttled` by the **broker/topic/subscription** level rate limits.
+This allows users to write PromQL queries to identify subscriptions with high backlogs but low or no throttling, pinpointing backlogs caused by insufficient consumer capacity.
+
+## In Scope
+
+Broker Level:
+- Introduce the metric `pulsar_broker_dispatch_throttled_msg_count` to represent the total count of messages throttled for a broker.
+- Introduce the metric `pulsar_broker_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a broker.
+
+Topic Level:
+- Introduce the metric `pulsar_dispatch_throttled_msg_count` to represent the total count of messages throttled for a topic.
+- Introduce the metric `pulsar_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a topic.
+
+Subscription Level:
+- Introduce the metric `pulsar_subscription_dispatch_throttled_msg_count` to represent the total count of messages throttled for a subscription.
+- Introduce the metric `pulsar_subscription_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a subscription.
+
+## Out of Scope
+- These counters are not persistent and will reset upon broker restart / topic reload / subscription reconnect.
+
+# High Level Design
+1. Maintain `dispatchThrottleMsgCount` and `dispatchThrottleBytesCount` in `DispatchRateLimiter`. Increase these values in the `consumeDispatchQuota` method when the token bucket for messages or bytes is insufficient.
+2. Output these fields when retrieving metrics.
+
+# Detailed Design
+
+## Design & Implementation Details
+1. Maintain `dispatchThrottleMsgCount` and `dispatchThrottleBytesCount` in `DispatchRateLimiter`:
+```java
+    private final LongAdder dispatchThrottleMsgCount = new LongAdder();
+    private final LongAdder dispatchThrottleBytesCount = new LongAdder();
+```
+
+2. During each [consumeDispatchQuota](https://github.com/apache/pulsar/blob/c4cff0ab3dac169c0a1418ef2f63f61604f6278e/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/DispatchRateLimiter.java#L97-L104) call,
+if a token bucket is insufficient, increase the corresponding field:
+```diff
+ public void consumeDispatchQuota(long numberOfMessages, long byteSize) {
+     AsyncTokenBucket localDispatchRateLimiterOnMessage = dispatchRateLimiterOnMessage;
+     if (numberOfMessages > 0 && localDispatchRateLimiterOnMessage != null) {
+-        localDispatchRateLimiterOnMessage.consumeTokens(numberOfMessages);
++        if (!localDispatchRateLimiterOnMessage.consumeTokensAndCheckIfContainsTokens(numberOfMessages)) {
++            dispatchThrottleMsgCount.increment();
++        }
+     }
+     AsyncTokenBucket localDispatchRateLimiterOnByte = dispatchRateLimiterOnByte;
+     if (byteSize > 0 && localDispatchRateLimiterOnByte != null) {
+-        localDispatchRateLimiterOnByte.consumeTokens(byteSize);
++        if (!localDispatchRateLimiterOnByte.consumeTokensAndCheckIfContainsTokens(byteSize)) {
++            dispatchThrottleBytesCount.increment();
++        }

Review Comment:
Yes, from the name of the metric, it does seem ambiguous. Initially I added the counter during reading, but after thinking it over I realized that a metric like `throttled_count` can only represent the `degree` of throttling.
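The counting approach discussed above (incrementing only when token consumption finds the bucket exhausted) can be sketched as follows. `SimpleTokenBucket` is a hypothetical stand-in for Pulsar's `AsyncTokenBucket`, and `DispatchRateLimiterSketch` is an illustrative class, not the actual `DispatchRateLimiter`:

```java
import java.util.concurrent.atomic.LongAdder;

// Stand-in for Pulsar's AsyncTokenBucket: tokens are consumed unconditionally
// (the balance may go negative), and the caller learns whether any tokens remained.
class SimpleTokenBucket {
    private long tokens;

    SimpleTokenBucket(long initialTokens) {
        this.tokens = initialTokens;
    }

    boolean consumeTokensAndCheckIfContainsTokens(long n) {
        tokens -= n;
        return tokens > 0;
    }
}

// Illustrative rate limiter mirroring the PIP's change: the LongAdder counters
// are incremented only when the corresponding bucket runs out during consumption.
class DispatchRateLimiterSketch {
    private final LongAdder dispatchThrottleMsgCount = new LongAdder();
    private final LongAdder dispatchThrottleBytesCount = new LongAdder();
    private final SimpleTokenBucket msgBucket;
    private final SimpleTokenBucket byteBucket;

    DispatchRateLimiterSketch(long msgTokens, long byteTokens) {
        this.msgBucket = new SimpleTokenBucket(msgTokens);
        this.byteBucket = new SimpleTokenBucket(byteTokens);
    }

    void consumeDispatchQuota(long numberOfMessages, long byteSize) {
        if (numberOfMessages > 0
                && !msgBucket.consumeTokensAndCheckIfContainsTokens(numberOfMessages)) {
            dispatchThrottleMsgCount.increment();
        }
        if (byteSize > 0
                && !byteBucket.consumeTokensAndCheckIfContainsTokens(byteSize)) {
            dispatchThrottleBytesCount.increment();
        }
    }

    long throttledMsgCount() {
        return dispatchThrottleMsgCount.sum();
    }

    long throttledBytesCount() {
        return dispatchThrottleBytesCount.sum();
    }
}
```

This also illustrates the "degree of throttling" point: with 10 message tokens, five reads of 4 messages each increment the counter three times (the third, fourth, and fifth reads find the bucket exhausted), so the value depends on how often reads occur, not only on how much was throttled.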
So I added it when `consuming tokens` rather than before reading messages: if a limit throttles, there will definitely be a token shortage when consuming, and we can increment the metric at that point. Whether it solves the frequent-reading issue wasn't my main focus at the time.

We can discuss this: for a message `throttling count` metric, we can't obtain an exact value, right? Even with the new code, where you've changed the logic to `delay the re-read when messages can't be read at all`, counting the occurrences there can still only represent a `degree` of throttling, because the count is also `related to the number of reads`.

I don't have any strong suggestions (I think adding it in either place would be fine). If you can share more insights, I can learn from them. For now, I'll follow your suggestion and add the metric during reading.
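The PIP's Goals section mentions using PromQL to find subscriptions whose backlog is not explained by throttling. A sketch of such a query, assuming the new subscription-level counters carry the same topic/subscription labels as the backlog metric (the backlog threshold and rate window are illustrative values, not part of the proposal):

```promql
# Subscriptions with a large backlog whose dispatch-throttle counters are not
# increasing: the backlog is likely due to insufficient consumer capacity.
pulsar_subscription_back_log_no_delayed > 10000
  and
rate(pulsar_subscription_dispatch_throttled_msg_count[5m]) == 0
  and
rate(pulsar_subscription_dispatch_throttled_bytes_count[5m]) == 0
```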
