lhotari commented on code in PR #23945: URL: https://github.com/apache/pulsar/pull/23945#discussion_r1985463006
########## pip/pip-406.md: ##########

@@ -0,0 +1,152 @@
+# PIP-406: Introduce metrics related to dispatch_throttled_count
+
+# Background knowledge
+
+## Motivation
+
+Currently, users can monitor subscription backlogs using the `pulsar_subscription_back_log_no_delayed` metric.
+However, if [dispatch throttling](https://pulsar.apache.org/docs/next/concepts-throttling/) is configured at the broker/topic/subscription level,
+this metric may not accurately reflect whether the backlog is due to insufficient consumer capacity, as it could be caused by dispatch throttling.
+
+## Goals
+
+Introduce metrics to indicate the count of `messages/bytes throttled` by **broker/topic/subscription** level rate limits.
+This allows users to write PromQL queries to identify subscriptions with high backlogs but low or no throttling, pinpointing backlogs caused by insufficient consumer capacity.
+
+## In Scope
+
+Broker Level:
+- Introduce the metric `pulsar_broker_dispatch_throttled_msg_count` to represent the total count of messages throttled for a broker.
+- Introduce the metric `pulsar_broker_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a broker.
+
+Topic Level:
+- Introduce the metric `pulsar_dispatch_throttled_msg_count` to represent the total count of messages throttled for a topic.
+- Introduce the metric `pulsar_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a topic.
+
+Subscription Level:
+- Introduce the metric `pulsar_subscription_dispatch_throttled_msg_count` to represent the total count of messages throttled for a subscription.
+- Introduce the metric `pulsar_subscription_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a subscription.
+
+## Out of Scope
+- These counters are not persistent and will reset upon broker restart, topic reload, or subscription reconnect.
+
+# High Level Design
+1. Maintain `dispatchThrottleMsgCount` and `dispatchThrottleBytesCount` in `DispatchRateLimiter`. Increase these values in the `consumeDispatchQuota` method when the token bucket for messages or bytes is insufficient.
+2. Output these fields when retrieving metrics.
+
+# Detailed Design
+
+## Design & Implementation Details
+1. Maintain `dispatchThrottleMsgCount` and `dispatchThrottleBytesCount` in `DispatchRateLimiter`:
+```java
+    private final LongAdder dispatchThrottleMsgCount = new LongAdder();
+    private final LongAdder dispatchThrottleBytesCount = new LongAdder();
+```
+
+2. During each call to [consumeDispatchQuota](https://github.com/apache/pulsar/blob/c4cff0ab3dac169c0a1418ef2f63f61604f6278e/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/DispatchRateLimiter.java#L97-L104),
+if the token bucket is insufficient, increase these fields accordingly.
+```diff
+ public void consumeDispatchQuota(long numberOfMessages, long byteSize) {
+     AsyncTokenBucket localDispatchRateLimiterOnMessage = dispatchRateLimiterOnMessage;
+     if (numberOfMessages > 0 && localDispatchRateLimiterOnMessage != null) {
+-        localDispatchRateLimiterOnMessage.consumeTokens(numberOfMessages);
++        if (!localDispatchRateLimiterOnMessage.consumeTokensAndCheckIfContainsTokens(numberOfMessages)) {
++            dispatchThrottleMsgCount.increment();
++        }
+     }
+     AsyncTokenBucket localDispatchRateLimiterOnByte = dispatchRateLimiterOnByte;
+     if (byteSize > 0 && localDispatchRateLimiterOnByte != null) {
+-        localDispatchRateLimiterOnByte.consumeTokens(byteSize);
++        if (!localDispatchRateLimiterOnByte.consumeTokensAndCheckIfContainsTokens(byteSize)) {
++            dispatchThrottleBytesCount.increment();
++        }
+```

Review Comment:
   @shibd I think that the problem hasn't been addressed.

   > For the subscription metrics implementation, your understanding is correct: **the subscription throttling metrics only represent the impact of SubscriptionRateLimit**. I intentionally designed it this way. The reasons are:
   > 1. If it were designed to increase with subscription, topic, and broker impacts, users would find it difficult to determine which dimension's rate limit configuration is causing the issue.

   This is a valid concern, and there are ways to address it.

   > 2. In the current design, it can be understood this way: `when users see that the broker or topic level has already been throttled, the subscription throttling metrics become irrelevant, as it indicates that the broker or topic configuration is relatively small, and the first step should be addressing the broker or topic level configuration`

   I don't think that this holds, since "when users see that the broker or topic level has already been throttled" doesn't make sense in reality. How would a user "see" that? Metrics should be capturing the information so that there isn't a need to guess.

   The metrics should make it possible to notice whether a subscription gets throttled or not. The reason (subscription level, topic level, broker level) could be tracked separately.

   One simple way is to track all counters at the subscription level and aggregate them upwards to the topic and broker level. This way there's no need to duplicate counters at runtime for stats; many stats have already been implemented this way. In dashboards, it's possible to calculate sums efficiently directly from the subscription-level metrics. There's no need for Prometheus or OTel metrics for dispatch throttling at the topic or broker level when the subscription-level metric has labels that allow aggregating the values at the topic or broker level.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
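The "track counters only at the subscription level, record the reason separately, aggregate upwards on read" approach the reviewer describes could be sketched as follows. This is a minimal illustration, not Pulsar code: `ThrottleReason`, `SubscriptionThrottleStats`, and `ThrottleStatsAggregator` are hypothetical names, and only the message counter is shown.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Which rate limiter ran out of tokens; exposed as a metric label.
enum ThrottleReason { SUBSCRIPTION_LIMIT, TOPIC_LIMIT, BROKER_LIMIT }

// Per-subscription counters, one LongAdder per reason, so each throttle
// event is attributed to the dimension whose limit caused it while
// updates stay cheap on the dispatch path.
class SubscriptionThrottleStats {
    private final Map<ThrottleReason, LongAdder> throttledMsgCount =
            new EnumMap<>(ThrottleReason.class);

    SubscriptionThrottleStats() {
        for (ThrottleReason reason : ThrottleReason.values()) {
            throttledMsgCount.put(reason, new LongAdder());
        }
    }

    void recordThrottledMsg(ThrottleReason reason) {
        throttledMsgCount.get(reason).increment();
    }

    long throttledMsgCount(ThrottleReason reason) {
        return throttledMsgCount.get(reason).sum();
    }

    long totalThrottledMsgCount() {
        return throttledMsgCount.values().stream()
                .mapToLong(LongAdder::sum)
                .sum();
    }
}

// Topic- and broker-level values are derived on read by summing the
// subscription-level counters; no duplicate counters are maintained.
final class ThrottleStatsAggregator {
    static long totalThrottledMsgCount(List<SubscriptionThrottleStats> subscriptions) {
        return subscriptions.stream()
                .mapToLong(SubscriptionThrottleStats::totalThrottledMsgCount)
                .sum();
    }
}
```

A metrics exporter would then only need to publish the subscription-level counters with subscription/topic/reason labels; dashboard queries can compute the topic- and broker-level sums from those labels, matching how the comment above says many existing stats are already aggregated.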
