[DISCUSS] PIP-310: Support custom publish rate limiters
Hi,

Currently, there are only two kinds of publish rate limiters - polling based and precise. Users have the option to use either one in the topic publish rate limiter, but the resource group rate limiter only uses the polling one.

There are challenges with both rate limiters, and with the fact that we can't use the precise rate limiter at the resource group level. Thus, in order to support custom rate limiters, I've created PIP-310.

This is the discussion thread. Please go through the PIP and provide your inputs.

Link - https://github.com/apache/pulsar/pull/21399

Regards
--
Girish Sharma
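For readers not familiar with the two built-in behaviors, the difference can be sketched roughly as follows. This is a simplified illustration, not the broker's actual code; the class and method names are invented:

```java
// Simplified sketch contrasting the two built-in behaviors (invented names,
// not the broker's actual classes).
class PublishLimiters {

    /** Polling style: publishes are always accepted; a periodic check reacts late. */
    static class PollingLimiter {
        final long limitBytes;
        long usedBytes;
        boolean throttled;

        PollingLimiter(long limitBytes) { this.limitBytes = limitBytes; }

        /** Never rejects mid-period, so a burst can overshoot the limit freely. */
        void record(long bytes) { usedBytes += bytes; }

        /** Scheduler tick: only now does the limiter notice the overshoot. */
        void tick() {
            throttled = usedBytes > limitBytes;
            usedBytes = 0;
        }
    }

    /** Precise style: rejects the moment the period's byte budget is exhausted. */
    static class PreciseLimiter {
        final long limitBytes;
        long usedBytes;

        PreciseLimiter(long limitBytes) { this.limitBytes = limitBytes; }

        boolean tryRecord(long bytes) {
            if (usedBytes + bytes > limitBytes) {
                return false; // strict cut-off: no bursting past the limit
            }
            usedBytes += bytes;
            return true;
        }

        void tick() { usedBytes = 0; }
    }

    public static void main(String[] args) {
        PollingLimiter polling = new PollingLimiter(100);
        polling.record(500); // accepted, even though it is 5x the limit
        polling.tick();
        System.out.println("polling throttled after the fact: " + polling.throttled);

        PreciseLimiter precise = new PreciseLimiter(100);
        System.out.println("precise accepts 100: " + precise.tryRecord(100));
        System.out.println("precise accepts 1 more: " + precise.tryRecord(1));
    }
}
```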
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Replied in PR.

On Thu, Oct 19, 2023 at 3:51 PM Girish Sharma wrote: > [quoted message trimmed]
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish,

In order to address the problem described in your PIP document [1], it might be necessary to make improvements in how rate limiters apply backpressure in Pulsar.

Pulsar mainly uses TCP/IP connection-level controls for achieving backpressure. The challenge is that Pulsar can share a single TCP/IP connection across multiple producers and consumers. Because of this, there can be multiple producers, consumers, and rate limiters operating on the same connection on the broker, and they will make conflicting decisions, which results in undesired behavior.

Regarding the shared TCP/IP connection backpressure issue, Apache Flink had a somewhat similar problem before Flink 1.5. It is described in the "inflicting backpressure" section of this blog post from 2019:
https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
Flink solved the issue of multiplexing multiple streams of data on a single TCP/IP connection in Flink 1.5 by introducing its own flow control mechanism.

The backpressure and rate limiting challenges have been discussed a few times in Pulsar community meetings over the past years. There was also a generic backpressure thread on the dev mailing list [2] in Sep 2022. However, we haven't really documented Pulsar's backpressure design, how rate limiters are part of the overall solution, and how we could improve it. I think it might be time to do so, since there's a requirement to improve rate limiting. I guess that's the main motivation for PIP-310 as well.

-Lari

1 - https://github.com/apache/pulsar/pull/21399/files
2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271

On 2023/10/19 12:51:14 Girish Sharma wrote: > [quoted message trimmed]
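The conflicting-decisions problem described above can be sketched with a small simulation (invented names, not Pulsar's actual classes): two per-topic limiters both toggle the auto-read flag of the one connection they share, and the later decision silently overrides the earlier one.

```java
// Illustrative simulation (invented names, not Pulsar's actual classes):
// two per-topic rate limiters sharing one connection make conflicting
// autoRead decisions.
class SharedConnectionDemo {

    /** Stand-in for a Netty channel shared by several producers. */
    static class Connection {
        boolean autoRead = true;
    }

    /** A per-topic limiter that toggles the *shared* connection's autoRead. */
    static class TopicRateLimiter {
        final long bytesPerSecond;
        long usedThisSecond;

        TopicRateLimiter(long bytesPerSecond) { this.bytesPerSecond = bytesPerSecond; }

        void onPublish(Connection conn, long bytes) {
            usedThisSecond += bytes;
            // Each limiter decides for the whole connection, seeing only its topic.
            conn.autoRead = usedThisSecond <= bytesPerSecond;
        }
    }

    /** Returns the connection's final autoRead state after both topics publish. */
    static boolean publishOnSharedConnection() {
        Connection conn = new Connection();
        TopicRateLimiter topicA = new TopicRateLimiter(100); // topic A: limit 100 B/s
        TopicRateLimiter topicB = new TopicRateLimiter(100); // topic B: limit 100 B/s
        topicA.onPublish(conn, 150); // A exceeds its limit -> pauses the connection
        topicB.onPublish(conn, 50);  // B is under its limit -> re-enables reads,
                                     // undoing A's decision on the same connection
        return conn.autoRead;
    }

    public static void main(String[] args) {
        // Despite topic A being over its limit, the connection keeps reading.
        System.out.println("autoRead after both topics: " + publishOnSharedConnection());
    }
}
```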
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari,

Thanks for bringing this to my attention. I went through the links, but does this sharing of the same TCP/IP connection happen across partitions as well (assuming both partitions of the topic are on the same broker)? i.e. would a producer at 127.0.0.1 for partition `persistent://tenant/ns/topic0-partition0` and a producer at 127.0.0.1 for partition `persistent://tenant/ns/topic0-partition1` share the same TCP/IP connection, assuming both partitions are on broker-0?

In general, the major use case behind this PIP for me and my organization is supporting produce spikes. We do not want to allocate the absolute maximum throughput for a topic when it would not even be utilized 99.99% of the time. For a topic that stays constantly at 100MBps and goes to 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of resources 100% of the time. The poller-based rate limiter is also not good here, as it would allow overuse of the hardware without a check, degrading the system.

@Asif, I have been sick these last 10 days, but will be updating the PIP with the discussed changes early next week.

Regards

On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari wrote: > [quoted message trimmed]
-- Girish Sharma
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish,

There is also a discussion thread[1] about rate-limiting.

I think there are conflicts between some kinds of rate limiters and backpressure. Take the fail-fast strategy as an example: brokers have to reply to clients after receiving and decoding a message, but the broker also has the backpressure mechanism - the broker cannot read messages while the channel's auto-read is disabled (`disableAutoRead`). So the rate limiters have to adapt to backpressure.

Thanks,
Tao Jiuming

On Oct 19, 2023 at 20:51, Girish Sharma wrote: > [quoted message trimmed]
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Tao,

As I understand it, there is a fine balance between rate limiting, backpressure, and not keeping clients waiting. Different use cases may need different approaches to rate limiting, and thus, making the rate limiter customizable is my first step towards making Pulsar more customizable as per need.

Regards

On Fri, Nov 3, 2023 at 5:42 PM 太上玄元道君 wrote: > [quoted message trimmed]

-- Girish Sharma
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Tao,

You seem to have missed sending the link that you were referring to. Are you referring to the thread "[discuss] Support fail-fast strategy when broker rate-limited" [1]?

-Lari

1 - https://lists.apache.org/thread/tp2f1ghomj2kw5ltgz8w8k5s58gs88qz

On 2023/11/03 12:11:31 太上玄元道君 wrote: > [quoted message trimmed]
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish,

Thanks for the questions. I'll reply to them.

> does this sharing of the same tcp/ip connection happen across partitions as
> well (assuming both the partitions of the topic are on the same broker)?
> i.e. producer 127.0.0.1 for partition
> `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> partition `persistent://tenant/ns/topic0-partition1` share the same tcp/ip
> connection assuming both are on broker-0 ?

The Pulsar Java client would be sharing the same TCP/IP connection to a single broker when using the default setting of connectionsPerBroker = 1. It could be using a different connection if connectionsPerBroker > 1.

> In general, the major use case behind this PIP for me and my organization
> is about supporting produce spikes. We do not want to allocate absolute
> maximum throughput for a topic which would not even be utilized 99.99% of
> the time. Thus, for a topic that stays constantly at 100MBps and goes to
> 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
> resources 100% of the time. The poller based rate limiter is also not good
> here as it would allow over use of hardware without a check, degrading the
> system.

I'm trying to understand your use case better. In the PIP-310 document it says:

> Precise rate limiter fixes the above two issues, but introduces another
> challenge - the rate limiting is too strict.
> * This leads to potential data loss in case there are sudden spikes in
>   produce and client-side produce queue breaches.
> * The produce latencies increase exponentially in case produce breaches the
>   set throughput even for small windows.

Could you please elaborate more on these details? Here are some questions:
1. What do you mean that it is too strict?
   - Should the rate limiting allow bursting over the limit for some time?
2. What type of data loss are you experiencing?
3. What is the root cause of the data loss?
   - Do you mean that the system performance degrades, and data loss happens because messages cannot be forwarded from the client to the broker quickly enough?

Once there's a common understanding of the problem, it's easier to design the solution together in the Pulsar community. One possibility is that PIP-310 already solves your problem. Another possibility is that we need to improve producer flow control. I currently feel that it is needed, but I might be wrong.

As mentioned in my previous email, there have been discussions about improving producer flow control. One of the solution ideas discussed in a Pulsar community meeting in January was to add explicit flow control to producers, somewhat similar to how there are "permits" as the flow control for consumers. The permits would be based on byte size (instead of number of messages). With explicit flow control in the protocol, rate limiting would also be effective and deterministic, and the issues that Tao Jiuming was explaining could also be resolved. It would also address producer/consumer multiplexing on a single TCP/IP connection, since flow control and rate limiting wouldn't be based on the TCP/IP level (and toggling the Netty channel's auto-read property).

Let's continue the discussion, since I think this is an important improvement area. Together we could find a good solution that works for multiple use cases and addresses existing challenges in producer flow control and rate limiting.

-Lari

On 2023/11/03 11:16:37 Girish Sharma wrote: > [quoted message trimmed]
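The byte-size "permits" idea sketched above is not part of the Pulsar protocol today; the names below are invented purely to illustrate how such producer-side flow control could look: the broker grants a byte budget, and the producer reserves from it before each send.

```java
// Hypothetical sketch of byte-based producer "permits" (invented names; no
// such protocol command exists in Pulsar today).
import java.util.concurrent.atomic.AtomicLong;

class ProducerPermitsDemo {

    /** Tracks how many bytes the producer may still send. */
    static class ProducerPermits {
        private final AtomicLong availableBytes = new AtomicLong();

        /** Broker grants more byte permits (e.g. once per rate-limiter period). */
        void grant(long bytes) { availableBytes.addAndGet(bytes); }

        /** Producer reserves permits before sending; false = hold the message. */
        boolean tryReserve(long messageBytes) {
            while (true) {
                long current = availableBytes.get();
                if (current < messageBytes) {
                    return false; // out of budget: wait for the next grant
                }
                if (availableBytes.compareAndSet(current, current - messageBytes)) {
                    return true;
                }
            }
        }

        long available() { return availableBytes.get(); }
    }

    public static void main(String[] args) {
        ProducerPermits permits = new ProducerPermits();
        permits.grant(1024);                          // broker: "you may send 1 KiB"
        System.out.println(permits.tryReserve(600));  // enough budget
        System.out.println(permits.tryReserve(600));  // only 424 bytes left
    }
}
```

With an explicit budget like this, the producer knows it must slow down before a send fails, instead of discovering the limit through a paused connection and timeouts.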
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, replies inline.

On Fri, Nov 3, 2023 at 11:13 PM Lari Hotari wrote:

> The Pulsar Java client would be sharing the same TCP/IP connection to a
> single broker when using the default setting of connectionsPerBroker = 1.
> It could be using a different connection if connectionsPerBroker > 1.

Thanks for clarifying this.

> Could you please elaborate more on these details? Here are some questions:
> 1. What do you mean that it is too strict?
>    - Should the rate limiting allow bursting over the limit for some time?

That's one of the major use cases, yes.

> 2. What type of data loss are you experiencing?

Messages produced by the producers eventually get timed out due to rate limiting.

> 3. What is the root cause of the data loss?
>    - Do you mean that the system performance degrades and data loss is due
>      to not being able to produce from client to the broker quickly enough?

No, the system performance decreases in the case of poller-based rate limiters. In the precise one, it's purely the broker pausing the netty channel's auto-read property. If the producer goes beyond the set throughput for longer than the timeout duration, it starts observing timeouts, and the timed-out messages are essentially lost.

> As mentioned in my previous email, there have been discussions about
> improving producer flow control. One of the solution ideas discussed in a
> Pulsar community meeting in January was to add explicit flow control to
> producers, somewhat similar to how there are "permits" as the flow control
> for consumers. [...]

I think the core implementation of how the broker fails fast at the time of rate limiting (whether by pausing the netty channel or via a new permits-based model) does not change the actual issue I am targeting. Multiplexing has some impact on it - but yet again only limited, and it can easily be fixed by the client by increasing the connections per broker. Even assuming both these things are somehow "fixed", the fact remains that an absolutely strict rate limiter will lead to the above-mentioned data loss for bursts going above the limit, and that a poller-based rate limiter doesn't really rate limit anything, as it allows all produce in the first interval of the next second.
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Replies inline.

On Fri, 3 Nov 2023 at 20:48, Girish Sharma wrote:

> > 1. What do you mean that it is too strict?
> >    - Should the rate limiting allow bursting over the limit for some time?
>
> That's one of the major use cases, yes.

One possibility would be to improve the existing rate limiter to allow bursting. I think that Pulsar's out-of-the-box rate limiter should cover 99% of the use cases instead of having everyone implement their own rate limiter algorithm. The problems you are describing seem to be common to many Pulsar use cases, and therefore, I think they should be handled directly in Pulsar.

Optimally, there would be a single solution that abstracts the rate limiting in a way where it does the right thing based on the declarative configuration. I would prefer that over having a pluggable solution for rate limiter implementations.

What would help is getting deeper into the design of the rate limiter itself, without limiting ourselves to the existing rate limiter implementation in Pulsar.

In textbooks, there are algorithms such as "leaky bucket" [1] and "token bucket" [2]. Both algorithms have several variations, and in some ways they are very similar algorithms viewed from different points of view. It would possibly be easier to conceptualize and understand a rate limiting algorithm if the common algorithm names and implementation choices mentioned in textbooks were referenced in the implementation.

It seems that a "token bucket" type of algorithm can be used to implement rate limiting with bursting. In the token bucket algorithm, the size of the token bucket defines how large bursts will be allowed. The design could also be something where two rate limiters with different types of algorithms and/or configuration parameters are combined to achieve a desired behavior, for example, a rate limiter with bursting and a fixed maximum rate. By default, the token bucket algorithm doesn't enforce a maximum rate for bursts, but that could be achieved by chaining two rate limiters if that is really needed.

The current Pulsar rate limiter could also be implemented in a cleaner way, which would be more efficient. Instead of having a scheduler call a method once per second, I think that the rate limiter could be implemented in a reactive way where the algorithm works without a scheduler. I wonder if there are others who would be interested in getting down into such implementation details?

1 - https://en.wikipedia.org/wiki/Leaky_bucket
2 - https://en.wikipedia.org/wiki/Token_bucket

> > 2. What type of data loss are you experiencing?
>
> Messages produced by the producers which eventually get timed out due to
> rate limiting.

Are you able to slow down producing on the client side? If that is possible, there could be ways to improve client-side backpressure with the Pulsar client. Currently, the client doesn't expose this information until the sending blocks or fails with an exception (ProducerQueueIsFullError). Optimally, the client should slow down the rate of producing to the rate that it can actually send to the broker.

Just curious if you have considered turning off produce timeouts on the client side completely, or making them longer? Would that address the data loss problem? Or is your event/message source "hot" so that you cannot stop or slow it down, and it will just keep on flowing at a certain rate?

> I think the core implementation of how the broker fails fast at the time
> of rate limiting (whether it is by pausing netty channel or a new permits
> based model) does not change the actual issue I am targeting. Multiplexing
> has some impact on it - but yet again only limited, and can easily be fixed
> by the client by increasing the connections per broker. Even after assuming
> both these things are somehow "fixed", the fact remains that an absolutely
> strict rate limiter will lead to the above mentioned data loss for burst
> going above the limit and that a poller based rate limiter doesn't really
> rate limit anything as it allows all produce in the first interval of the
> next second.

Yes, it makes sense to have bursting configuration parameters in the rate limiter. As mentioned above, I think we could improve the existing rate limiter in Pulsar to cover 99% of the use cases by making it stable and by including the bursting configuration options. Is there additional functionality you feel the rate limiter needs beyond bursting support?

One way to work around the multiplexing problem would be to add a client-side option for producers and consumers, where you could specify that the client picks a separate TCP/IP connection that is not shared and isn't from the connection pool. Preventing connection multiplexing seems to be the only way to make the current rate limiting deterministic and stable without adding explicit flow control to the protocol.
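A minimal token-bucket sketch along the lines described above (time is passed in explicitly so the refill logic is deterministic and needs no scheduler thread; this is an illustration, not a proposed Pulsar class):

```java
// Minimal token bucket: capacity sets the max burst, rate sets the
// steady-state throughput. Refill happens lazily on each acquire, so no
// scheduler is needed.
class TokenBucket {
    private final long capacityBytes;   // bucket size = max allowed burst
    private final long ratePerSecond;   // steady-state refill rate
    private double tokens;
    private long lastRefillMillis;

    TokenBucket(long capacityBytes, long ratePerSecond, long nowMillis) {
        this.capacityBytes = capacityBytes;
        this.ratePerSecond = ratePerSecond;
        this.tokens = capacityBytes; // start full: an initial burst is allowed
        this.lastRefillMillis = nowMillis;
    }

    /** Refill proportionally to elapsed time, capped at the bucket capacity. */
    private void refill(long nowMillis) {
        double elapsedSeconds = (nowMillis - lastRefillMillis) / 1000.0;
        tokens = Math.min(capacityBytes, tokens + elapsedSeconds * ratePerSecond);
        lastRefillMillis = nowMillis;
    }

    /** True if the publish may proceed; false means throttle (e.g. pause reads). */
    boolean tryAcquire(long bytes, long nowMillis) {
        refill(nowMillis);
        if (tokens >= bytes) {
            tokens -= bytes;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 100 B/s steady rate, with bursts of up to 150 B allowed.
        TokenBucket limiter = new TokenBucket(150, 100, 0);
        System.out.println("burst of 150 at t=0: " + limiter.tryAcquire(150, 0));
        System.out.println("1 more byte at t=0:  " + limiter.tryAcquire(1, 0));
    }
}
```

Setting the capacity above the steady rate (e.g. a 150 MB bucket refilled at 100 MB/s) is what would absorb the occasional spike in the 100MBps/150MBps scenario above; chaining a second bucket could additionally cap the burst rate, as the message suggests.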
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
One additional question: in your use case, do you have multiple producers concurrently producing to the same topic from different clients?

That use case is challenging in the current implementation when using topic-level publish rate limiting. The problem is that the different producers will be able to send messages at very different rates, since there isn't a solution to ensure fairness across multiple producers in the topic publish rate limiting solution. This is something that should be addressed when improving rate limiting.

-Lari

On Sat, Nov 4, 2023 at 5:24 PM Lari Hotari wrote: > [quoted message trimmed]
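One possible (hypothetical) shape for the fairness question raised above: split the topic's per-period byte budget evenly across the connected producers, so a single fast producer cannot consume the entire topic budget. All names here are invented, and a real design would also need to redistribute the unused share of idle producers.

```java
// Hypothetical fair-share scheme (invented names): each producer gets an
// equal slice of the topic's per-period byte budget.
import java.util.HashMap;
import java.util.Map;

class FairTopicBudget {
    private final long topicBytesPerPeriod;
    private final int producerCount;
    private final Map<String, Long> usedByProducer = new HashMap<>();

    FairTopicBudget(long topicBytesPerPeriod, int producerCount) {
        this.topicBytesPerPeriod = topicBytesPerPeriod;
        this.producerCount = producerCount;
    }

    /** Equal share of the topic budget for each producer this period. */
    long shareFor(String producer) {
        return topicBytesPerPeriod / producerCount;
    }

    /** True if the publish fits in this producer's share; false = throttle it. */
    boolean tryPublish(String producer, long bytes) {
        long used = usedByProducer.getOrDefault(producer, 0L);
        if (used + bytes > shareFor(producer)) {
            return false; // this producer exhausted its slice, others unaffected
        }
        usedByProducer.put(producer, used + bytes);
        return true;
    }

    public static void main(String[] args) {
        FairTopicBudget budget = new FairTopicBudget(100, 2);
        System.out.println("p1 sends 50: " + budget.tryPublish("p1", 50));
        System.out.println("p1 sends 1:  " + budget.tryPublish("p1", 1));
        System.out.println("p2 sends 50: " + budget.tryPublish("p2", 50));
    }
}
```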
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Replies inline. On Sat, Nov 4, 2023 at 8:55 PM Lari Hotari wrote: > > One possibility would be to improve the existing rate limiter to allow > bursting. > I think that Pulsar's out-of-the-box rate limiter should cover 99% of the > use cases instead of having one implementing their own rate limiter > algorithm. > There are challenges in this. As explained in the PIP, there are several different usages of rate limiter, stats, unloading, etc. While I am open to having a burstable rate limiter in pulsar out of box, it might complicate things considering backward compatibility etc. More on this below. > The problems you are describing seem to be common to many Pulsar use cases, > and therefore, I think they should be handled directly in Pulsar. > I personally haven't seen many burstability related discussions; so this feature might actually not be that useful for all current Pulsar users. > > Optimally, there would be a single solution that abstracts the rate > limiting in a way where it does the right thing based on the declarative > configuration. > I would prefer that over having a pluggable solution for rate limiter > implementations. > > What would help is getting deeper in the design of the rate limiter itself, > without limiting ourselves to the existing rate limiter implementation in > Pulsar. > I would personally suggest we tackle this problem in parts so that it's available incrementally over versions rather than making the scope so big that it takes pulsar 4.0 for these features to land. > > In textbooks, there are algorithms such as "leaky bucket" [1] and "token > bucket" [2]. Both algorithms have several variations and in some ways they > are very similar algorithms but looking from the different point of view. > It would possibly be easier to conceptualize and understand a rate limiting > algorithm if common algorithm names and implementation choices mentioned in > textbooks would be referenced in the implementation. 
> It seems that a "token bucket" type of algorithm can be used to implement > rate limiting with bursting. In the token bucket algorithm, the size of the > token bucket defines how large bursts will be allowed. The design could > also be something where 2 rate limiters with different type of algorithms > and/or configuration parameters are combined to achieve a desired behavior. > For example, to achieve a rate limiter with bursting and a fixed maximum > rate. > By default, the token bucket algorithm doesn't enforce a maximum rate for > bursts, but that could be achieved by chaining 2 rate limiters if that is > really needed. > Yes, we were also thinking along the same lines once this is pluggable. The idea was to have some numbers and real-world usage backing an implementation of the rate limiter before merging it back into Pulsar. Any decision we would take right now would be limited only by theoretical discussion of the implementation and our assumption that it covers 99% of the use cases, probably just like how the precise and poller ones came into being. > > The current Pulsar rate limiter implementation could be implemented in a > cleaner way, which would also be more efficient. Instead of having a > scheduler call a method once per second, I think that the rate limiter > could be implemented in a reactive way where the algorithm is implemented > without a scheduler. > I wonder if there are others that would be interested in getting down into > such implementation details? > That would be touching the RateLimiter.java class, while my goal is to improve/touch the outer classes, mainly around the interface PublishRateLimiter.java. With this discussion taking a much bigger turn, how or where do we limit this PIP's scope? I am happy to work on follow-up PIPs which may arise out of this discussion. > Are you able to slow down producing on the client side? If that is > possible, there could be ways to improve ways to do client side back > pressure with Pulsar Client.
Currently, the client doesn't expose this > information until the sending blocks or fails with an exception > (ProducerQueueIsFullError). Optimally, the client should slow down the rate > of producing to the rate that it can actually send to the broker. > Just curious if you have considered turning off producing timeouts on the > client side completely or making them longer? Would that address the data > loss problem? > Or is your event/message source "hot" so that you cannot stop or slow it > down, and it will just keep on flowing with a certain rate? > > Mostly, the sources are hot sources and can't be slowed down. The lack of a clear error message (a client-server protocol limitation) is another issue that I was planning to tackle in another PIP. > Yes, it makes sense to have bursting configuration parameters in the rate > limiter. > As mentioned above, I think we could be improving the existing rate limiter > in Pulsar to cover 99% of the use case by making it stable and by including > the bursting
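For reference, the lazily-refilled token bucket Lari describes could be sketched roughly as below. This is purely illustrative (class and method names are hypothetical, not Pulsar code): the bucket capacity bounds the burst size, and tokens are replenished from the elapsed time on each call, so no once-per-second scheduler is needed — the "reactive" style mentioned above.

```java
// Illustrative token-bucket rate limiter (hypothetical sketch, not Pulsar code).
// Tokens refill lazily from the elapsed time on each call, so no scheduler
// thread is needed; the bucket capacity bounds how large a burst can be.
class TokenBucket {
    private final long capacity;         // maximum tokens; bounds burst size
    private final double refillPerNano;  // configured average rate
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized boolean tryAcquire(long permits) {
        long now = System.nanoTime();
        // refill proportionally to elapsed time, capped at the bucket capacity
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= permits) {
            tokens -= permits;
            return true;
        }
        return false;
    }
}
```

A full bucket lets a producer burst up to `capacity` at once; afterwards the sustained rate is limited to `tokensPerSecond`.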
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
On Sat, Nov 4, 2023 at 9:21 PM Lari Hotari wrote: > One additional question: > > In your use case, do you have multiple producers concurrently producing to > the same topic from different clients? > > The use case is challenging in the current implementation when using topic > producing rate limiting. The problem is that the different producers will > be able to send messages at very different rates since there isn't a > solution to ensure fairness across multiple producers in the topic producer > rate limiting solution. This is something that should be addressed when > improving rate limiting. > We haven't personally seen this pattern that different logical producers (app/object) are producing to the same topic. I feel like this is an anti-pattern and goes against the homogeneous nature of a topic. Even if such use cases arise, they can easily be handled by moving the different producers to different topics and having the consumers subscribe to more than one topic in case they need data from all of those topics. Regards > > -Lari > > la 4. marrask. 2023 klo 17.24 Lari Hotari kirjoitti: > > > Replies inline > > > > On Fri, 3 Nov 2023 at 20:48, Girish Sharma > > wrote: > > > >> Could you please elaborate more on these details? Here are some > questions: > >> > 1. What do you mean that it is too strict? > >> > - Should the rate limiting allow bursting over the limit for some > >> time? > >> > > >> > >> That's one of the major use cases, yes. > >> > > > > One possibility would be to improve the existing rate limiter to allow > > bursting. > > I think that Pulsar's out-of-the-box rate limiter should cover 99% of the > > use cases instead of having one implementing their own rate limiter > > algorithm. > > The problems you are describing seem to be common to many Pulsar use > > cases, and therefore, I think they should be handled directly in Pulsar.
> > > > Optimally, there would be a single solution that abstracts the rate > > limiting in a way where it does the right thing based on the declarative > > configuration. > > I would prefer that over having a pluggable solution for rate limiter > > implementations. > > > > What would help is getting deeper in the design of the rate limiter > > itself, without limiting ourselves to the existing rate limiter > > implementation in Pulsar. > > > > In textbooks, there are algorithms such as "leaky bucket" [1] and "token > > bucket" [2]. Both algorithms have several variations and in some ways > they > > are very similar algorithms but looking from the different point of view. > > It would possibly be easier to conceptualize and understand a rate > limiting > > algorithm if common algorithm names and implementation choices mentioned > in > > textbooks would be referenced in the implementation. > > It seems that a "token bucket" type of algorithm can be used to implement > > rate limiting with bursting. In the token bucket algorithm, the size of > the > > token bucket defines how large bursts will be allowed. The design could > > also be something where 2 rate limiters with different type of algorithms > > and/or configuration parameters are combined to achieve a desired > behavior. > > For example, to achieve a rate limiter with bursting and a fixed maximum > > rate. > > By default, the token bucket algorithm doesn't enforce a maximum rate for > > bursts, but that could be achieved by chaining 2 rate limiters if that is > > really needed. > > > > The current Pulsar rate limiter implementation could be implemented in a > > cleaner way, which would also be more efficient. Instead of having a > > scheduler call a method once per second, I think that the rate limiter > > could be implemented in a reactive way where the algorithm is implemented > > without a scheduler. 
> > I wonder if there are others that would be interested in getting down > into > > such implementation details? > > > > 1 - https://en.wikipedia.org/wiki/Leaky_bucket > > 2 - https://en.wikipedia.org/wiki/Token_bucket > > > > > > > 2. What type of data loss are you experiencing? > >> > > >> > >> Messages produced by the producers which eventually get timed out due to > >> rate limiting. > >> > > > > Are you able to slow down producing on the client side? If that is > > possible, there could be ways to improve ways to do client side back > > pressure with Pulsar Client. Currently, the client doesn't expose this > > information until the sending blocks or fails with an exception > > (ProducerQueueIsFullError). Optimally, the client should slow down the > rate > > of producing to the rate that it can actually send to the broker. > > Just curious if you have considered turning off producing timeouts on the > > client side completely or making them longer? Would that address the data > > loss problem? > > Or is your event/message source "hot" so that you cannot stop or slow it > > down, and it will just keep on flowing with a certain rate? > > > > > I think the core implementation of how the broker
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, Replies inline. We are getting into a very detailed discussion. We could also discuss this topic in one of the upcoming Pulsar Community meetings. However, I might miss the next meeting that is scheduled this Thursday. Although I am currently opposed to your proposal PIP-310, I am supporting solving your problems related to rate limiting. :) Let's continue the discussion since that is necessary so that we can make progress. I hope this makes sense from your perspective. On Sat, 4 Nov 2023 at 17:53, Girish Sharma wrote: > > There are challenges in this. As explained in the PIP, there are several > different usages of rate limiter, stats, unloading, etc. While I am open to > having a burstable rate limiter in pulsar out of box, it might complicate > things considering backward compatibility etc. More on this below. > I acknowledge that there are different usages, but my assumption is that we could implement a generic solution that could be configured to handle each specific use case. I haven't yet seen any evidence that the requirements in your case are so special that it justifies adding a pluggable interface for rate limiters. Exposing yet another pluggable interface in Pulsar will add complexity without gains. Each supported public interface is a maintenance burden if we care about the quality of the exposed interfaces and put effort into ensuring that the interfaces are supported in future versions. Exposing an interface will also lock down or slow down some future refactorings. There will be a need to refactor and improve rate limiters as part of the flow control and back pressure improvements within the Pulsar broker. I'd rather keep the rate limiter internal interfaces an internal implementation detail instead of leaking the details to an exposed public interface. That's why we should primarily look for a generic solution.
I hope we could put effort into looking into the characteristics of your requirements and attempt to sketch a design for a generic solution that could be configured for your purposes. > > The problems you are describing seem to be common to many Pulsar use cases, > > and therefore, I think they should be handled directly in Pulsar. > > > > I personally haven't seen many burstability related discussions; so this > feature might actually not be that useful for all current Pulsar users. In general, there are not many advanced discussions on the mailing list about the Pulsar internals. It may also be difficult for others to recognize that specific problems could be resolved by enhancing flow control, back pressure, and rate limiting/throttling mechanisms. > I would personally suggest we tackle this problem in parts so that it's > available incrementally over versions rather than making the scope so big > that it takes pulsar 4.0 for these features to land. Sure, quick delivery is everyone's goal. Before talking about schedules, we should be able to discuss the use case and the design of the desired type of rate limiting and throttling in more depth. One concrete example of this is the desired behavior of bursting. In token bucket rate limiting, bursting is about using the buffered tokens in the "token bucket" and having a configurable limit for the buffer (the "bucket"). This buffer will usually only contain tokens when the actual rate has been lower than the configured maximum rate for some duration. However, there could be an expectation for a different type of bursting which is more like "auto scaling" of the rate limit in a way where the end-to-end latency of the produced messages is taken into account. The expected behavior might be about scaling the rate temporarily to a higher rate so that the queues can be cleared and the latency of the messages being sent stays under a target latency.
The current org.apache.pulsar.broker.service.PublishRateLimiter interface cannot control aspects that would be needed to handle this type of bursting where we actually need to scale up the rate limit based on end-to-end feedback. The PublishRateLimiter interface doesn't have feedback loops currently. I don't know what "bursting" means for you. Would it be possible to provide concrete examples of desired behavior? That would be very helpful in making progress. > Yes, we were also thinking on the same terms once this is pluggable. The > idea was to have some numbers and real world usage backing an > implementation of rate limiter before merging it back into pulsar. Any > decision we would take right now would be limited only by theoretical > discussion of the implementation and our assumption that it covers 99% of > the use cases, probably just like how the precise and poller ones came into > being. This will remain a theoretical discussion without concrete examples of desired behavior or what problem we want to solve. Initially, we need to have a good grasp on the problem we want to solve. The solution might change while we experiment and iterate. It will be
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, inline once again. On Mon, Nov 6, 2023 at 5:44 PM Lari Hotari wrote: > Hi Girish, > > Replies inline. We are getting into a very detailed discussion. We > could also discuss this topic in one of the upcoming Pulsar Community > meetings. However, I might miss the next meeting that is scheduled > this Thursday. > Is this every Thursday? I am willing to meet at a separate time as well if enough folks with a viewpoint on this can meet together. I assume the community meeting has a much bigger agenda, making detailed discussions there impractical? > Although I am currently opposed to your proposal PIP-310, I am > supporting solving your problems related to rate limiting. :) > Let's continue the discussion since that is necessary so that we could > make progress. I hope this makes sense from your perspective. > > It is all good, as long as the final goal is met within reasonable timelines. > > I acknowledge that there are different usages, but my assumption is > that we could implement a generic solution that could be configured to > handle each specific use case. > I haven't yet seen any evidence that the requirements in your case are > so special that it justifies adding a pluggable interface for rate > Well, the blacklisting use case is a very specific use case. I am explaining below why that can't be done using metrics and a separate blacklisting API. > limiters. Exposing yet another pluggable interface in Pulsar will add > complexity without gains. Each supported public interface is a > maintenance burden if we care about the quality of the exposed > interfaces and put effort in ensuring that the interfaces are > supported in future versions. Exposing an interface will also lock > down or slow down some future refactorings. > This might actually be a blessing in disguise; at least for RateLimiter and PublishRateLimiter.java - being internal interfaces, they have grown out of hand, unchecked. Explained more below.
> One concrete example of this is the desired behavior of bursting. In > token bucket rate limiting, bursting is about using the buffered > tokens in the "token bucket" and having a configurable limit for the > buffer (the "bucket"). This buffer will usually only contain tokens > when the actual rate has been lower than the configured maximum rate > for some duration. > > However, there could be an expectation for a different type of > bursting which is more like "auto scaling" of the rate limit in a way > where the end-to-end latency of the produced messages > is taken into account. The expected behavior might be about scaling > the rate temporarily to a higher rate so that the queues can be > I would like to keep auto-scaling out of scope for this discussion. That opens up another huge can of worms, especially given the gaps in proper scale-down support in Pulsar. > > I don't know what "bursting" means for you. Would it be possible to > provide concrete examples of desired behavior? That would be very > helpful in making progress. > > Here are a few different use cases: - A producer(s) is producing at a near-constant rate into a topic, with equal distribution among partitions. Due to a hiccup in their downstream component, the produce rate goes to 0 for a few seconds, and thus, to compensate, in the next few seconds, the produce rate tries to double up. - In a visitor-based produce rate (where the produce rate goes up in the day and goes down at night - think of popular-website hourly view-count patterns), there are cases when, due to certain external/internal triggers, the views - and thus - the produce rate spikes for a few minutes. It is also important to keep this in check so as not to allow bots to DDoS your system; that might be the responsibility of an upstream system like an API gateway, but we cannot be completely ignorant of it either.
- In streaming systems, where there are micro-batches, there might be constant fluctuations in produce rate from time to time, based on batch failure or retries. In all of these situations, setting the throughput of the topic to be the absolute maximum of the various spikes observed during the day is very suboptimal. Moreover, in each of these situations, once bursting support is present in the system, it would also need to have proper checks in place to penalize producers who try to misuse the system. In a true multi-tenant platform, this is very critical. Thus, blacklisting actually goes hand in hand here. Explained more below. > It's interesting that you mention that you would like to improve the > PublishRateLimiter interface. > How would you change it? > > The current interface of PublishRateLimiter has duplicate methods. I assume that after the initial implementation (poller), the next implementation simply added more methods to the interface rather than reusing the existing ones. For instance, there are both `tryAcquire` and
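To make the duplication concern concrete, one possible consolidated shape is sketched below: a single admission method instead of overlapping ones. This is purely hypothetical and not the actual org.apache.pulsar.broker.service.PublishRateLimiter API; the interface and method names are invented for illustration.

```java
// Hypothetical consolidated publish-limiter interface (NOT the actual
// org.apache.pulsar.broker.service.PublishRateLimiter API). A single
// admission path replaces the overlapping methods discussed above.
@FunctionalInterface
interface PublishThrottler {
    /** Try to consume permits for one publish; false means the caller should throttle. */
    boolean tryAcquire(int messages, long bytes);
}
```

With one admission method, each implementation (polling, precise, bursting, or a resource-group limiter) only has to supply a single decision point, which keeps call sites uniform.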
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, I think we are starting to get into the concrete details of rate limiters and how we could start improving the existing feature. It is very helpful that you are sharing your insight and experience of operating Pulsar at scale. Replies inline. On Mon, 6 Nov 2023 at 15:37, Girish Sharma wrote: > Is this every Thursday? I am willing to meet at a separate time as well if > enough folks with a viewpoint on this can meet together. I assume that the > community meeting has a much bigger agenda with detailed discussions not > possible? It is bi-weekly on Thursdays. The meeting calendar, zoom link and meeting notes can be found at https://github.com/apache/pulsar/wiki/Community-Meetings . > It is all good, as long as the final goal is met within reasonable > timelines. +1 > Well, the blacklisting use case is a very specific use case. I am > explaining below why that can't be done using metrics and a separate > blacklisting API. ok. btw. "metrics" doesn't necessarily mean providing the rate limiter metrics via Prometheus. There might be other ways to provide this information for components that could react to this. For example, it could be a system topic where these rate limiters emit events. > This actually might be a blessing in disguise, at least for RateLimiter and > PublishRateLimiter.java, being an internal interface, it has gone out of > hand and unchecked. Explained more below. I agree. It is hard to reason about the existing solution. > I would like to keep auto-scaling out of scope for this discussion. That > opens up another huge can of worms, specially given the gaps in proper > scale down support in pulsar. I agree. I just brought up this example to ensure that your expectation about bursting isn't about controlling the rate limits based on situational information, such as end-to-end latency information. Such a feature could be useful, but it does complicate things.
However, I think it's good to keep this on the radar since this might be needed to solve some advanced use cases. > > I don't know what "bursting" means for you. Would it be possible to > > provide concrete examples of desired behavior? That would be very > > helpful in making progress. > > > > > Here are a few different use cases: These use cases clarify your requirements a lot. Thanks for sharing. >- A producer(s) is producing at a near constant rate into a topic, with >equal distribution among partitions. Due to a hiccup in their downstream >component, the produce rate goes to 0 for a few seconds, and thus, to >compensate, in the next few seconds, the produce rate tries to double up. Could you also elaborate on details such as what is the current behavior of Pulsar rate limiting / throttling solution and what would be the desired behavior? Just guessing that you mean that the desired behavior would be to allow the produce rate to double up for some time (configurable)? Compared to what rate is it doubled? Please explain in detail what the current and desired behaviors would be so that it's easier to understand the gap. >- In a visitor based produce rate (where produce rate goes up in the day >and goes down in the night, think in terms of popular website hourly view >counts pattern) , there are cases when, due to certain external/internal >triggers, the views - and thus - the produce rate spikes for a few minutes. Again, please explain the current behavior and desired behavior. Explicit example values of number of messages, bandwidth, etc. would also be helpful details. >It is also important to keep this in check so as to not allow bots to do >DDOS into your system, while that might be a responsibility of an upstream >system like API gateway, but we cannot be ignorant about that completely. what would be the desired behavior? 
>- In streaming systems, where there are micro batches, there might be >constant fluctuations in produce rate from time to time, based on batch >failure or retries. could you share examples with numbers for this use case too, explaining current behavior and desired behavior? > > In all of these situations, setting the throughput of the topic to be the > absolute maximum of the various spikes observed during the day is very > suboptimal. btw. A plain token bucket algorithm implementation doesn't have an absolute maximum. The maximum average rate is controlled with the rate of tokens added to the bucket. The capacity of the bucket controls how much buffer there is to spend on spikes. If there's a need to also set an absolute maximum rate, 2 token buckets could be chained to handle that case. The second rate limiter could have an average rate of the absolute maximum with a relatively small buffer (token bucket capacity). There might also be more sophisticated algorithms which vary the maximum rate in some way to smooth out spikes, but that might just be completely unnecessary in Pulsar rate limiters. Switching to use a
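The chained "dual token bucket" arrangement Lari outlines could be sketched roughly as follows (illustrative only; the class and parameter names are hypothetical): bucket A enforces the average rate with a large burst buffer, bucket B enforces an absolute maximum rate with a small buffer, and a publish is admitted only when both buckets hold enough tokens.

```java
// Sketch of the chained "dual token bucket" idea (hypothetical, not Pulsar
// code): bucket A enforces the average rate with a large burst buffer, and
// bucket B enforces an absolute maximum rate with a small buffer. A publish
// is admitted only when BOTH buckets hold enough tokens.
class DualTokenBucket {
    private final double capA, rateA, capB, rateB; // capacities and tokens/sec
    private double tokensA, tokensB;
    private long last = System.nanoTime();

    DualTokenBucket(double capA, double rateA, double capB, double rateB) {
        this.capA = capA; this.rateA = rateA; this.tokensA = capA;
        this.capB = capB; this.rateB = rateB; this.tokensB = capB;
    }

    synchronized boolean tryAcquire(double permits) {
        long now = System.nanoTime();
        double elapsedSec = (now - last) / 1e9;
        last = now;
        tokensA = Math.min(capA, tokensA + rateA * elapsedSec);
        tokensB = Math.min(capB, tokensB + rateB * elapsedSec);
        if (tokensA >= permits && tokensB >= permits) { // check both before consuming
            tokensA -= permits;
            tokensB -= permits;
            return true;
        }
        return false;
    }
}
```

Checking both buckets before consuming avoids the wart where one bucket is drained even though the other rejects the request.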
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, replies inline. I will also be going through some textbook rate limiters (the one you shared, plus others) and propose the one that at least suits our needs in the next reply. On Tue, Nov 7, 2023 at 2:49 PM Lari Hotari wrote: > > It is bi-weekly on Thursdays. The meeting calendar, zoom link and > meeting notes can be found at > https://github.com/apache/pulsar/wiki/Community-Meetings . > > Would it make sense for me to join this time given that you are skipping it? > > ok. btw. "metrics" doesn't necessarily mean providing the rate limiter > metrics via Prometheus. There might be other ways to provide this > information for components that could react to this. > For example, it could be a system topic where these rate limiters emit > events. > > Are there any other system topics other than `tenant/namespace/__change_events`? While it's an improvement over querying metrics, it would still mean one consumer per namespace and would form a cyclic dependency - for example, in case a broker is degrading due to misuse of bursting, it might lead to delays in the consumption of the event from the __change_events topic. I agree. I just brought up this example to ensure that your > expectation about bursting isn't about controlling the rate limits > based on situational information, such as end-to-end latency > information. > Such a feature could be useful, but it does complicate things. > However, I think it's good to keep this on the radar since this might > be needed to solve some advanced use cases. > > I still envision auto-scaling to be admin API driven rather than produce throughput driven. That way, it remains deterministic in nature. But it probably doesn't make sense to even talk about it until (partition) scale-down is possible. > > >- A producer(s) is producing at a near constant rate into a topic, > with > >equal distribution among partitions.
Due to a hiccup in their > downstream >component, the produce rate goes to 0 for a few seconds, and thus, to >compensate, in the next few seconds, the produce rate tries to double > up. > > Could you also elaborate on details such as what is the current > behavior of Pulsar rate limiting / throttling solution and what would > be the desired behavior? > Just guessing that you mean that the desired behavior would be to > allow the produce rate to double up for some time (configurable)? > Compared to what rate is it doubled? > Please explain in detail what the current and desired behaviors would > be so that it's easier to understand the gap. > In all of the 3 cases that I listed, the current behavior, with precise rate limiting enabled, is to pause the Netty channel in case the throughput breaches the set limits. This eventually leads to a timeout on the client side in case the burst is significantly greater than the configured timeout on the producer side. The desired behavior in all three situations is to have a multiplier-based bursting capability for a fixed duration. For example, it could be that a Pulsar topic would be able to support 1.5x of the set quota for a burst duration of up to 5 minutes. There also needs to be a cooldown period such that it would only accept one such burst every X minutes, say every 1 hour. > > >- In a visitor based produce rate (where produce rate goes up in the > day > >and goes down in the night, think in terms of popular website hourly > view > >counts pattern) , there are cases when, due to certain > external/internal > >triggers, the views - and thus - the produce rate spikes for a few > minutes. > > Again, please explain the current behavior and desired behavior. > Explicit example values of number of messages, bandwidth, etc. would > also be helpful details.
> Adding to what I wrote above, think of this pattern like the following: the produce rate slowly increases from ~2MBps at around 4 AM to a known peak of about 30MBps by 4 PM and stays around that peak until 9 PM after which it again starts decreasing until it reaches ~2MBps around 2 AM. Now, due to some external triggers, maybe a scheduled sale event, at 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go back down to the usual ~20MBps. Here is a rough image showcasing the trend. [image: image.png] > > >It is also important to keep this in check so as to not allow bots to > do > >DDOS into your system, while that might be a responsibility of an > upstream > >system like API gateway, but we cannot be ignorant about that > completely. > > what would be the desired behavior? > The desired behavior is that the burst support should be short-lived (5-10 minutes) and limited to a fixed number of bursts in a duration (say, 1 burst per hour). Obviously, all of these should be configurable, maybe at a broker level and not a topic level. > > >- In streaming systems, where there are micro batches, there might be > >constant fluctuations in produce rate from time to time,
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
I just want to add one thing to the mix here. You can see from the number of plugin interfaces Pulsar has that somebody "left the door open" for too long. You'll agree with me that the number of those interfaces is not normal for any open-source software. Take HBase or Kafka, for example - I've never seen so many in them. You can also see the lack of attention to code quality and high-level overview in the poor implementation of the current rate limiter. The feeling is: I just need this tiny little thing and I don't have time - so over time Pulsar got into this unmaintainable mess of public APIs and some parts are simply unreadable - such as the rate limiters. I *still* don't understand how rate limiting works in Pulsar, even when I read the background and browsed quickly through the code. I can see the people on this thread are highly talented - let's use this to make Pulsar better, both from a bird's-eye view and for your own specific requirements. On Tue, Nov 7, 2023 at 3:26 PM Girish Sharma wrote: > Hello Lari, replies inline. > > I will also be going through some textbook rate limiters (the one you > shared, plus others) and propose the one that at least suits our needs in > the next reply. > > On Tue, Nov 7, 2023 at 2:49 PM Lari Hotari wrote: > >> >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and >> meeting notes can be found at >> https://github.com/apache/pulsar/wiki/Community-Meetings . >> >> > Would it make sense for me to join this time given that you are skipping > it? > > >> >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter >> metrics via Prometheus. There might be other ways to provide this >> information for components that could react to this. >> For example, it could be a system topic where these rate limiters emit >> events. >> >> > Are there any other system topics than `tenant/namespace/__change_events` > .
While it's an improvement over querying metrics, it would still mean one > consumer per namespace and would form a cyclic dependency - for example, in > case a broker is degrading due to mis-use of bursting, it might lead to > delays in the consumption of the event from the __change_events topic. > > I agree. I just brought up this example to ensure that your >> expectation about bursting isn't about controlling the rate limits >> based on situational information, such as end-to-end latency >> information. >> Such a feature could be useful, but it does complicate things. >> However, I think it's good to keep this on the radar since this might >> be needed to solve some advanced use cases. >> >> > I still envision auto-scaling to be admin API driven rather than produce > throughput driven. That way, it remains deterministic in nature. But it > probably doesn't make sense to even talk about it until (partition) > scale-down is possible. > > >> >> >- A producer(s) is producing at a near constant rate into a topic, >> with >> >equal distribution among partitions. Due to a hiccup in their >> downstream >> >component, the produce rate goes to 0 for a few seconds, and thus, to >> >compensate, in the next few seconds, the produce rate tries to >> double up. >> >> Could you also elaborate on details such as what is the current >> behavior of Pulsar rate limiting / throttling solution and what would >> be the desired behavior? >> Just guessing that you mean that the desired behavior would be to >> allow the produce rate to double up for some time (configurable)? >> Compared to what rate is it doubled? >> Please explain in detail what the current and desired behaviors would >> be so that it's easier to understand the gap. >> > > In all of the 3 cases that I listed, the current behavior, with precise > rate limiting enabled, is to pause the netty channel in case the throughput > breaches the set limits. 
This eventually leads to timeout at the client > side in case the burst is significantly greater than the configured timeout > on the producer side. > > The desired behavior in all three situations is to have a multiplier based > bursting capability for a fixed duration. For example, it could be that a > pulsar topic would be able to support 1.5x of the set quota for a burst > duration of up to 5 minutes. There also needs to be a cooldown period in > such a case that it would only accept one such burst every X minutes, say > every 1 hour. > > >> >> >- In a visitor based produce rate (where produce rate goes up in the >> day >> >and goes down in the night, think in terms of popular website hourly >> view >> >counts pattern) , there are cases when, due to certain >> external/internal >> >triggers, the views - and thus - the produce rate spikes for a few >> minutes. >> >> Again, please explain the current behavior and desired behavior. >> Explicit example values of number of messages, bandwidth, etc. would >> also be helpful details. >> > > Adding to what I wrote above, think of this pattern like the following: > the produce rate
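The multiplier-based bursting with a cooldown that Girish describes (e.g. 1.5x the quota for up to 5 minutes, at most one burst per hour) could be sketched as a small policy object. Everything below is hypothetical - assumed class, method, and parameter names, not Pulsar code:

```java
// Sketch of a multiplier-based bursting policy with cooldown (hypothetical).
// Allow up to burstMultiplier * baseRate for burstMillis, then require a
// cooldown (measured from the burst start) before the next burst window.
class BurstPolicy {
    private final double baseRate, burstMultiplier;
    private final long burstMillis, cooldownMillis;
    private long burstStart = Long.MIN_VALUE / 2; // sentinel: no burst yet

    BurstPolicy(double baseRate, double burstMultiplier,
                long burstMillis, long cooldownMillis) {
        this.baseRate = baseRate;
        this.burstMultiplier = burstMultiplier;
        this.burstMillis = burstMillis;
        this.cooldownMillis = cooldownMillis;
    }

    /** Rate currently allowed, given the demanded rate and the clock. */
    synchronized double allowedRate(double demandedRate, long nowMillis) {
        if (demandedRate <= baseRate) {
            return demandedRate; // within quota, no burst needed
        }
        boolean inBurst = nowMillis - burstStart < burstMillis;
        boolean cooledDown = nowMillis - burstStart >= cooldownMillis;
        if (!inBurst && cooledDown) {
            burstStart = nowMillis; // open a new burst window
            inBurst = true;
        }
        double cap = inBurst ? baseRate * burstMultiplier : baseRate;
        return Math.min(demandedRate, cap);
    }
}
```

The cooldown, multiplier, and window length would all be broker-level configuration in the scheme described above.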
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, Replies inline. On Tue, 7 Nov 2023 at 15:26, Girish Sharma wrote: > > Hello Lari, replies inline. > > I will also be going through some textbook rate limiters (the one you shared, > plus others) and propose the one that at least suits our needs in the next > reply. sounds good. I've also been trying to find more rate limiter resources that could be useful for our design. Bucket4J documentation gives some good ideas and it shows how the token bucket algorithm could be varied. For example, the "Refill styles" section [1] is useful to read as an inspiration. In network routers, there's a concept of "dual token bucket" algorithms, and by googling you can find both Cisco and Juniper documentation referencing this. I also asked ChatGPT-4 to explain the "dual token bucket" algorithm [2]. 1 - https://bucket4j.com/8.6.0/toc.html#refill-types 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24 >> >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and >> meeting notes can be found at >> https://github.com/apache/pulsar/wiki/Community-Meetings . >> > > Would it make sense for me to join this time given that you are skipping it? Yes, it's worth joining regularly when one is participating in Pulsar core development. There's usually a chance to discuss all topics that Pulsar community members bring up for discussion. A few times there haven't been any participants and in that case, it's good to ask on the #dev channel on Pulsar Slack whether others are joining the meeting. >> >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter >> metrics via Prometheus. There might be other ways to provide this >> information for components that could react to this. >> For example, it could be a system topic where these rate limiters emit >> events. >> > > Are there any other system topics than `tenant/namespace/__change_events`?
> While it's an improvement over querying metrics, it would still mean one > consumer per namespace and would form a cyclic dependency - for example, in > case a broker is degrading due to mis-use of bursting, it might lead to > delays in the consumption of the event from the __change_events topic. The number of events is fairly low volume, so the possible degradation wouldn't necessarily impact the event communications for this purpose. I'm not sure if there are currently any public event topics in Pulsar that are supported in a way that an application could directly read from the topic. However, since this is an advanced case, I would assume that we don't need to focus on this in the first phase of improving rate limiters. > > >> I agree. I just brought up this example to ensure that your >> expectation about bursting isn't about controlling the rate limits >> based on situational information, such as end-to-end latency >> information. >> Such a feature could be useful, but it does complicate things. >> However, I think it's good to keep this on the radar since this might >> be needed to solve some advanced use cases. >> > > I still envision auto-scaling to be admin API driven rather than produce > throughput driven. That way, it remains deterministic in nature. But it > probably doesn't make sense to even talk about it until (partition) > scale-down is possible. I agree. It's better to exclude this from the discussion at the moment so that it doesn't cause confusion. Modifying the rate limits up and down automatically with some component could be considered out of scope in the first phase. However, that might be controversial since the externally observable behavior of rate limiting with bursting support seems to behave in a way where the rate limit changes automatically. The "auto scaling" aspect of rate limiting, modifying the rate limits up and down, might be necessary as part of a rate limiter implementation eventually. More of that later. 
> > In all of the 3 cases that I listed, the current behavior, with precise rate > limiting enabled, is to pause the netty channel in case the throughput > breaches the set limits. This eventually leads to timeout at the client side > in case the burst is significantly greater than the configured timeout on the > producer side. Makes sense. > > The desired behavior in all three situations is to have a multiplier based > bursting capability for a fixed duration. For example, it could be that a > pulsar topic would be able to support 1.5x of the set quota for a burst > duration of up to 5 minutes. There also needs to be a cooldown period in such > a case that it would only accept one such burst every X minutes, say every 1 > hour. Thanks for sharing this concrete example. It will help a lot when starting to design a solution which achieves this. I think that a token bucket algorithm based design can achieve something very close to what you are describing. In the first part I shared the references to Bucket4J's "refill styles" and the concept of "dual token bucket"
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Lari/Girish, I am sorry for jumping into the discussion late, but I would like to acknowledge the requirement of a pluggable publish rate-limiter; I had also asked for it during the implementation of the publish rate limiter. There are trade-offs between different rate-limiter implementations based on accuracy, n/w usage, and simplicity, and users should be able to choose one based on their requirements. However, we don't have a correct and extensible publish rate limiter interface right now, and before making it pluggable we have to make sure that it can support any type of implementation, for example: token based or sliding-window based throttling, support of various decaying functions (eg: exponential decay: https://en.wikipedia.org/wiki/Exponential_decay), etc. I haven't seen such interface details and design in the PIP: https://github.com/apache/pulsar/pull/21399/. So, I would encourage working towards building a pluggable rate-limiter, but the current PIP is not ready as it doesn't cover such generic interfaces that can support different types of implementation. Thanks, Rajan On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari wrote: > Hi Girish, > > Replies inline. > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma > wrote: > > > > Hello Lari, replies inline. > > > > I will also be going through some textbook rate limiters (the one you > shared, plus others) and propose the one that at least suits our needs in > the next reply. > > > sounds good. I've been also trying to find more rate limiter resources > that could be useful for our design. > > Bucket4J documentation gives some good ideas and it's shows how the > token bucket algorithm could be varied. For example, the "Refill > styles" section [1] is useful to read as an inspiration. > In network routers, there's a concept of "dual token bucket" > algorithms and by googling you can find both Cisco and Juniper > documentation referencing this. > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2]. 
> > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24 > > >> > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and > >> meeting notes can be found at > >> https://github.com/apache/pulsar/wiki/Community-Meetings . > >> > > > > Would it make sense for me to join this time given that you are skipping > it? > > Yes, it's worth joining regularly when one is participating in Pulsar > core development. There's usually a chance to discuss all topics that > Pulsar community members bring up to discussion. A few times there > haven't been any participants and in that case, it's good to ask on > the #dev channel on Pulsar Slack whether others are joining the > meeting. > > >> > >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter > >> metrics via Prometheus. There might be other ways to provide this > >> information for components that could react to this. > >> For example, it could a be system topic where these rate limiters emit > events. > >> > > > > Are there any other system topics than > `tenent/namespace/__change_events` . While it's an improvement over > querying metrics, it would still mean one consumer per namespace and would > form a cyclic dependency - for example, in case a broker is degrading due > to mis-use of bursting, it might lead to delays in the consumption of the > event from the __change_events topic. > > The number of events is a fairly low volume so the possible > degradation wouldn't necessarily impact the event communications for > this purpose. I'm not sure if there are currently any public event > topics in Pulsar that are supported in a way that an application could > directly read from the topic. However since this is an advanced case, > I would assume that we don't need to focus on this in the first phase > of improving rate limiters. > > > > > > >> I agree. 
I just brought up this example to ensure that your > >> expectation about bursting isn't about controlling the rate limits > >> based on situational information, such as end-to-end latency > >> information. > >> Such a feature could be useful, but it does complicate things. > >> However, I think it's good to keep this on the radar since this might > >> be needed to solve some advanced use cases. > >> > > > > I still envision auto-scaling to be admin API driven rather than produce > throughput driven. That way, it remains deterministic in nature. But it > probably doesn't make sense to even talk about it until (partition) > scale-down is possible. > > > I agree. It's better to exclude this from the discussion at the moment > so that it doesn't cause confusion. Modifying the rate limits up and > down automatically with some component could be considered out of > scope in the first phase. However, that might be controversial since > the externally observable behavior of rate limiting with bursting > support seems to be behave in a way where the rate limit changes >
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Rajan, I haven't updated the PIP with a better interface for PublishRateLimiter yet, as the discussion here in this thread went in a different direction. Personally, I agree with you that even if we choose one algorithm and improve the built-in rate limiter, it still may not suit all use cases, as you have mentioned. On Asaf's comment on too many public interfaces in Pulsar and no other Apache software having so many public interfaces - I would like to ask, has that brought in any cons though? For this particular use case, I feel like having it as a public interface would actually improve the code quality and design as the usage would be checked and changes would go through scrutiny (unlike how the current PublishRateLimiter evolved unchecked). Asaf - what are your thoughts on this? Are you okay with making the PublishRateLimiter pluggable with a better interface? On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia wrote: > Hi Lari/Girish, > > I am sorry for jumping late in the discussion but I would like to > acknowledge the requirement of pluggable publish rate-limiter and I had > also asked it during implementation of publish rate limiter as well. There > are trade-offs between different rate-limiter implementations based on > accuracy, n/w usage, simplification and user should be able to choose one > based on the requirement. However, we don't have correct and extensible > Publish rate limiter interface right now, and before making it pluggable we > have to make sure that it should support any type of implementation for > example: token based or sliding-window based throttling, support of various > decaying functions (eg: exponential decay: > https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen > such > interface details and design in the PIP: > https://github.com/apache/pulsar/pull/21399/. 
So, I would encourage to > work > towards building pluggable rate-limiter but current PIP is not ready as it > doesn't cover such generic interfaces that can support different types of > implementation. > > Thanks, > Rajan > > On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari wrote: > > > Hi Girish, > > > > Replies inline. > > > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma > > wrote: > > > > > > Hello Lari, replies inline. > > > > > > I will also be going through some textbook rate limiters (the one you > > shared, plus others) and propose the one that at least suits our needs in > > the next reply. > > > > > > sounds good. I've been also trying to find more rate limiter resources > > that could be useful for our design. > > > > Bucket4J documentation gives some good ideas and it's shows how the > > token bucket algorithm could be varied. For example, the "Refill > > styles" section [1] is useful to read as an inspiration. > > In network routers, there's a concept of "dual token bucket" > > algorithms and by googling you can find both Cisco and Juniper > > documentation referencing this. > > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2]. > > > > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types > > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24 > > > > >> > > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and > > >> meeting notes can be found at > > >> https://github.com/apache/pulsar/wiki/Community-Meetings . > > >> > > > > > > Would it make sense for me to join this time given that you are > skipping > > it? > > > > Yes, it's worth joining regularly when one is participating in Pulsar > > core development. There's usually a chance to discuss all topics that > > Pulsar community members bring up to discussion. A few times there > > haven't been any participants and in that case, it's good to ask on > > the #dev channel on Pulsar Slack whether others are joining the > > meeting. > > > > >> > > >> ok. btw. 
"metrics" doesn't necessarily mean providing the rate limiter > > >> metrics via Prometheus. There might be other ways to provide this > > >> information for components that could react to this. > > >> For example, it could a be system topic where these rate limiters emit > > events. > > >> > > > > > > Are there any other system topics than > > `tenent/namespace/__change_events` . While it's an improvement over > > querying metrics, it would still mean one consumer per namespace and > would > > form a cyclic dependency - for example, in case a broker is degrading due > > to mis-use of bursting, it might lead to delays in the consumption of the > > event from the __change_events topic. > > > > The number of events is a fairly low volume so the possible > > degradation wouldn't necessarily impact the event communications for > > this purpose. I'm not sure if there are currently any public event > > topics in Pulsar that are supported in a way that an application could > > directly read from the topic. However since this is an advanced case, > > I would assume that we don't need to focus on this in the first phase > > of
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Rajan, Thank you for sharing your opinion. It appears you are in favor of pluggable interfaces for rate limiters. I would like to offer a perspective on why we should defer the pluggability aspect of rate limiters. If there is a real need, it can be considered later. Currently, there's a pressing need to enhance the built-in rate limiters in Pulsar. The current state is not good; our default rate limiter is nearly ineffective and unusable at scale due to its high CPU usage and overhead. Additionally, while the precise rate limiter doesn't have CPU issues, it does lack the capability for configuring an average rate over an extended period. I propose we eliminate the precise rate limiter as a distinct option and converge towards a single, configurable rate limiter solution. Using terms like "precise" in our configuration options exposes unnecessary implementation details and is a practice we should avoid. Our abstractions should be designed to allow for underlying improvements without compromising the established "contract" of the feature. For rate limiting, we can maintain the current external behavior while completely overhauling the internal workings. Introducing "bursting" features, which enable average rate calculations over longer durations, will necessitate additional configuration options to define this time window and possibly a separate maximum rate. Let's prioritize refining the core rate-limiting capabilities of Pulsar. Afterwards, we can revisit the idea of pluggable rate limiters if they still seem necessary. Concentrating on Pulsar's built-in rate limiters will also help us identify the interfaces needed for pluggable rate limiters. Moreover, by focusing on the actual rate limiting behavior and collating Pulsar user requirements, we can potentially design generic solutions that address the majority of needs within Pulsar's core. 
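To make the configuration point above concrete: bursting with an average rate over a longer window would need options for the time window, a peak multiplier, and the cooldown discussed in this thread. A rough sketch of what such options could look like follows; every key name below is a hypothetical illustration, not an existing or proposed Pulsar configuration option:

```properties
# Hypothetical broker.conf fragment -- all key names are illustrative only
# Sustained publish rate, averaged over the long term (bytes per second)
publishRateLimitBytesPerSecond=10485760
# Peak rate while burst credit lasts, as a multiple of the sustained rate
publishRateLimitBurstMultiplier=1.5
# Maximum duration of a burst at the peak rate
publishRateLimitBurstWindowSeconds=300
# Minimum gap between two bursts (the "cooldown" discussed in this thread)
publishRateLimitBurstCooldownSeconds=3600
```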
Hence, our immediate goal should be to advance these improvements and integrate them into Pulsar's core as soon as possible. -Lari On Wed, 8 Nov 2023 at 02:13, Rajan Dhabalia wrote: > > Hi Lari/Girish, > > I am sorry for jumping late in the discussion but I would like to > acknowledge the requirement of pluggable publish rate-limiter and I had > also asked it during implementation of publish rate limiter as well. There > are trade-offs between different rate-limiter implementations based on > accuracy, n/w usage, simplification and user should be able to choose one > based on the requirement. However, we don't have correct and extensible > Publish rate limiter interface right now, and before making it pluggable we > have to make sure that it should support any type of implementation for > example: token based or sliding-window based throttling, support of various > decaying functions (eg: exponential decay: > https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen such > interface details and design in the PIP: > https://github.com/apache/pulsar/pull/21399/. So, I would encourage to work > towards building pluggable rate-limiter but current PIP is not ready as it > doesn't cover such generic interfaces that can support different types of > implementation. > > Thanks, > Rajan > > On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari wrote: > > > Hi Girish, > > > > Replies inline. > > > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma > > wrote: > > > > > > Hello Lari, replies inline. > > > > > > I will also be going through some textbook rate limiters (the one you > > shared, plus others) and propose the one that at least suits our needs in > > the next reply. > > > > > > sounds good. I've been also trying to find more rate limiter resources > > that could be useful for our design. > > > > Bucket4J documentation gives some good ideas and it's shows how the > > token bucket algorithm could be varied. 
For example, the "Refill > > styles" section [1] is useful to read as an inspiration. > > In network routers, there's a concept of "dual token bucket" > > algorithms and by googling you can find both Cisco and Juniper > > documentation referencing this. > > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2]. > > > > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types > > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24 > > > > >> > > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and > > >> meeting notes can be found at > > >> https://github.com/apache/pulsar/wiki/Community-Meetings . > > >> > > > > > > Would it make sense for me to join this time given that you are skipping > > it? > > > > Yes, it's worth joining regularly when one is participating in Pulsar > > core development. There's usually a chance to discuss all topics that > > Pulsar community members bring up to discussion. A few times there > > haven't been any participants and in that case, it's good to ask on > > the #dev channel on Pulsar Slack whether others are joining the > > meeting. > > > > >> > > >> ok. btw. "metrics" doesn't
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, > On Asaf's comment on too many public interfaces in Pulsar and no other > Apache software having so many public interfaces - I would like to ask, has > that brought in any con though? For this particular use case, I feel like > having it has a public interface would actually improve the code quality > and design as the usage would be checked and changes would go through > scrutiny (unlike how the current PublishRateLimiter evolved unchecked). > Asaf - what are your thoughts on this? Are you okay with making the > PublishRateLimiter pluggable with a better interface? As a reply to this question, I'd like to highlight my previous email in this thread where I responded to Rajan. The current state of rate limiting is not acceptable in Pulsar. We need to fix things in the core. One of the downsides of adding a lot of pluggability and variation is the additional fragmentation and complexity that it adds to Pulsar. It doesn't really make sense if you need to install an external library to make the rate limiting feature usable in Pulsar. Rate limiting is only needed when operating Pulsar at scale. The core feature must support this to make any sense. The side product of this could be the pluggable interface if the core feature cannot be extended to cover the requirements that Pulsar users really have. Girish, it has been extremely valuable that you have shared concrete examples of your rate limiting related requirements. I'm confident that we will be able to cover the majority of your requirements in Pulsar core. It might be the case that when you try out the new features, it might actually cover your needs sufficiently. Let's keep the focus on improving the rate limiting in Pulsar core. The possible pluggable interface could follow. -Lari On Wed, 8 Nov 2023 at 10:46, Girish Sharma wrote: > > Hello Rajan, > I haven't updated the PIP with a better interface for PublishRateLimiter > yet as the discussion here in this thread went in a different direction. 
> > Personally, I agree with you that even if we choose one algorithm and > improve the built-in rate limiter, it still may not suit all use cases as > you have mentioned. > > On Asaf's comment on too many public interfaces in Pulsar and no other > Apache software having so many public interfaces - I would like to ask, has > that brought in any con though? For this particular use case, I feel like > having it has a public interface would actually improve the code quality > and design as the usage would be checked and changes would go through > scrutiny (unlike how the current PublishRateLimiter evolved unchecked). > Asaf - what are your thoughts on this? Are you okay with making the > PublishRateLimiter pluggable with a better interface? > > > > > > On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia wrote: > > > Hi Lari/Girish, > > > > I am sorry for jumping late in the discussion but I would like to > > acknowledge the requirement of pluggable publish rate-limiter and I had > > also asked it during implementation of publish rate limiter as well. There > > are trade-offs between different rate-limiter implementations based on > > accuracy, n/w usage, simplification and user should be able to choose one > > based on the requirement. However, we don't have correct and extensible > > Publish rate limiter interface right now, and before making it pluggable we > > have to make sure that it should support any type of implementation for > > example: token based or sliding-window based throttling, support of various > > decaying functions (eg: exponential decay: > > https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen > > such > > interface details and design in the PIP: > > https://github.com/apache/pulsar/pull/21399/. So, I would encourage to > > work > > towards building pluggable rate-limiter but current PIP is not ready as it > > doesn't cover such generic interfaces that can support different types of > > implementation. 
> > > > Thanks, > > Rajan > > > > On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari wrote: > > > > > Hi Girish, > > > > > > Replies inline. > > > > > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma > > > wrote: > > > > > > > > Hello Lari, replies inline. > > > > > > > > I will also be going through some textbook rate limiters (the one you > > > shared, plus others) and propose the one that at least suits our needs in > > > the next reply. > > > > > > > > > sounds good. I've been also trying to find more rate limiter resources > > > that could be useful for our design. > > > > > > Bucket4J documentation gives some good ideas and it's shows how the > > > token bucket algorithm could be varied. For example, the "Refill > > > styles" section [1] is useful to read as an inspiration. > > > In network routers, there's a concept of "dual token bucket" > > > algorithms and by googling you can find both Cisco and Juniper > > > documentation referencing this. > > > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, while I am yet to reply to your email from yesterday, I am trying to wrap up this discussion about the need for a pluggable rate limiter, so replying to this first. Comments inline. On Wed, Nov 8, 2023 at 5:35 PM Lari Hotari wrote: > Hi Girish, > > > On Asaf's comment on too many public interfaces in Pulsar and no other > > Apache software having so many public interfaces - I would like to ask, > has > > that brought in any con though? For this particular use case, I feel like > > having it has a public interface would actually improve the code quality > > and design as the usage would be checked and changes would go through > > scrutiny (unlike how the current PublishRateLimiter evolved unchecked). > > Asaf - what are your thoughts on this? Are you okay with making the > > PublishRateLimiter pluggable with a better interface? > > As a reply to this question, I'd like to highlight my previous email > in this thread where I responded to Rajan. > The current state of rate limiting is not acceptable in Pulsar. We > need to fix things in the core. > I wouldn't say that it's not acceptable. The precise one works as expected as a basic rate limiter. It's only when there are complex requirements that the current rate limiters fail. The key here being "complex requirements". I am with Rajan here that whatever improvements we do to the core built-in rate limiter would always miss one or the other complex requirement. I feel like we need both things here. We can work on improving the built-in rate limiter, which does not try to solve all of my needs like blacklisting support, limiting the number of bursts in a window etc. The built-in one can be improved with respect to code refactoring, extensibility, modularization etc. along with implementing a more well-known rate limiting algorithm, like token bucket. Along with this, the newly improved interface can now make way for pluggable support. 
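To make the pluggability idea being discussed here concrete, the following is a rough sketch of the minimal kind of SPI that could emerge from an improved interface. All names (`PublishRateLimiter`, `tryAcquire`, `update`, `UnlimitedRateLimiter`) are hypothetical illustrations, not the interface from the PIP:

```java
// Hypothetical sketch of a minimal pluggable publish rate limiter SPI.
// None of these names come from PIP-310; they only illustrate the shape
// such an interface could take while keeping the public surface small.
interface PublishRateLimiter {
    /** Returns true if a publish of numMessages/numBytes is admitted. */
    boolean tryAcquire(long numMessages, long numBytes);

    /** Called when topic/namespace policies change the configured rates. */
    void update(double maxMessageRate, double maxByteRate);
}

// The broker would ship a built-in implementation (e.g. token bucket based),
// while organizations with custom needs plug in their own, similar to how
// AuthorizationProvider works today. This stand-in never throttles.
class UnlimitedRateLimiter implements PublishRateLimiter {
    @Override
    public boolean tryAcquire(long numMessages, long numBytes) {
        return true; // stand-in for a real rate limiting algorithm
    }

    @Override
    public void update(double maxMessageRate, double maxByteRate) {
        // nothing to reconfigure in this stand-in
    }
}
```

The design choice sketched here is that the broker only depends on the interface, so built-in and custom limiters go through the same admission call path.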
> > One of the downsides of adding a lot of pluggability and variation is > the additional fragmentation and complexity that it adds to Pulsar. It > doesn't really make sense if you need to install an external library > to make the rate limiting feature usable in Pulsar. Rate limiting is > This is assuming that we do not improve the default built-in rate limiter at all. Think of this like AuthorizationProvider - there is a built-in one. It's good enough, but each organization would have its own requirements wrt how they handle authorization, and thus, most likely, any organization with well-defined AuthN/AuthZ constructs would be plugging their own providers there. The key thing here would be to make the public interface as minimal as possible while allowing for custom complex rate limiters as well. I believe this is easily doable without actually making the internal code more complex. In fact, the way I envision the pluggable proposal, things become simpler with respect to code flow, code ownership and custom if/else. > only needed when operating Pulsar at scale. The core feature must > support this to make any sense. > I will give another example here. In big organizations, where Pulsar is actually being used at scale - and not just in terms of QPS/MBps, but also in terms of number of teams, tenants, namespaces, number of unique features being used, there always would be an in-house schema registry. Thus, while Pulsar already has a built-in schema service and registry - it is important that it also supports custom ones. This does not speak badly about the base Pulsar package, but actually speaks more about the adaptability of the product. > > The side product of this could be the pluggable interface if the core > feature cannot be extended to cover the requirements that Pulsar users > really have. > Girish, it has been extremely valuable that you have shared concrete > examples of your rate limiting related requirements. 
I'm confident > that we will be able to cover the majority of your requirements in > Pulsar core. It might be the case that when you try out the new > features, it might actually cover your needs sufficiently. > I really respect and appreciate the discussions we have had. One of the problems I've had in the Pulsar community earlier is the lack of participation. But I am getting a lot of participation this time, so it's really good. I am willing to take this forward to improve the default rate limiters, but since we would _have_ to meet somewhere in the middle, at the end of all this - our organizational requirements would still remain unfulfilled until we build _all_ of the things that I have spoken about. Just as Rajan was late to the discussion and he pointed out that they also needed a custom rate limiter a while back, there may be others who are either unknown to this mailing list, or are yet to look into rate limiting altogether, who may find that whatever we have built/modified/improved is still lacking. > > Let's keep the focus on improving the rate limiting in Pulsar core. >
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, I've now gone through a bunch of rate limiting algorithms, along with the dual-rate, dual-bucket algorithm. Reply inline. On Tue, Nov 7, 2023 at 11:32 PM Lari Hotari wrote: > > Bucket4J documentation gives some good ideas and it's shows how the > token bucket algorithm could be varied. For example, the "Refill > styles" section [1] is useful to read as an inspiration. > In network routers, there's a concept of "dual token bucket" > algorithms and by googling you can find both Cisco and Juniper > documentation referencing this. > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2]. > > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24 > > While the dual-rate dual token bucket looks promising, there is still a challenge with respect to allowing a certain peak burst for/up to a longer duration. I am explaining it below: Assume a 10MBps topic. Bursting support of 1.5x for up to 2 minutes, once every 10-minute interval. The first bucket has the capacity of the consistent, long-term peak that the topic would observe, so basically - `limit.capacity(10_000_000).refillGreedy(1_000_000, ofMillis(100))`. Here, I am not refilling every millisecond because the topic will never receive uniform, homogeneous traffic at a millisecond level. The pattern would always be a batch of one or more messages in an instant, then nothing for the next few instants, then another batch of one or more messages, and so on. This first bucket is good enough to handle rate limiting in normal cases without bursting support. This is also better than the current precise rate limiter in Pulsar, which allows for hotspotting of produce within a second, as this one spreads the quota over 100ms time windows. The second bucket would have to have a slower refill rate and also a less granular refill rate - think of something like `limit.capacity(5_000_000).refillGreedy(5_000_000, ofMinutes(10))`. 
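To make the two-bucket idea above concrete, here is a minimal plain-Java sketch (deliberately not using the Bucket4J API) of a dual-rate dual token bucket: a publish first draws from the sustained-rate bucket and borrows any shortfall from the slowly refilling burst bucket. The class and method names are illustrative only; the numbers mirror the 10MBps example above:

```java
// Illustrative dual-rate dual token bucket (a sketch, not Pulsar code).
// Bucket 1 enforces the sustained rate; bucket 2 holds burst credit that
// refills slowly, capping both the burst size and how often it can recur.
final class DualTokenBucket {
    private final long sustainedCapacity;    // e.g. 10 MB => 10 MBps over 1 s
    private final long sustainedRefillPerMs; // e.g. 10_000 bytes/ms
    private final long burstCapacity;        // e.g. 5 MB of burst credit
    private final long burstRefillPerMs;     // slow: ~8 bytes/ms => full burst ~every 10 min
    private long sustainedTokens;
    private long burstTokens;
    private long lastRefillMs;

    DualTokenBucket(long sustainedCapacity, long sustainedRefillPerMs,
                    long burstCapacity, long burstRefillPerMs, long nowMs) {
        this.sustainedCapacity = sustainedCapacity;
        this.sustainedRefillPerMs = sustainedRefillPerMs;
        this.burstCapacity = burstCapacity;
        this.burstRefillPerMs = burstRefillPerMs;
        this.sustainedTokens = sustainedCapacity;
        this.burstTokens = burstCapacity;
        this.lastRefillMs = nowMs;
    }

    // Greedily refill both buckets based on elapsed time, capped at capacity.
    private void refill(long nowMs) {
        long elapsedMs = nowMs - lastRefillMs;
        sustainedTokens = Math.min(sustainedCapacity,
                sustainedTokens + elapsedMs * sustainedRefillPerMs);
        burstTokens = Math.min(burstCapacity,
                burstTokens + elapsedMs * burstRefillPerMs);
        lastRefillMs = nowMs;
    }

    /** Admit `bytes` from the sustained bucket, borrowing any shortfall
     *  from the burst bucket; reject if neither can cover it. */
    boolean tryAcquire(long bytes, long nowMs) {
        refill(nowMs);
        if (sustainedTokens >= bytes) {
            sustainedTokens -= bytes;
            return true;
        }
        long shortfall = bytes - sustainedTokens;
        if (burstTokens >= shortfall) {
            sustainedTokens = 0;
            burstTokens -= shortfall;
            return true;
        }
        return false;
    }
}
```

This sketch also makes the limitation described above visible: a 5 MB burst bucket lets a producer exceed the sustained rate by 5 MB in total before the credit is exhausted - roughly one second of an extra 5 MBps, not two minutes of 1.5x - so a longer burst window would need either a much larger burst bucket paired with a separate peak-rate bucket, or explicit burst-window bookkeeping.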
Now the problem with this approach is that it would allow only 1 second's worth of additional 5MBps on the topic every 10 minutes. Suppose I change the refill rate to `limit.capacity(5_000_000).refillGreedy(5_000_000, ofSeconds(1))` - then it does allow an additional 5MBps burst every second, but now there is no cap (the 2 minute cap). I will have to think this through and do some more math to figure out if this is possible at all using the dual-rate dual token bucket method. Inputs welcomed. > > >> > >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter > >> metrics via Prometheus. There might be other ways to provide this > >> information for components that could react to this. > >> For example, it could be a system topic where these rate limiters emit > events. > >> > > > > Are there any other system topics than > `tenant/namespace/__change_events`. While it's an improvement over > querying metrics, it would still mean one consumer per namespace and would > form a cyclic dependency - for example, in case a broker is degrading due > to misuse of bursting, it might lead to delays in the consumption of the > event from the __change_events topic. > > The number of events is a fairly low volume so the possible > degradation wouldn't necessarily impact the event communications for > this purpose. I'm not sure if there are currently any public event > topics in Pulsar that are supported in a way that an application could > directly read from the topic. However since this is an advanced case, > I would assume that we don't need to focus on this in the first phase > of improving rate limiters. > > While the number of events in the system topic would be fairly low per namespace, the issue is that this system topic lies on the same broker where the actual topic/partitions exist and those partitions are leading to degradation of this particular broker. > I agree. It's better to exclude this from the discussion at the moment > so that it doesn't cause confusion.
Modifying the rate limits up and > down automatically with some component could be considered out of > scope in the first phase. However, that might be controversial since > the externally observable behavior of rate limiting with bursting > support seems to behave in a way where the rate limit changes > automatically. The "auto scaling" aspect of rate limiting, modifying > the rate limits up and down, might be necessary as part of a rate > limiter implementation eventually. More of that later. > Not to drag on this topic in this discussion, but doesn't this basically mean that there is no "rate limiting" at all? Basically as good as setting it to `-1` and just letting the hardware handle it to its best extent. > > > > The desired behavior in all three situations is to have a multiplier > based bursting capability for a fixed duration. For example, it could be > that a Pulsar topic would be able to support 1.5x of the set quota for a > burst duration of up to 5 minutes. There also needs to be a cooldown period >
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, Replies inline. > > The current state of rate limiting is not acceptable in Pulsar. We > > need to fix things in the core. > > > > I wouldn't say that it's not acceptable. The precise one works as expected > as a basic rate limiter. Its only when there are complex requirements, the > current rate limiters fail. Just clarifying, I consider the situation with the default rate limiter not being optimal. The CPU overhead is significant. You must explicitly enable the "precise" rate limiters to resolve this issue. It's not very obvious for any Pulsar user. I don't think that this situation makes sense. If there's an abstraction of a rate limiter, it should be efficient and usable in the default configuration of Pulsar. > The key here being "complex requirement". I am with Rajan here that > whatever improvements we do to the core built-in rate limiter, would always > miss one or the other complex requirement. I haven't yet seen very complex requirements that relate directly to the rate limiter. The scope could expand to the area of capacity management, and I'm pretty sure that it gets there when we go further. Capacity management is a broader concern than rate limiting. We all know that capacity management is necessary in multi-tenant systems. Rate limiting and throttling is one way to handle that. When going to more complex requirements, it might be useful to go beyond rate limiting also in the conceptual design. For example DynamoDB has the concept of capacity units (CU). DynamoDB's conceptual design for capacity management is well described in a paper "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service" and a related presentation [1]. There's also other related blog posts such as "Surprising Scalability of Multitenancy" [2] and "The Road To Serverless: Multi-tenancy" [3] which have been inspirational to me. 
The paper "Kora: A Cloud-Native Event Streaming Platform For Kafka" [4] is also a very useful read to learn about serverless capacity management. Capacity management goes beyond rate limiting since it has a tight relation to end-to-end flow control, load balancing, service levels and possible auto-scaling solutions. One of the goals of capacity management in a multi-tenant system is to address the dreaded "noisy neighbor" problem in a cost optimal and efficient way. 1 - https://www.usenix.org/conference/atc22/presentation/elhemali 2 - https://brooker.co.za/blog/2023/03/23/economics.html 3 - https://me.0x.me/dbaas3.html 4 - https://www.vldb.org/pvldb/vol16/p3822-povzner.pdf > > I feel like we need both the things here. We can work on improving the > built in rate limiter which does not try to solve all of my needs like > blacklisting support, limiting the number of bursts in a window etc. The > built in one can be improved with respect to code refactoring, > extensibility, modularization etc. along with implementing a more well > known rate limiting algorithm, like token bucket. > Along with this, the newly improved interface can now make way for > pluggable support. Yes, I agree. Improving the rate limiter and exposing an interface aren't exclusive. > This is assuming that we do not improve the default build in rate limiter > at all. Think of this like AuthorizationProvider - there is a built in one. > it's good enough, but each organization would have its own requirements wrt > how they handle authorization, and thus, most likely, any organization with > well defined AuthN/AuthZ constructs would be plugging their own providers > there. I don't think that this is a valid comparison. The current rate limiter has an explicit "contract". The user sets the maximum rate in bytes and/or messages and the rate limiter takes care of enforcing that limit. It's hard to see why that "contract" would have too many interpretations of what it means. 
Another reason is that I haven't seen any other messaging product where there would be a need to add support for user provided rate limiter algorithm. What makes Pulsar a special case that it would be needed? For authentication and authorization, it's a completely different story. The abstractions require that you pick a specific implementation for your way of doing authentication and authorization. Many other systems out there do it in a somewhat similar way as Pulsar. > The key thing here would be to make the public interface as minimal as > possible while allowing for custom complex rate limiters as well. I believe > this is easily doable without actually making the internal code more > complex. It's doable, but a different question is whether this is necessary in the end. We'll see over time how we can improve the Pulsar core rate limiter and whether there's a need to override it. The current interfaces will change when the Pulsar core rate limiter is improved. This work won't easily meet in the middle unless we start by improving the core rate limiter. What Pulsar does right now might have to
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, replies inline. On Thu, 9 Nov 2023 at 00:29, Girish Sharma wrote: > While dual-rate dual token bucket looks promising, there is still some > challenge with respect to allowing a certain peak burst for/up to a bigger > duration. I am explaining it below: > Assume a 10MBps topic. Bursting support of 1.5x upto 2 minutes, once every > 10 minute interval. There are many possible ways to model dual token buckets. When there are tokens in the bucket, they are consumed as fast as possible. This is why there is a need for the second token bucket, which is used to rate limit the traffic to the absolute maximum rate. Technically the second bucket rate limits the average rate for a short time window. I'd pick the first bucket for handling the 10MB rate. The capacity of the first bucket would be 15MB * 120 = 1800MB. The fill would happen in a special way. I'm not sure if Bucket4J has this at all. So describing the way of adding tokens to the bucket: the tokens in the bucket would remain the same when the rate is <10MB. As many tokens would be added to the bucket as are consumed by the actual traffic. The leftover tokens (10MB minus the actual rate) would go to a separate filling bucket that gets poured into the actual bucket every 10 minutes. This first bucket with this separate "filling bucket" would handle the bursting up to 1800MB. The second bucket would solely enforce the 1.5x limit of the 15MB rate with a small-capacity bucket which enforces the average rate for a short time window. There's one nuance here. Bursting will only be allowed if the average rate has been lower than 10MBps, so that there are tokens left over to use for the burst. It would be possible that for example 50% of the tokens would be immediately available and 50% of the tokens are made available in the "filling bucket" that gets poured into the actual bucket every 10 minutes. Without having some way to earn the burst, I don't think that there's a reasonable way to make things usable.
The 10MB limit wouldn't have an actual meaning unless that is used to "earn" the tokens to be used for the burst. One other detail of topic publish throttling in Pulsar is that the actual throttling happens after the limit has been exceeded. This is due to the fact that Pulsar's network handling uses Netty, where you cannot block. When using the token bucket concepts, the tokens are always first consumed and after that there's a chance to pause message publishing. In code, you can find this at https://github.com/apache/pulsar/blob/c0eec1e46edeb46c888fa28f27b199ea7e7a1574/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1794 and digging down from there. In the current rate limiters in Pulsar, the implementation is not optimized for how Pulsar uses rate limiting. There's no need to use a scheduler for adding "permits", as it's called in the current rate limiter. The new tokens to add can be calculated based on the elapsed time. Resuming from the blocking state (auto read disabled -> enabled) requires a scheduler. The duration of the pause should be calculated based on the rate and the average message size. Because of the nature of asynchronous rate limiting in Pulsar topic publish throttling, the token count can go negative, and in practice it does. The calculation of the pause could also take this into account. The rate limiting will be very accurate and efficient in this way since the scheduler will only be needed when the token bucket runs out of tokens and there's really a need to throttle. This is the change that I would make to the current implementation in an experiment, to see how things behave with the revisited solution. Eliminating the precise rate limiter and having just a single rate limiter would be part of this. I think that the code base has multiple usages of the rate limiter. The dispatch rate limiting might require some variation. IIRC, rate limiting is also used for some other reasons in the code base in a blocking manner.
For example, unloading of bundles is rate limited. Working on the code base will reveal this. > While the number of events in the system topic would be fairly low per > namespace, the issue is that this system topic lies on the same broker > where the actual topic/partitions exist and those partitions are leading to > degradation of this particular broker. Yes, that could be possible but perhaps unlikely. > Agreed, there is a challenge, as it's not as straightforward as I've > demonstrated in the example above. Yes, it might require some rounds of experimentation. Although in general, I think it's a fairly straightforward problem to solve as long as the requirements could be adapted to some small details that make sense for bursting in the context of rate limiting. The detail is that the long time average rate shouldn't go over the configured rate even with the bursts. That's why the tokens usable for the burst should be "earned". I'm not sure if it even is necessary to enforce
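The scheduler-free design described above — new tokens computed lazily from elapsed time, the token count allowed to go negative because Netty handlers cannot block, and the pause derived from the deficit — might look roughly like this hypothetical sketch (invented names, not Pulsar's actual classes; the pause here is computed from the token deficit alone, a simplification of the rate-and-average-message-size idea):

```java
// Hypothetical sketch, not Pulsar code: a token bucket without a refill scheduler.
// Tokens are computed lazily from the elapsed time; the count may go negative
// because throttling is applied only after a message has already been accepted.
// A scheduler is needed only to re-enable reads after the computed pause.
class LazyTokenBucket {
    private final long ratePerSecond; // e.g. bytes per second
    private final long capacity;
    private long tokens;
    private long lastNanos;

    LazyTokenBucket(long ratePerSecond, long capacity, long nowNanos) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastNanos = nowNanos;
    }

    private void refill(long nowNanos) {
        long elapsed = nowNanos - lastNanos;
        long newTokens = elapsed * ratePerSecond / 1_000_000_000L;
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            // a real implementation would carry the truncated remainder forward
            lastNanos = nowNanos;
        }
    }

    /** Always consumes (cannot block in a Netty handler); true means "start throttling". */
    boolean consumeAndCheckThrottle(long bytes, long nowNanos) {
        refill(nowNanos);
        tokens -= bytes; // may go negative
        return tokens < 0;
    }

    /** How long to keep auto-read disabled: the time needed to repay the token deficit. */
    long throttlePauseNanos(long nowNanos) {
        refill(nowNanos);
        return tokens >= 0 ? 0 : -tokens * 1_000_000_000L / ratePerSecond;
    }
}
```

No timer fires while traffic stays under the limit; the only scheduled task is the one that re-enables auto-read after the pause, which matches the "scheduler only when throttling" goal.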
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, replies inline On Thu, Nov 9, 2023 at 6:50 AM Lari Hotari wrote: > Hi Girish, > > replies inline. > > On Thu, 9 Nov 2023 at 00:29, Girish Sharma > wrote: > > While dual-rate dual token bucket looks promising, there is still some > > challenge with respect to allowing a certain peak burst for/up to a > bigger > > duration. I am explaining it below: > > > Assume a 10MBps topic. Bursting support of 1.5x upto 2 minutes, once > every > > 10 minute interval. > > It's possible to have many ways to model a dual token buckets. > When there are tokens in the bucket, they are consumed as fast as > possible. This is why there is a need for the second token bucket > which is used to rate limit the traffic to the absolute maximum rate. > Technically the second bucket rate limits the average rate for a short > time window. > > I'd pick the first bucket for handling the 10MB rate. > The capacity of the first bucket would be 15MB * 120=1800MB. The fill > would happen in special way. I'm not sure if Bucket4J has this at all. > So describing the way of adding tokens to the bucket: the tokens in > the bucket would remain the same when the rate is <10MB. As many > How is this special behavior (tokens in bucket remaining the same when rate is <10MB) achieved? I would assume that to even figure out that the rate is less than 10MB, there is some counter going around? > tokens would be added to the bucket as are consumed by the actual > traffic. The left over tokens 10MB - actual rate would go to a > separate filling bucket that gets poured into the actual bucket every > 10 minutes. > This first bucket with this separate "filling bucket" would handle the > bursting up to 1800MB. > But this isn't the requirement? Let's assume that the actual traffic has been 5MB for a while and this 1800MB capacity bucket is all filled up now.. What's the real use here for that at all? 
> The second bucket would solely enforce the 1.5x limit of 15MB rate > with a small capacity bucket which enforces the average rate for a > short time window. > There's one nuance here. The bursting support will only allow bursting > if the average rate has been lower than 10MBps for the tokens to use > for the bursting to be usable. > It would be possible that for example 50% of the tokens would be > immediately available and 50% of the tokens are made available in the > "filling bucket" that gets poured into the actual bucket every 10 > minutes. Without having some way to earn the burst, I don't think that > there's a reasonable way to make things usable. The 10MB limit > wouldn't have an actual meaning unless that is used to "earn" the > tokens to be used for the burst. > > I think this way of thinking about the rate limiter - "earning the right to burst by letting tokens remain in the bucket (by staying lower than 10MB for a while)" - doesn't fit well in a messaging use case, either in the real world or in theory. For a 10MB topic, if the actual produce has been, say, 5MB for a long while, this shouldn't give that topic the right to burst to 15MB for as long as tokens are present. This is purely due to the fact that this will then start stressing the network and bookie disks. Imagine 100 such topics with a similar configuration of fixed+burst limits that were doing way lower than the fixed rate for the past couple of hours. Now that they've earned enough tokens, if they all start bursting, this will bring down the system, which is probably not capable of supporting simultaneous peaks of all possible topics at all.
Now of course we can utilize a broker-level fixed rate limiter to not allow the overall throughput of the system to go beyond a number, but at that point - the whole earning semantic goes for a toss anyway, since the behavior would be unknown with respect to which topics are now going through with bursting and which are being blocked due to the broker-level fixed rate limiting. As such, letting topics loose would not sit well with any sort of SLA guarantees to the end user. Moreover, contrary to the earning-tokens logic, in reality a topic _should_ be allowed to burst up to the SOP/SLA as soon as produce starts in the topic. It shouldn't _have_ to stay below the fixed rate for a while, waiting for tokens to fill up, before it is allowed to burst. This is because there is no real benefit or reason not to let the topic do so, as the hardware is already present and the topic is already provisioned (partitions, broker spread) accordingly, assuming the burst. In an algorithmic/academic/literature setting, the token bucket sounds really promising, but a platform with SLAs to users would not run like that. > In the current rate limiters in Pulsar, the implementation is not > optimized to how Pulsar uses rate limiting. There's no need to use a > scheduler for adding "permits" as it's called in the current rate > limiter. The new tokens to add can be calculated based on the elapsed > time. Resuming from the blocking state (auto read disabled
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, replies inline. > > I'd pick the first bucket for handling the 10MB rate. > > The capacity of the first bucket would be 15MB * 120=1800MB. The fill > > would happen in special way. I'm not sure if Bucket4J has this at all. > > So describing the way of adding tokens to the bucket: the tokens in > > the bucket would remain the same when the rate is <10MB. As many > > > > How is this special behavior (tokens in bucket remaining the same when rate > is <10MB) achieved? I would assume that to even figure out that the rate is > less than 10MB, there is some counter going around? It's possible to be creative in implementing the token bucket algorithm. New tokens will be added at the configured rate. When the actual traffic rate is less than the token bucket's token rate, there will be leftover tokens. One way to implement this is to calculate the amount of new tokens before using the tokens. The immediately used tokens are subtracted from the new tokens and the leftover tokens are added to the separate "filling bucket". I explained earlier that this filling bucket would be poured into the actual bucket every 10 minutes in the example scenario. > > > > tokens would be added to the bucket as are consumed by the actual > > traffic. The left over tokens 10MB - actual rate would go to a > > separate filling bucket that gets poured into the actual bucket every > > 10 minutes. > > This first bucket with this separate "filling bucket" would handle the > > bursting up to 1800MB. > > > > But this isn't the requirement? Let's assume that the actual traffic has > been 5MB for a while and this 1800MB capacity bucket is all filled up now.. > What's the real use here for that at all? How would the rate limiter know if 5MB traffic is degraded traffic which would need to be allowed to burst?
> I think this approach of thinking about rate limiter - "earning the right > to burst by letting tokens remain into the bucket, (by doing lower than > 10MB for a while)" doesn't not fit well in a messaging use case in real > world, or theoretic. It might feel like it doesn't fit well, but this is how most rate limiters work. The model works very well in practice. Without "earning the right to burst", there would have to be some other way to detect whether there's a need to burst. The need to burst could be detected by calculating the time that the message to be sent has been in queues until it's about to be sent. In other words, from the end-to-end latency. > For a 10MB topic, if the actual produce has been , say, 5MB for a long > while, this shouldn't give the right to that topic to burst to 15MB for as > much as tokens are present.. This is purely due to the fact that this will > then start stressing the network and bookie disks. Well, there's always an option to not configure bursting this way or limiting the maximum rate of bursting, like it is possible in a dual token bucket implementation. > Imagine a 100 of such topics going around with similar configuration of > fixed+burst limits and were doing way lower than the fixed rate for the > past couple of hours. Now that they've earned enough tokens, if they all > start bursting, this will bring down the system, which is probably not > capable of supporting simultaneous peaks of all possible topics at all. This is the challenge of capacity management and end-to-end flow control and backpressure. With proper system wide capacity management and end-to-end back pressure, the system won't collapse. As reference, let's take a look at the Confluent Kora paper [1]: In 5.2.1 Back Pressure and Auto-Tuning: "Backpressure is achieved via auto-tuning tenant quotas on the broker such that the combined tenant usage remains below the broker-wide limit. 
The tenant quotas are auto-tuned proportionally to their total quota allocation on the broker. This mechanism ensures fair sharing of resources among tenants during temporary overload and re-uses the quota enforcement mechanism for backpressure." "The broker-wide limits are generally defined by benchmarking brokers across clouds. The CPU-related limit is unique because there is no easy way to measure and attribute CPU usage to a tenant. Instead, the quota is defined as the clock time the broker spends processing requests and connections, and the safe limit is variable. So to protect CPU, request backpressure is triggered when request queues reach a certain threshold." In 5.2.2 "Dynamic Quota Management": "A straightforward method for distributing tenant-level quotas among the brokers hosting the tenant is to statically divide the quota evenly across the brokers. This static approach, which was deployed initially, worked reasonably well on lower subscribed clusters."... "Kora addresses this issue by using a dynamic quota mechanism that adjusts bandwidth distribution based on a tenant’s bandwidth consumption. This is achieved through the use of a shared quota service to manage quota distribution, a design similar to that used by other
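As a toy illustration of the auto-tuning idea in the Kora quote above (purely hypothetical code, not Kora's): when the sum of tenant quotas exceeds the broker-wide limit, each tenant's effective quota can be scaled down in proportion to its allocation, which is one simple reading of "auto-tuned proportionally to their total quota allocation".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of proportional quota auto-tuning (not Kora's code):
// under overload, scale each tenant's quota so the combined usage stays within
// the broker-wide limit, proportionally to each tenant's allocation.
class ProportionalQuotaTuner {
    static Map<String, Long> effectiveQuotas(Map<String, Long> allocated, long brokerLimit) {
        long total = allocated.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Long> out = new LinkedHashMap<>();
        if (total <= brokerLimit) {
            out.putAll(allocated); // no overload: full allocations apply
        } else {
            for (Map.Entry<String, Long> e : allocated.entrySet()) {
                // fair share proportional to the tenant's allocation on this broker
                out.put(e.getKey(), e.getValue() * brokerLimit / total);
            }
        }
        return out;
    }
}
```

The point of the sketch is only the shape of the mechanism: the broker-wide limit becomes the backpressure signal, and the per-tenant rate limiters are retuned from it rather than acting independently.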
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hello Lari, replies inline. It's festive season here so I might be late in the next reply. On Fri, Nov 10, 2023 at 4:51 PM Lari Hotari wrote: > Hi Girish, > > > > > > > > tokens would be added to the bucket as are consumed by the actual > > > traffic. The left over tokens 10MB - actual rate would go to a > > > separate filling bucket that gets poured into the actual bucket every > > > 10 minutes. > > > This first bucket with this separate "filling bucket" would handle the > > > bursting up to 1800MB. > > > > > > > But this isn't the requirement? Let's assume that the actual traffic has > > been 5MB for a while and this 1800MB capacity bucket is all filled up > now.. > > What's the real use here for that at all? > > How would the rate limiter know if 5MB traffic is degraded traffic > which would need to be allowed to burst? > That's not what I was implying. I was trying to question the need for 1800MB worth of capacity. I am assuming this is to allow a 2-minute burst of 15MBps? But isn't this bucket only taking care of the delta beyond 10MBps? Moreover, once the 2 minutes have elapsed, which bucket is ensuring that the rate is only allowed to go up to 10MBps? > > I think this approach of thinking about rate limiter - "earning the right > > to burst by letting tokens remain into the bucket, (by doing lower than > > 10MB for a while)" doesn't not fit well in a messaging use case in real > > world, or theoretic. > > It might feel like it doesn't fit well, but this is how most rate > limiters work. The model works very well in practice. > Without "earning the right to burst", there would have to be some > other way to detect whether there's a need to burst. > Why does the rate limiter need to decide if there is a need to burst? It is dependent on the incoming message rate.
Since the rate limiter has no knowledge of what is going to happen in the future, it cannot assume that the messages beyond a fixed rate (10MB in our example) can be held/paused until the next second - thus deciding that there is no need to burst right now. While I understand that the earn-the-right-to-burst model works well in practice, another approach to think about this is that the bucket, initially, starts filled. Moreover, the tokens here are being filled into the bucket due to the available disk, CPU and network bandwidth. The general approach where tokens initially have to be earned might be helpful to tackle cold starts, but beyond that, a topic doing 5MBps to accumulate enough tokens to burst for a few minutes in the future doesn't really translate to the physical world, does it? The 1:1 translation here is to an SSD where the topic's data is actually being written to - so my bursting up to 15MBps doesn't really depend on the fact that I was only doing 5MBps in the last few minutes (and thus accumulating the remaining 5MBps worth of tokens towards the burst) - now does it? The SSD won't really gain the ability to allow a burst just because there was low throughput in the last few minutes. Not even from a space POV either. > The need to burst could be detected by calculating the time that the > message to be sent has been in queues until it's about to be sent. > In other words, from the end-to-end latency. > In practice this level of cross-component coordination would never result in a responsive and spontaneous system, unless each message comes along with an "I waited in queue for this long" time and, on the server side, we now read the Netty channel, parse the message and figure this out before checking for rate limiting. At which point, rate limiting isn't really doing anything if the broker is reading every message anyway.
This may also require prioritization of messages, as some messages from a different producer to the same partition may have waited longer than others. This is out of the scope of a rate limiter at this point. > Imagine a 100 of such topics going around with similar configuration of > > fixed+burst limits and were doing way lower than the fixed rate for the > > past couple of hours. Now that they've earned enough tokens, if they all > > start bursting, this will bring down the system, which is probably not > > capable of supporting simultaneous peaks of all possible topics at all. > > This is the challenge of capacity management and end-to-end flow > control and backpressure. > The rate limiter is a major facilitator and guardian of capacity planning, so it does relate here. > With proper system wide capacity management and end-to-end back > pressure, the system won't collapse. > At no point am I trying to go beyond the purview of a single broker here. Unless, by system, you meant a single broker itself. For which, I talk about a broker-level rate limiter further below in the example. > In Pulsar, we have "PIP 82: Tenant and namespace level rate limiting" > [4] which introduced the "resource group" concept. There is the > resourcegroup resource in the Admin REST API [5]. There's also a > resource
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, replies inline. > Hello Lari, replies inline. It's festive season here so I might be late in > the next reply. I'll have limited availability next week so possibly not replying until the following week. We have the next Pulsar community meeting on November 23rd, so let's wrap up all of this preparation by then. > > How would the rate limiter know if 5MB traffic is degraded traffic > > which would need to be allowed to burst? > > > > That's not what I was implying. I was trying to question the need for > 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst > of 15MBps? But isn't this bucket only taking care of the delta beyond 10MBps? > Moreover, once 2 minutes are elapsed, which bucket is ensuring that the > rate is only allowed to go upto 10MBps? I guess there are multiple ways to model things. I was thinking of a model where the first bucket (and the "filling bucket" to add "earned" tokens with some interval) handles all logic related to the average rate limiting of 10MBps, including the bursting capacity which a token bucket inherently has. As we know, the token bucket doesn't have a rate limit when there are available tokens in the bucket. That's the reason why there's the separate independent bucket with a relatively short token capacity to enforce the maximum rate limit of 15MBps. There need to be tokens in both buckets for traffic to flow freely. It's like an AND and not an OR (in literature there are also examples of this where tokens are taken from another bucket when the main one runs out).
Since the rate limiter has no > knowledge of what is going to happen in future, it cannot assume that the > messages beyond a fixed rate (10MB in our example) can be held/paused until > next second - thus deciding that there is no need to burst right now. To explain this, I'll first take a quote from your email about the bursting use cases, sent on Nov 7th: > Adding to what I wrote above, think of this pattern like the following: the > produce rate slowly increases from ~2MBps at around 4 AM to a known peak of > about 30MBps by 4 PM and > stays around that peak until 9 PM after which is again starts decreasing > until it reaches ~2MBps around 2 AM. > Now, due to some external triggers, maybe a scheduled sale event, at 10PM, > the quota may spike up to 40MBps for 4-5 minutes and then again go back down > to the usual ~20MBps . Let's observe these cases. (When we get further, it will be helpful to have a set of well-defined, representative, concrete use case / scenario descriptions with concrete examples of expected behavior. Failure / chaos scenarios and operations (such as software upgrade) scenarios should also be covered.) > Why does the rate limiter need to decide if there is a need to burst? It is > dependent on the incoming message rate. Since the rate limiter has no This is a very good question. When we look at these examples, it feels that we'd need to define what the actual rules for the bursting are and how bursting is decided. It seems that there are two types of approaches: "earning the right to burst" with the token bucket so that bursting could happen whenever there are extra tokens in the bucket; or having a way to detect that there's a need to burst. That relates directly to SLAs. If there's an SLA where message processing end-to-end latencies must be low, that usually would be violated if large backlogs pile up. I think you said something about avoiding getting into this at all.
I agree that this would add complexity, but for the sake of explaining the possible cases for the need for bursting, I think that it's necessary to cover the end-to-end aspects. > While I understand that earning the right to burst model works well in > practice, another approach to think about this is that the bucket, > initially, starts filled. Moreover, the tokens here are being filled into > the bucket due to the available disk, cpu and network bandwidth.. The I completely agree. If it's fine to rely on the token bucket for bursting, this simplifies a lot of things. As you are mentioning, there could be multiple ways to fill/add tokens to the bucket. > general approach of where the tokens are initially required to be earned > might be helpful to tackle cold starts, but beyond that, a topic doing > 5MBps to accumulate enough tokens to burst for a few minutes in the future > doesn't really translate to the physical world, does it? The 1:1 > translation here is to an SSD where the topic's data is actually being > written to - So my bursting upto 15MBps doesn't really depend on the fact > that I was only doing 5 MBps in the last few minutes (and thus, > accumulating the remaining 5MBps worth tokens towards the burst) - now does > it?
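To make the dual token bucket idea above more concrete, here is a rough sketch of "tokens must be available in both buckets, like an AND, not an OR". This is illustrative pseudocode, not Pulsar code: the class and method names are invented, and a real implementation would also have to handle concurrency and performance.

```python
import time


class TokenBucket:
    """One bucket: tokens accrue at `rate` per second, up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # starts full, as discussed above
        self.last_refill = 0.0

    def refill(self, now):
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now


class DualTokenBucket:
    """AND of two buckets: the average-rate bucket has a large capacity
    (which is what allows bursting); the max-rate bucket has a small
    capacity (which enforces the hard ceiling). Traffic is allowed only
    when BOTH buckets hold enough tokens."""

    def __init__(self, avg_rate, avg_capacity, max_rate, max_capacity):
        self.avg = TokenBucket(avg_rate, avg_capacity)
        self.max = TokenBucket(max_rate, max_capacity)

    def try_acquire(self, amount, now=None):
        now = time.monotonic() if now is None else now
        self.avg.refill(now)
        self.max.refill(now)
        if self.avg.tokens >= amount and self.max.tokens >= amount:
            self.avg.tokens -= amount
            self.max.tokens -= amount
            return True
        return False  # either bucket empty -> pause the producer
```

With the example numbers (10 MBps average with 1800 MB capacity, 15 MBps maximum with, say, one second's worth of capacity), the small max-rate bucket caps any one second at 15 MB, while the large average-rate bucket determines how long such bursting can continue.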
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
One final reply from me before the holidays :)

On Sat, Nov 11, 2023 at 4:00 PM Lari Hotari wrote:

> Hi Girish, replies inline.
>
> > Hello Lari, replies inline. It's festive season here so I might be late in the next reply.
>
> I'll have limited availability next week so possibly not replying until the following week. We have the next Pulsar community meeting on November 23rd, so let's wrap up all of this preparation by then.

Sounds good.

> > > How would the rate limiter know if 5MB traffic is degraded traffic which would need to be allowed to burst?
> >
> > That's not what I was implying. I was trying to question the need for 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst of 15MBps? But isn't this bucket only taking care of the delta beyond 10MBps? Moreover, once 2 minutes have elapsed, which bucket is ensuring that the rate is only allowed to go up to 10MBps?
>
> I guess there are multiple ways to model things. I was thinking of a model where the first bucket (and the "filling bucket" to add "earned" tokens with some interval) handles all logic related to the average rate limiting of 10MBps, including the bursting capacity which a token bucket inherently has. As we know, the token bucket doesn't have a rate limit when there are available tokens in the bucket.

Actually, the capacity is meant to simulate that particular rate limit. If we have 2 buckets anyway, the one managing the fixed rate limit part shouldn't generally have a capacity more than the fixed rate, right?

> That's the reason why there's the separate independent bucket with a relatively short token capacity to enforce the maximum rate limit of 15MBps. There need to be tokens in both buckets for traffic to flow freely. It's like an AND and not an OR (in literature there are also examples of this where tokens are taken from another bucket when the main one runs out).

I think it can be done, especially with that one thing you mentioned about holding off filling the second bucket for 10 minutes.. but it does become quite complicated in terms of managing the flow of the tokens.. because while we only fill the second bucket once every 10 minutes, after the 10th minute, it needs to be filled continuously for a while (the duration we want to support the bursting for).. and the capacity of this second bucket is also governed by, and exactly matches, the burst value.

> > The general approach where the tokens are initially required to be earned might be helpful to tackle cold starts, but beyond that, a topic doing 5MBps to accumulate enough tokens to burst for a few minutes in the future doesn't really translate to the physical world, does it? The 1:1 translation here is to an SSD where the topic's data is actually being written to - so my bursting up to 15MBps doesn't really depend on the fact that I was only doing 5MBps in the last few minutes (and thus accumulating the remaining 5MBps worth of tokens towards the burst) - now does it? The SSD won't really gain the ability to allow for a burst just because there was low throughput in the last few minutes. Not even from a space POV either.
>
> This example of the SSD isn't concrete in terms of Pulsar and Bookkeeper. The capacity of a single SSD should and must be significantly higher than the maximum write throughput of a single topic. Therefore a single SSD shouldn't be a bottleneck.

Agreed that it is much higher than a single topic's max throughput.. but the context of my example had multiple topics lying on the same broker/bookie ensemble bursting together at the same time because they had been saving up on tokens in the bucket.

> There will always be a need to overprovision resources. You usually don't want to go beyond 60% or 70% utilization on disk, cpu or network resources so that queues in the system don't start to grow and impact latencies. In Pulsar/Bookkeeper, the storage solution has very effective load balancing, especially for writing. In Bookkeeper, each ledger (the segment) of a topic selects the "ensemble" and the "write quorum", the set of bookies to write to, when the ledger is opened. The bookkeeper client could also change the ensemble in the middle of a ledger due to some event like a bookie becoming read-only or

While it does do that on complete failure of a bookie or a bookie disk, or a broker going down, degradations aren't handled this well. So if all topics in a bookie are bursting because they had accumulated tokens, then all it will lead to is a breach of the write latency SLA, because at some point the disks/cpu/network etc will start choking (even after considering the 70% utilization, i.e. the 30% buffer).

> > now read the netty channel, parse the message and figure this out before checking for rate limiting.. At which point, rate limiting isn't really doing anything if the broker
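As a back-of-the-envelope check on the capacity question raised above ("I am assuming this is to allow a 2 minute burst of 15MBps?"), the relationship between bucket capacity and sustainable burst duration can be sketched as follows. This is a hedged illustration; the function name and the choice of whether the bucket keeps refilling during the burst are assumptions made for the example, not anything from the PIP.

```python
def burst_duration_secs(capacity_mb, fill_rate_mbps, burst_rate_mbps,
                        refill_during_burst=True):
    """How long a full bucket of `capacity_mb` can sustain `burst_rate_mbps`.

    If the bucket keeps refilling at `fill_rate_mbps` during the burst,
    the net drain per second is (burst - fill); otherwise the full burst
    rate drains it.
    """
    drain = (burst_rate_mbps - fill_rate_mbps) if refill_during_burst \
        else burst_rate_mbps
    if drain <= 0:
        return float("inf")  # the bucket never empties at this rate
    return capacity_mb / drain


# 1800 MB drained at 15 MBps with no refill: 1800 / 15 = 120 s ("2 minute burst")
# 1800 MB drained at 15 MBps while refilling at 10 MBps: 1800 / 5 = 360 s
```

This is one source of the ambiguity in the thread: whether 1800 MB of capacity means "2 minutes at 15 MBps" or "6 minutes over the 10 MBps average" depends on whether the average-rate bucket refills while the burst is being consumed.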
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Hi Girish, replies inline, and after that there are some updates about my preparation for the community meeting on Thursday. (There's https://github.com/lhotari/async-tokenbucket with a PoC for a low-level, high performance token bucket implementation.)

On Sat, 11 Nov 2023 at 17:25, Girish Sharma wrote:

> Actually, the capacity is meant to simulate that particular rate limit. If we have 2 buckets anyway, the one managing the fixed rate limit part shouldn't generally have a capacity more than the fixed rate, right?

There are multiple ways to model and understand a dual token bucket implementation. I view the 2 buckets in a dual token bucket implementation as separate buckets. They are like an AND rule, so if either bucket is empty, there will be a need to pause and wait for new tokens. Since we aren't working with code yet, these comments could be out of context.

> I think it can be done, especially with that one thing you mentioned about holding off filling the second bucket for 10 minutes.. but it does become quite complicated in terms of managing the flow of the tokens.. because while we only fill the second bucket once every 10 minutes, after the 10th minute, it needs to be filled continuously for a while (the duration we want to support the bursting for).. and the capacity of this second bucket is also governed by, and exactly matches, the burst value.

There might not be a need for this complexity of the "filling bucket" in the first place. It was more of a demonstration that it's possible to implement the desired behavior of limited bursting by tweaking the basic token bucket algorithm slightly. I'd rather avoid this additional complexity.

> Agreed that it is much higher than a single topic's max throughput.. but the context of my example had multiple topics lying on the same broker/bookie ensemble bursting together at the same time because they had been saving up on tokens in the bucket.

Yes, that makes sense.

> > There will always be a need to overprovision resources. You usually don't want to go beyond 60% or 70% utilization on disk, cpu or network resources so that queues in the system don't start to grow and impact latencies. In Pulsar/Bookkeeper, the storage solution has very effective load balancing, especially for writing. In Bookkeeper, each ledger (the segment) of a topic selects the "ensemble" and the "write quorum", the set of bookies to write to, when the ledger is opened. The bookkeeper client could also change the ensemble in the middle of a ledger due to some event like a bookie becoming read-only or
>
> While it does do that on complete failure of a bookie or a bookie disk, or a broker going down, degradations aren't handled this well. So if all topics in a bookie are bursting because they had accumulated tokens, then all it will lead to is a breach of the write latency SLA, because at some point the disks/cpu/network etc will start choking (even after considering the 70% utilization, i.e. the 30% buffer).

Yes.

> That's only the case with the default rate limiter, where tryAcquire isn't even implemented.. since the default rate limiter checks for a breach only at a fixed rate rather than before every produce call. But in the case of the precise rate limiter, the response of `tryAcquire` is respected.

This is one of many reasons why I think it's better to improve the maintainability of the current solution and remove the unnecessary choice between "precise" and the default one.

> True, and actually, due to the fact that pulsar auto distributes topics based on load shedding parameters, we can actually focus on a single broker or a single bookie ensemble and assume that it works as we scale it. Of course this means putting reasonable limits in terms of cpu/network/partition/throughput at each broker level, and pulsar provides ways to do that automatically.

We do have plans to improve Pulsar so that things simply work properly under heavy load. Optimally, things would work without the need to tune and tweak the system rigorously. These improvements go beyond rate limiting.

> While I have shared the core requirements over these threads (fixed rate + burst multiplier for up to X duration every Y minutes).. we are finalizing the detailed requirements internally to present. As I replied in my previous mail, one outcome of the detailed internal discussion was the discovery of throughput contention.

Sounds good. Sharing your experiences is extremely valuable.

> We do use resource groups for certain namespace level quotas, but even in our use case, rate limiters and resource groups are two separate tangents, at least for the foreseeable future.

At this point they are separate, but there is a need to improve. Jack Vanlightly's blog post series "The Architecture Of Serverless Data Systems" [1] explains the competition in this area very well. Multi-tenant capacity management and SLAs are at
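To illustrate the `tryAcquire` discussion above, here is a rough, hypothetical sketch of the behavioral difference between a "precise" limiter (checked before every produce call) and a polling limiter (usage counted at publish time, limit only checked at a fixed interval). The class names are invented and this is not the actual Pulsar implementation; it only demonstrates why a polling limiter can overshoot within one interval while a precise one cannot.

```python
class PreciseLimiterSketch:
    """Checks the budget before every publish; callers respect the result."""

    def __init__(self, rate_bytes_per_sec):
        self.rate = rate_bytes_per_sec
        self.tokens = rate_bytes_per_sec
        self.last = 0.0

    def try_acquire(self, msg_bytes, now):
        # refill first, capped at one second's worth of tokens
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= msg_bytes:
            self.tokens -= msg_bytes
            return True
        return False  # caller pauses reading from the connection


class PollingLimiterSketch:
    """Only counts usage at publish time; the limit is checked at a fixed
    polling interval, so traffic between polls is never rejected."""

    def __init__(self, limit_bytes_per_interval):
        self.limit = limit_bytes_per_interval
        self.used = 0

    def record(self, msg_bytes):
        self.used += msg_bytes  # no decision is made here

    def poll(self):
        over = self.used > self.limit
        self.used = 0  # reset the window for the next interval
        return over  # True -> pause producers until the next poll
```

The polling variant accepts an arbitrarily large burst between two polls and only reacts afterwards, which is exactly the imprecision the thread refers to.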
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
I have written a long blog post that contains the context, the summary of my viewpoint about PIP-310, and the proposal for proceeding:

https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html

Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's coordinate on Pulsar Slack's #dev channel if there are issues in joining the meeting. See you tomorrow!

-Lari

1 - https://github.com/apache/pulsar/wiki/Community-Meetings

On Mon, 20 Nov 2023 at 20:48, Lari Hotari wrote:

> Hi Girish,
>
> replies inline and after that there are some updates about my preparation for the community meeting on Thursday. (there's https://github.com/lhotari/async-tokenbucket with a PoC for a low-level high performance token bucket implementation)
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
I've captured our requirements in detail in this document - https://docs.google.com/document/d/1-y5nBaC9QuAUHKUGMVVe4By-SmMZIL4w09U1byJBbMc/edit

Added it to the agenda document as well. Will join the meeting and discuss.

Regards

On Wed, Nov 22, 2023 at 10:49 PM Lari Hotari wrote:

> I have written a long blog post that contains the context, the summary of my viewpoint about PIP-310, and the proposal for proceeding:
>
> https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html
>
> Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's coordinate on Pulsar Slack's #dev channel if there are issues in joining the meeting. See you tomorrow!
>
> -Lari
>
> 1 - https://github.com/apache/pulsar/wiki/Community-Meetings
Re: [DISCUSS] PIP-310: Support custom publish rate limiters
Closing this discussion thread and the PIP. Apart from the discussion present in this thread, I presented the detailed requirements in the dev meeting on 23rd November, and the conclusion was that we will go ahead and implement the requirements in Pulsar itself. There was a prerequisite of refactoring the rate limiter codebase, which is already covered by Lari in PIP-322. I will be creating a new parent PIP soon about the high level requirements.

Thank you everyone who participated in the thread and the discussion in the 23rd dev meeting.

Regards

On Thu, Nov 23, 2023 at 8:26 PM Girish Sharma wrote:

> I've captured our requirements in detail in this document - https://docs.google.com/document/d/1-y5nBaC9QuAUHKUGMVVe4By-SmMZIL4w09U1byJBbMc/edit
> Added it to the agenda document as well. Will join the meeting and discuss.
>
> Regards