[DISCUSS] PIP-310: Support custom publish rate limiters

2023-10-19 Thread Girish Sharma
Hi,
Currently, there are only two kinds of publish rate limiters: polling-based
and precise. Users have the option to use either one in the topic publish
rate limiter, but the resource group rate limiter only uses the polling-based
one.

There are challenges with both rate limiters, as well as the fact that we
can't use the precise rate limiter at the resource group level.

Thus, in order to support custom rate limiters, I've created PIP-310.

This is the discussion thread. Please go through the PIP and provide your
inputs.

Link - https://github.com/apache/pulsar/pull/21399

Regards
-- 
Girish Sharma


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-10-22 Thread Asaf Mesika
Replied in PR.


On Thu, Oct 19, 2023 at 3:51 PM Girish Sharma 
wrote:

> Hi,
> Currently, there are only 2 kinds of publish rate limiters - polling based
> and precise. Users have an option to use either one of them in the topic
> publish rate limiter, but the resource group rate limiter only uses polling
> one.
>
> There are challenges with both the rate limiters and the fact that we can't
> use precise rate limiter in the resource group level.
>
> Thus, in order to support custom rate limiters, I've created the PIP-310
>
> This is the discussion thread. Please go through the PIP and provide your
> inputs.
>
> Link - https://github.com/apache/pulsar/pull/21399
>
> Regards
> --
> Girish Sharma
>


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread Lari Hotari
Hi Girish,

In order to address your problem described in the PIP document [1], it might be 
necessary to make improvements in how rate limiters apply backpressure in 
Pulsar.

Pulsar mainly uses TCP/IP connection-level controls to achieve backpressure. 
The challenge is that Pulsar can share a single TCP/IP connection across 
multiple producers and consumers. Because of this, multiple producers, 
consumers, and rate limiters may operate on the same connection on the broker, 
and they will make conflicting decisions, which results in undesired behavior.

Regarding the shared TCP/IP connection backpressure issue, Apache Flink had a 
somewhat similar problem before Flink 1.5. It is described in the "inflicting 
backpressure" section of this blog post from 2019:
https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
Flink solved the issue of multiplexing multiple streams of data on a single 
TCP/IP connection in Flink 1.5 by introducing its own flow control mechanism.

The backpressure and rate limiting challenges have been discussed a few times 
in Pulsar community meetings over the past years. There was also a generic 
backpressure thread on the dev mailing list [2] in Sep 2022. 
However, we haven't really documented Pulsar's backpressure design, how rate 
limiters are part of the overall solution, or how we could improve it. 
I think it might be time to do so, since there's a requirement to improve rate 
limiting. I guess that's also the main motivation for PIP-310.

-Lari

1 - https://github.com/apache/pulsar/pull/21399/files
2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271

On 2023/10/19 12:51:14 Girish Sharma wrote:
> Hi,
> Currently, there are only 2 kinds of publish rate limiters - polling based
> and precise. Users have an option to use either one of them in the topic
> publish rate limiter, but the resource group rate limiter only uses polling
> one.
> 
> There are challenges with both the rate limiters and the fact that we can't
> use precise rate limiter in the resource group level.
> 
> Thus, in order to support custom rate limiters, I've created the PIP-310
> 
> This is the discussion thread. Please go through the PIP and provide your
> inputs.
> 
> Link - https://github.com/apache/pulsar/pull/21399
> 
> Regards
> -- 
> Girish Sharma
> 


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread Girish Sharma
Hello Lari,
Thanks for bringing this to my attention. I went through the links, but
does this sharing of the same TCP/IP connection happen across partitions as
well (assuming both partitions of the topic are on the same broker)?
i.e. would a producer at 127.0.0.1 for partition
`persistent://tenant/ns/topic0-partition0` and a producer at 127.0.0.1 for
partition `persistent://tenant/ns/topic0-partition1` share the same TCP/IP
connection, assuming both are on broker-0?

In general, the major use case behind this PIP for me and my organization
is supporting produce spikes. We do not want to allocate the absolute
maximum throughput for a topic when it would not even be utilized 99.99% of
the time. Thus, for a topic that stays constantly at 100MBps and goes to
150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
resources 100% of the time. The poller-based rate limiter is also no good
here, as it would allow overuse of the hardware without a check, degrading
the system.
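To put numbers on this tradeoff, here is a back-of-the-envelope sketch of the burst allowance such a topic would need (the 30-second burst duration is an assumed illustrative figure, not from the thread):

```python
# Back-of-the-envelope sizing for the burst scenario above.
# Assumed figures: sustained limit 100 MBps, rare bursts of 150 MBps for 30 s.
sustained_rate_mbps = 100
burst_rate_mbps = 150
burst_duration_s = 30

# During the burst, traffic exceeds the sustained limit by the difference,
# so a limiter that tolerates the burst must absorb that excess volume,
# instead of permanently provisioning for the peak rate.
required_burst_allowance_mb = (burst_rate_mbps - sustained_rate_mbps) * burst_duration_s
print(required_burst_allowance_mb)  # 1500 MB of headroom covers the spike
```

In other words, a limiter that can absorb a bounded excess volume avoids provisioning an extra 50 MBps of capacity that sits idle 99.99% of the time.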

@Asaf, I have been sick these last 10 days, but will be updating the PIP
with the discussed changes early next week.

Regards

On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari  wrote:

> Hi Girish,
>
> In order to address your problem described in the PIP document [1], it
> might be necessary to make improvements in how rate limiters apply
> backpressure in Pulsar.
>
> Pulsar uses mainly TCP/IP connection level controls for achieving
> backpressure. The challenge is that Pulsar can share a single TCP/IP
> connection across multiple producers and consumers. Because of this, there
> could be multiple producers and consumers and rate limiters operating on
> the same connection on the broker, and they will do conflicting decisions,
> which results in undesired behavior.
>
> Regarding the shared TCP/IP connection backpressure issue, Apache Flink
> had a somewhat similar problem before Flink 1.5. It is described in the
> "inflicting backpressure" section of this blog post from 2019:
>
> https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
> Flink solved the issue of multiplexing multiple streams of data on a
> single TCP/IP connection in Flink 1.5 by introducing it's own flow control
> mechanism.
>
> The backpressure and rate limiting challenges have been discussed a few
> times in Pulsar community meetings over the past years. There was also a
> generic backpressure thread on the dev mailing list [2] in Sep 2022.
> However, we haven't really documented Pulsar's backpressure design and how
> rate limiters are part of the overall solution and how we could improve.
> I think it might be time to do so since there's a requirement to improve
> rate limiting. I guess that's the main motivation also for PIP-310.
>
> -Lari
>
> 1 - https://github.com/apache/pulsar/pull/21399/files
> 2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271
>
> On 2023/10/19 12:51:14 Girish Sharma wrote:
> > Hi,
> > Currently, there are only 2 kinds of publish rate limiters - polling
> based
> > and precise. Users have an option to use either one of them in the topic
> > publish rate limiter, but the resource group rate limiter only uses
> polling
> > one.
> >
> > There are challenges with both the rate limiters and the fact that we
> can't
> > use precise rate limiter in the resource group level.
> >
> > Thus, in order to support custom rate limiters, I've created the PIP-310
> >
> > This is the discussion thread. Please go through the PIP and provide your
> > inputs.
> >
> > Link - https://github.com/apache/pulsar/pull/21399
> >
> > Regards
> > --
> > Girish Sharma
> >
>


-- 
Girish Sharma


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread 太上玄元道君
Hi Girish,

There is also a discussion thread[1] about rate limiting.

I think there are some conflicts between certain kinds of rate limiters and
backpressure.

Take the fail-fast strategy as an example: brokers have to reply to clients
after receiving and decoding a message, but the broker also has a
backpressure mechanism, under which it cannot read messages while the
channel's auto-read is disabled (`disableAutoRead`).

So rate limiters have to adapt to backpressure.

Thanks,
Tao Jiuming

On Oct 19, 2023, at 20:51, Girish Sharma  wrote:

Hi,
Currently, there are only 2 kinds of publish rate limiters - polling based
and precise. Users have an option to use either one of them in the topic
publish rate limiter, but the resource group rate limiter only uses polling
one.

There are challenges with both the rate limiters and the fact that we can't
use precise rate limiter in the resource group level.

Thus, in order to support custom rate limiters, I've created the PIP-310

This is the discussion thread. Please go through the PIP and provide your
inputs.

Link - https://github.com/apache/pulsar/pull/21399

Regards
-- 
Girish Sharma


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread Girish Sharma
Hello Tao,
As I understand it, there is a fine balance between rate limiting,
backpressure, and not keeping clients waiting. Different use cases may need
different approaches to rate limiting, and thus making the rate limiter
customizable is my first step towards making Pulsar more customizable per
need.

Regards

On Fri, Nov 3, 2023 at 5:42 PM 太上玄元道君  wrote:

> Hi Girish,
>
> There is also a discussion thread[1] about rate-limiting.
>
> I think there is some conflicts between some kind of rate-limiter and
> backpressure
>
> Take the fail-fast strategy as an example:
> Brokers have to reply to clients after receiving and decode the message,
> but the broker also has the back-pressure mechanism. Broker cannot read
> messages because the channel is `disableAutoRead`.
>
> So the rate-limiters have to adapt to back-pressure.
>
> Thanks,
> Tao Jiuming
>
> On Oct 19, 2023, at 20:51, Girish Sharma  wrote:
>
> Hi,
> Currently, there are only 2 kinds of publish rate limiters - polling based
> and precise. Users have an option to use either one of them in the topic
> publish rate limiter, but the resource group rate limiter only uses polling
> one.
>
> There are challenges with both the rate limiters and the fact that we can't
> use precise rate limiter in the resource group level.
>
> Thus, in order to support custom rate limiters, I've created the PIP-310
>
> This is the discussion thread. Please go through the PIP and provide your
> inputs.
>
> Link - https://github.com/apache/pulsar/pull/21399
>
> Regards
> --
> Girish Sharma
>


-- 
Girish Sharma


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread Lari Hotari
Hi Tao, 

You seem to have missed sending the link that you were referring to. Are you 
referring to 
the thread "[discuss] Support fail-fast strategy when broker rate-limited" [1]?

-Lari

1 - https://lists.apache.org/thread/tp2f1ghomj2kw5ltgz8w8k5s58gs88qz


On 2023/11/03 12:11:31 太上玄元道君 wrote:
> Hi Girish,
> 
> There is also a discussion thread[1] about rate-limiting.
> 
> I think there is some conflicts between some kind of rate-limiter and
> backpressure
> 
> Take the fail-fast strategy as an example:
> Brokers have to reply to clients after receiving and decode the message,
> but the broker also has the back-pressure mechanism. Broker cannot read
> messages because the channel is `disableAutoRead`.
> 
> So the rate-limiters have to adapt to back-pressure.
> 
> Thanks,
> Tao Jiuming
> 
> On Oct 19, 2023, at 20:51, Girish Sharma  wrote:
> 
> Hi,
> Currently, there are only 2 kinds of publish rate limiters - polling based
> and precise. Users have an option to use either one of them in the topic
> publish rate limiter, but the resource group rate limiter only uses polling
> one.
> 
> There are challenges with both the rate limiters and the fact that we can't
> use precise rate limiter in the resource group level.
> 
> Thus, in order to support custom rate limiters, I've created the PIP-310
> 
> This is the discussion thread. Please go through the PIP and provide your
> inputs.
> 
> Link - https://github.com/apache/pulsar/pull/21399
> 
> Regards
> -- 
> Girish Sharma
> 


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread Lari Hotari
Hi Girish,

Thanks for the questions. I'll reply to them below.

> does this sharing of the same tcp/ip connection happen across partitions as
> well (assuming both the partitions of the topic are on the same broker)?
> i.e. producer 127.0.0.1 for partition
> `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> partition `persistent://tenant/ns/topic0-partition1` share the same tcp/ip
> connection assuming both are on broker-0 ?

The Pulsar Java client would be sharing the same TCP/IP connection to a single 
broker when using the default setting of connectionsPerBroker = 1. It could be 
using a different connection if connectionsPerBroker > 1.

> In general, the major use case behind this PIP for me and my organization
> is about supporting produce spikes. We do not want to allocate absolute
> maximum throughput for a topic which would not even be utilized 99.99% of
> the time. Thus, for a topic that stays constantly at 100MBps and goes to
> 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
> resources 100% of the time. The poller based rate limiter is also not good
> here as it would allow over use of hardware without a check, degrading the
> system.

I'm trying to understand your use case better. 
In the PIP-310 document it says:
> Precise rate limiter fixes the above two issues, but introduces another 
> challenge - the rate limiting is too strict.
> * This leads to potential data loss in case there are sudden spikes in 
> produce and the client-side produce queue breaches.
> * The produce latencies increase exponentially in case produce breaches the 
> set throughput, even for small windows.

Could you please elaborate more on these details? Here are some questions:
1. What do you mean that it is too strict? 
- Should the rate limiting allow bursting over the limit for some time?
2. What type of data loss are you experiencing? 
3. What is the root cause of the data loss?
   - Do you mean that the system performance degrades, and data loss happens 
because messages cannot be forwarded from the client to the broker quickly 
enough?

Once there's a common understanding of the problem, it's easier to design the 
solution together in the Pulsar community. One possibility is that PIP-310 
already solves your problem. Another possibility is that we need to improve 
producer flow control. I currently feel that the latter is needed, but I might 
be wrong.

As mentioned in my previous email, there have been discussions about improving 
producer flow control. One of the solution ideas discussed in a Pulsar 
community meeting in January was to add explicit flow control to producers, 
somewhat similar to how there are "permits" as the flow control for consumers. 
The permits would be based on byte size (instead of the number of messages). 
With explicit flow control in the protocol, the rate limiting would also be 
effective and deterministic, and the issues that Tao Jiuming was explaining 
could also be resolved. It would also solve the producer/consumer multiplexing 
problem on a single TCP/IP connection, since flow control and rate limiting 
would no longer be based on the TCP/IP level (and toggling the Netty channel's 
auto-read property).
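To make the permit idea concrete, here is a toy model of byte-based producer flow control (all names are hypothetical; this is a sketch of the concept, not Pulsar's wire protocol):

```python
class ProducerPermits:
    """Toy model of byte-based producer flow control: the broker grants a
    budget of bytes, the producer spends it per message and must wait for
    new grants. Hypothetical names - a sketch, not Pulsar's protocol."""

    def __init__(self):
        self.available_bytes = 0

    def grant(self, n_bytes):
        # Broker -> producer: top up the byte budget, e.g. once per rate window.
        self.available_bytes += n_bytes

    def try_send(self, msg_bytes):
        # Producer side: send only if the whole message fits in the budget.
        if msg_bytes <= self.available_bytes:
            self.available_bytes -= msg_bytes
            return True
        return False  # caller should buffer and apply backpressure upstream

permits = ProducerPermits()
permits.grant(1024)
assert permits.try_send(800)      # fits within the granted budget
assert not permits.try_send(800)  # only 224 bytes left -> wait for a grant
```

Because the budget is per producer rather than per connection, this style of accounting stays deterministic even when many producers multiplex one TCP/IP connection.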

Let's continue the discussion, since I think that this is an important 
improvement area. Together we could find a good solution that works for 
multiple use cases and addresses the existing challenges in producer flow 
control and rate limiting. 

-Lari

On 2023/11/03 11:16:37 Girish Sharma wrote:
> Hello Lari,
> Thanks for bringing this to my attention. I went through the links, but
> does this sharing of the same tcp/ip connection happen across partitions as
> well (assuming both the partitions of the topic are on the same broker)?
> i.e. producer 127.0.0.1 for partition
> `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> partition `persistent://tenant/ns/topic0-partition1` share the same tcp/ip
> connection assuming both are on broker-0 ?
> 
> In general, the major use case behind this PIP for me and my organization
> is about supporting produce spikes. We do not want to allocate absolute
> maximum throughput for a topic which would not even be utilized 99.99% of
> the time. Thus, for a topic that stays constantly at 100MBps and goes to
> 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
> resources 100% of the time. The poller based rate limiter is also not good
> here as it would allow over use of hardware without a check, degrading the
> system.
> 
> @Asif, I have been sick these last 10 days, but will be updating the PIP
> with the discussed changes early next week.
> 
> Regards
> 
> On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari  wrote:
> 
> > Hi Girish,
> >
> > In order to address your problem described in the PIP document [1], it
> > might be necessary to make improvements in how rate limiters apply
> > backpressure in Pulsar.
> >
> > 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-03 Thread Girish Sharma
Hello Lari, replies inline.


On Fri, Nov 3, 2023 at 11:13 PM Lari Hotari  wrote:

> Hi Girish,
>
> Thanks for the questions. I'll reply to them
>
> > does this sharing of the same tcp/ip connection happen across partitions
> as
> > well (assuming both the partitions of the topic are on the same broker)?
> > i.e. producer 127.0.0.1 for partition
> > `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> > partition `persistent://tenant/ns/topic0-partition1` share the same
> tcp/ip
> > connection assuming both are on broker-0 ?
>
> The Pulsar Java client would be sharing the same TCP/IP connection to a
> single broker when using the default setting of connectionsPerBroker = 1.
> It could be using a different connection if connectionsPerBroker > 1.
>
> Thanks for clarifying this.

> Could you please elaborate more on these details? Here are some questions:
> 1. What do you mean that it is too strict?
> - Should the rate limiting allow bursting over the limit for some time?
>

That's one of the major use cases, yes.


> 2. What type of data loss are you experiencing?
>

Messages produced by the producers eventually get timed out due to rate
limiting, and those timed-out messages are lost.


> 3. What is the root cause of the data loss?
>- Do you mean that the system performance degrades and data loss is due
> to not being able to produce from client to the broker quickly enough and
> data loss happens because messages cannot be forwarded from the client to
> the broker?
>

No, the system performance only degrades in the case of the poller-based rate
limiter. With the precise one, it's purely the broker pausing the Netty
channel's auto-read property. If the producer goes beyond the set throughput
for longer than the send timeout, it starts observing timeouts, and the
timed-out messages are essentially lost.

> As mentioned in my previous email, there has been discussions about
> improving producer flow control. One of the solution ideas that was
> discussed in a Pulsar community meeting in January was to add explicit flow
> control to producers, somewhat similar to how there are "permits" as the
> flow control for consumers. The permits would be based on byte size
> (instead of number of messages). With explicit flow control in the
> protocol, the rate limiting will also be effective and deterministic and
> the issues that Tao Jiuming was explaining could also be resolved. It also
> would solve the producer/consumer multiplexing on a single TCP/IP
> connection when flow control and rate limiting isn't based on the TCP/IP
> level (and toggling the Netty channel's auto read property).
>
I think the core implementation of how the broker fails fast at the time
of rate limiting (whether by pausing the Netty channel or via a new
permits-based model) does not change the actual issue I am targeting.
Multiplexing has some impact on it, but only a limited one, and it can
easily be fixed by the client by increasing the connections per broker. Even
assuming both of these things are somehow "fixed", the fact remains that an
absolutely strict rate limiter will lead to the above-mentioned data loss
for bursts going above the limit, and that a poller-based rate limiter
doesn't really rate limit anything, as it allows all produce in the first
interval of the next second.
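The contrast described above can be caricatured in a few lines (a deliberately simplified model, not Pulsar's actual implementation):

```python
def polling_admit(burst, limit):
    """Polling-style limiter: messages in a tick are admitted first; the
    periodic check that compares the counter to the limit runs only
    afterwards, so an entire burst inside one tick slips through before
    the channel's auto-read is paused."""
    admitted = burst            # everything in the tick gets through
    paused = admitted >= limit  # the later check then pauses the channel
    return admitted, paused

def strict_admit(burst, limit):
    """Precise-style limiter: admission is hard-capped at the limit; the
    excess is held back and may eventually hit the client's send timeout."""
    return min(burst, limit), burst > limit

assert polling_admit(500, 100) == (500, True)  # 5x overshoot still admitted
assert strict_admit(500, 100) == (100, True)   # hard cap, 400 left waiting
```

Neither extreme matches the bursting use case: one admits the whole overshoot unchecked, the other queues the excess until timeouts turn it into data loss.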


> Let's continue discussion, since I think that this is an important
> improvement area. Together we could find a good solution that works for
> multiple use cases and addresses existing challenges in producer flow
> control and rate limiting.
>
> -Lari
>
> On 2023/11/03 11:16:37 Girish Sharma wrote:
> > Hello Lari,
> > Thanks for bringing this to my attention. I went through the links, but
> > does this sharing of the same tcp/ip connection happen across partitions
> as
> > well (assuming both the partitions of the topic are on the same broker)?
> > i.e. producer 127.0.0.1 for partition
> > `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> > partition `persistent://tenant/ns/topic0-partition1` share the same
> tcp/ip
> > connection assuming both are on broker-0 ?
> >
> > In general, the major use case behind this PIP for me and my organization
> > is about supporting produce spikes. We do not want to allocate absolute
> > maximum throughput for a topic which would not even be utilized 99.99% of
> > the time. Thus, for a topic that stays constantly at 100MBps and goes to
> > 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth
> of
> > resources 100% of the time. The poller based rate limiter is also not
> good
> > here as it would allow over use of hardware without a check, degrading
> the
> > system.
> >
> > @Asif, I have been sick these last 10 days, but will be updating the PIP
> > with the discussed changes early next week.
> >
> > Regards
> >
> > On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari  wrote:
> >
> > > Hi Girish,
> > >
> > > In order to address your problem described in the PIP document [1], it
> > > 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-04 Thread Lari Hotari
Replies inline

On Fri, 3 Nov 2023 at 20:48, Girish Sharma  wrote:

> > Could you please elaborate more on these details? Here are some questions:
> > 1. What do you mean that it is too strict?
> > - Should the rate limiting allow bursting over the limit for some
> time?
> >
>
> That's one of the major use cases, yes.
>

One possibility would be to improve the existing rate limiter to allow
bursting.
I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
use cases, instead of users having to implement their own rate limiter
algorithm.
The problems you are describing seem to be common to many Pulsar use cases,
and therefore, I think they should be handled directly in Pulsar.

Optimally, there would be a single solution that abstracts the rate
limiting in a way where it does the right thing based on the declarative
configuration.
I would prefer that over having a pluggable solution for rate limiter
implementations.

What would help is getting deeper in the design of the rate limiter itself,
without limiting ourselves to the existing rate limiter implementation in
Pulsar.

In textbooks, there are algorithms such as "leaky bucket" [1] and "token
bucket" [2]. Both algorithms have several variations, and in some ways they
are very similar algorithms viewed from different points of view.
It would possibly be easier to conceptualize and understand a rate limiting
algorithm if the common algorithm names and implementation choices mentioned
in textbooks were referenced in the implementation.
It seems that a "token bucket" type of algorithm can be used to implement
rate limiting with bursting. In the token bucket algorithm, the size of the
token bucket defines how large bursts will be allowed. The design could
also be something where two rate limiters with different types of algorithms
and/or configuration parameters are combined to achieve a desired behavior,
for example, a rate limiter with bursting and a fixed maximum rate.
By default, the token bucket algorithm doesn't enforce a maximum rate for
bursts, but that could be achieved by chaining two rate limiters if that is
really needed.

The current Pulsar rate limiter could be implemented in a cleaner way, which
would also be more efficient. Instead of having a scheduler call a method
once per second, I think that the rate limiter could be implemented in a
reactive way, where the algorithm works without a scheduler.
I wonder if there are others that would be interested in getting down into
such implementation details?

1 - https://en.wikipedia.org/wiki/Leaky_bucket
2 - https://en.wikipedia.org/wiki/Token_bucket
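A minimal sketch of a token bucket with lazy, scheduler-free refill (illustrative only; the parameter names and numbers are made up):

```python
class TokenBucket:
    """Minimal token bucket: `rate` tokens/s refill, `capacity` bounds the
    burst size. A sketch of the textbook algorithm, not Pulsar's code."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def try_acquire(self, n, now):
        # Refill lazily from elapsed time - no once-per-second scheduler
        # thread, which is the "reactive" style mentioned above.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False

# 100 units/s sustained rate; a bucket of 50 allows a one-off burst of 50.
tb = TokenBucket(rate=100, capacity=50, now=0.0)
assert tb.try_acquire(50, now=0.0)      # burst drains the full bucket
assert not tb.try_acquire(50, now=0.0)  # empty: further burst is rejected
assert tb.try_acquire(10, now=0.1)      # 0.1 s later, 10 tokens refilled
```

Chaining two such buckets, one sized for the long-term rate with a large capacity and one with a higher rate but small capacity, would give the "bursting with a fixed maximum rate" behavior described above.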


> 2. What type of data loss are you experiencing?
> >
>
> Messages produced by the producers which eventually get timed out due to
> rate limiting.
>

Are you able to slow down producing on the client side? If that is
possible, there could be ways to improve client-side backpressure in the
Pulsar client. Currently, the client doesn't expose this information until
sending blocks or fails with an exception (ProducerQueueIsFullError).
Optimally, the client should slow down the rate of producing to the rate at
which it can actually send to the broker.
Just curious: have you considered turning off produce timeouts on the
client side completely, or making them longer? Would that address the data
loss problem?
Or is your event/message source "hot", so that you cannot stop or slow it
down, and it will just keep on flowing at a certain rate?

> I think the core implementation of how the broker fails fast at the time
> of rate limiting (whether it is by pausing netty channel or a new permits
> based model) does not change the actual issue I am targeting. Multiplexing
> has some impact on it - but yet again only limited, and can easily be fixed
> by the client by increasing the connections per broker. Even after assuming
> both these things are somehow "fixed", the fact remains that an absolutely
> strict rate limiter will lead to the above mentioned data loss for burst
> going above the limit and that a poller based rate limiter doesn't really
> rate limit anything as it allows all produce in the first interval of the
> next second.
>

Yes, it makes sense to have bursting configuration parameters in the rate
limiter.
As mentioned above, I think we could be improving the existing rate limiter
in Pulsar to cover 99% of the use case by making it stable and by including
the bursting configuration options.
Is there additional functionality you feel the rate limiter needs beyond
bursting support?

One way to workaround the multiplexing problem would be to add a client
side option for producers and consumers, where you could specify that the
client picks a separate TCP/IP connection that is not shared and isn't from
the connection pool.
Preventing connection multiplexing seems to be the only way to make the
current rate limiting deterministic and stable without adding the explicit
flow control to the 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-04 Thread Lari Hotari
One additional question:

In your use case, do you have multiple producers concurrently producing to
the same topic from different clients?

This use case is challenging in the current implementation when using
topic-level produce rate limiting. The problem is that different producers
will be able to send messages at very different rates, since there isn't a
solution to ensure fairness across multiple producers in the topic-level
produce rate limiting solution. This is something that should be addressed
when improving rate limiting.
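The fairness gap can be illustrated with a toy model (both policies are hypothetical sketches for discussion, not proposals for Pulsar's actual implementation):

```python
def drain_shared(bucket_tokens, demands):
    """Shared topic-level budget, first-come-first-served: an aggressive
    producer can consume the whole budget before others get a turn - the
    unfairness described above. `demands` maps producer -> requested tokens."""
    granted = {}
    for producer, want in demands.items():
        take = min(want, bucket_tokens)
        granted[producer] = take
        bucket_tokens -= take
    return granted

def drain_round_robin(bucket_tokens, demands, quantum=10):
    """Sketch of a fairer policy: hand out tokens in small quanta,
    round-robin across producers, until the budget or all demands run out."""
    granted = {p: 0 for p in demands}
    remaining = dict(demands)
    while bucket_tokens > 0 and any(v > 0 for v in remaining.values()):
        for p in demands:
            if remaining[p] > 0 and bucket_tokens > 0:
                take = min(quantum, remaining[p], bucket_tokens)
                granted[p] += take
                remaining[p] -= take
                bucket_tokens -= take
    return granted

# Producer A asks for 100 tokens, producer B for 20, but only 60 exist:
assert drain_shared(60, {"A": 100, "B": 20}) == {"A": 60, "B": 0}
assert drain_round_robin(60, {"A": 100, "B": 20}) == {"A": 40, "B": 20}
```

In the first policy, producer B is starved entirely; in the second, B's full demand is served and only the aggressive producer is throttled.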

-Lari

la 4. marrask. 2023 klo 17.24 Lari Hotari  kirjoitti:

> Replies inline
>
> On Fri, 3 Nov 2023 at 20:48, Girish Sharma 
> wrote:
>
>> Could you please elaborate more on these details? Here are some questions:
>> > 1. What do you mean that it is too strict?
>> > - Should the rate limiting allow bursting over the limit for some
>> time?
>> >
>>
>> That's one of the major use cases, yes.
>>
>
> One possibility would be to improve the existing rate limiter to allow
> bursting.
> I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
> use cases instead of having one implementing their own rate limiter
> algorithm.
> The problems you are describing seem to be common to many Pulsar use
> cases, and therefore, I think they should be handled directly in Pulsar.
>
> Optimally, there would be a single solution that abstracts the rate
> limiting in a way where it does the right thing based on the declarative
> configuration.
> I would prefer that over having a pluggable solution for rate limiter
> implementations.
>
> What would help is getting deeper in the design of the rate limiter
> itself, without limiting ourselves to the existing rate limiter
> implementation in Pulsar.
>
> In textbooks, there are algorithms such as "leaky bucket" [1] and "token
> bucket" [2]. Both algorithms have several variations and in some ways they
> are very similar algorithms but looking from the different point of view.
> It would possibly be easier to conceptualize and understand a rate limiting
> algorithm if common algorithm names and implementation choices mentioned in
> textbooks would be referenced in the implementation.
> It seems that a "token bucket" type of algorithm can be used to implement
> rate limiting with bursting. In the token bucket algorithm, the size of the
> token bucket defines how large bursts will be allowed. The design could
> also be something where 2 rate limiters with different type of algorithms
> and/or configuration parameters are combined to achieve a desired behavior.
> For example, to achieve a rate limiter with bursting and a fixed maximum
> rate.
> By default, the token bucket algorithm doesn't enforce a maximum rate for
> bursts, but that could be achieved by chaining 2 rate limiters if that is
> really needed.
>
> The current Pulsar rate limiter implementation could be implemented in a
> cleaner way, which would also be more efficient. Instead of having a
> scheduler call a method once per second, I think that the rate limiter
> could be implemented in a reactive way where the algorithm is implemented
> without a scheduler.
> I wonder if there are others that would be interested in getting down into
> such implementation details?
>
> 1 - https://en.wikipedia.org/wiki/Leaky_bucket
> 2 - https://en.wikipedia.org/wiki/Token_bucket
>
>
> > 2. What type of data loss are you experiencing?
>> >
>>
>> Messages produced by the producers which eventually get timed out due to
>> rate limiting.
>>
>
> Are you able to slow down producing on the client side? If that is
> possible, there could be ways to improve ways to do client side back
> pressure with Pulsar Client. Currently, the client doesn't expose this
> information until the sending blocks or fails with an exception
> (ProducerQueueIsFullError). Optimally, the client should slow down the rate
> of producing to the rate that it can actually send to the broker.
> Just curious if you have considered turning off producing timeouts on the
> client side completely or making them longer? Would that address the data
> loss problem?
> Or is your event/message source "hot" so that you cannot stop or slow it
> down, and it will just keep on flowing with a certain rate?
>
> > I think the core implementation of how the broker fails fast at the time
>> of rate limiting (whether it is by pausing netty channel or a new permits
>> based model) does not change the actual issue I am targeting. Multiplexing
>> has some impact on it - but yet again only limited, and can easily be
>> fixed
>> by the client by increasing the connections per broker. Even after
>> assuming
>> both these things are somehow "fixed", the fact remains that an absolutely
>> strict rate limiter will lead to the above mentioned data loss for burst
>> going above the limit and that a poller based rate limiter doesn't really
>> rate limit anything as it allows all produce in the first interval of the
>> next second.
>>
>
> Yes, it makes sense to 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-04 Thread Girish Sharma
Replies inline.

On Sat, Nov 4, 2023 at 8:55 PM Lari Hotari  wrote:

>
> One possibility would be to improve the existing rate limiter to allow
> bursting.
> I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
> use cases instead of having one implementing their own rate limiter
> algorithm.
>

There are challenges in this. As explained in the PIP, there are several
different usages of the rate limiter - stats, unloading, etc. While I am
open to having a burstable rate limiter in Pulsar out of the box, it might
complicate things considering backward compatibility etc. More on this below.


> The problems you are describing seem to be common to many Pulsar use cases,
> and therefore, I think they should be handled directly in Pulsar.
>

I personally haven't seen many burstability-related discussions, so this
feature might actually not be that useful for all current Pulsar users.


>
> Optimally, there would be a single solution that abstracts the rate
> limiting in a way where it does the right thing based on the declarative
> configuration.
> I would prefer that over having a pluggable solution for rate limiter
> implementations.
>
> What would help is getting deeper in the design of the rate limiter itself,
> without limiting ourselves to the existing rate limiter implementation in
> Pulsar.
>

I would personally suggest we tackle this problem in parts so that it's
available incrementally over versions, rather than making the scope so big
that it takes Pulsar 4.0 for these features to land.


>
> In textbooks, there are algorithms such as "leaky bucket" [1] and "token
> bucket" [2]. Both algorithms have several variations and in some ways they
> are very similar algorithms but looking from the different point of view.
> It would possibly be easier to conceptualize and understand a rate limiting
> algorithm if common algorithm names and implementation choices mentioned in
> textbooks would be referenced in the implementation.
> It seems that a "token bucket" type of algorithm can be used to implement
> rate limiting with bursting. In the token bucket algorithm, the size of the
> token bucket defines how large bursts will be allowed. The design could
> also be something where 2 rate limiters with different type of algorithms
> and/or configuration parameters are combined to achieve a desired behavior.
> For example, to achieve a rate limiter with bursting and a fixed maximum
> rate.
> By default, the token bucket algorithm doesn't enforce a maximum rate for
> bursts, but that could be achieved by chaining 2 rate limiters if that is
> really needed.
>

Yes, we were also thinking along the same lines once this is pluggable. The
idea was to have some numbers and real-world usage backing an
implementation of a rate limiter before merging it back into Pulsar. Any
decision we take right now would be informed only by theoretical
discussion of the implementation and our assumption that it covers 99% of
the use cases, probably just like how the precise and poller ones came into
being.
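For concreteness, the textbook token bucket discussed above can be sketched
in a few lines of Java. This is purely illustrative (hypothetical class and
method names, not Pulsar code): tokens are refilled lazily from elapsed
time, which also shows how such a limiter can work without any scheduler
thread.

```java
/**
 * Illustrative token bucket (hypothetical names, not Pulsar code).
 * Tokens are refilled lazily from elapsed time, so no scheduler thread
 * is needed; the bucket capacity bounds the maximum burst size.
 */
public class TokenBucket {
    private final double ratePerSecond; // long-term average rate
    private final double capacity;      // max buffered tokens = max burst
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double ratePerSecond, double capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Try to consume n tokens; returns false if the caller should throttle. */
    public synchronized boolean tryAcquire(double n) {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= n) {
            tokens -= n;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 10 tokens/s average, burst buffer of 100 tokens
        TokenBucket bucket = new TokenBucket(10, 100);
        System.out.println(bucket.tryAcquire(100)); // burst within capacity
        System.out.println(bucket.tryAcquire(50));  // drained: refills at only 10/s
    }
}
```

Here the refill rate sets the long-term average, while the capacity bounds
how large a burst the buffered tokens can absorb.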


>
> The current Pulsar rate limiter implementation could be implemented in a
> cleaner way, which would also be more efficient. Instead of having a
> scheduler call a method once per second, I think that the rate limiter
> could be implemented in a reactive way where the algorithm is implemented
> without a scheduler.
> I wonder if there are others that would be interested in getting down into
> such implementation details?
>

That would mean touching the RateLimiter.java class, while my goal is to
improve/touch the outer classes, mainly around the
PublishRateLimiter.java interface.

With this discussion taking a much bigger turn, how or where do we limit
this PIP's scope? I am happy to work on follow up PIPs which may arrive out
of this discussion.


> Are you able to slow down producing on the client side? If that is
> possible, there could be ways to improve ways to do client side back
> pressure with Pulsar Client. Currently, the client doesn't expose this
> information until the sending blocks or fails with an exception
> (ProducerQueueIsFullError). Optimally, the client should slow down the rate
> of producing to the rate that it can actually send to the broker.
> Just curious if you have considered turning off producing timeouts on the
> client side completely or making them longer? Would that address the data
> loss problem?
> Or is your event/message source "hot" so that you cannot stop or slow it
> down, and it will just keep on flowing with a certain rate?
>
Mostly, the sources are hot sources and can't be slowed down. The lack of a
clear error message (a client-server protocol limitation) is also another
issue that I was planning to tackle in a separate PIP.


> Yes, it makes sense to have bursting configuration parameters in the rate
> limiter.
> As mentioned above, I think we could be improving the existing rate limiter
> in Pulsar to cover 99% of the use case by making it stable and by including
> the bursting 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-04 Thread Girish Sharma
On Sat, Nov 4, 2023 at 9:21 PM Lari Hotari  wrote:

> One additional question:
>
> In your use case, do you have multiple producers concurrently producing to
> the same topic from different clients?
>
> The use case is challenging in the current implementation when using topic
> producing rate limiting. The problem is that the different producers will
> be able to send messages at very different rates since there isn't a
> solution to ensure fairness across multiple producers in the topic producer
> rate limiting solution. This is something that should be addressed when
> improving rate limiting.
>

We haven't personally seen this pattern where different logical producers
(app/object) produce to the same topic. I feel like this is an
anti-pattern and goes against the homogeneous nature of a topic.
Even if such use cases arrive, they can easily be handled by
moving the different producers to different topics, with the consumers
subscribing to more than one topic in case they need data from all of those
topics.

Regards


>
> -Lari
>
> la 4. marrask. 2023 klo 17.24 Lari Hotari  kirjoitti:
>
> > Replies inline
> >
> > On Fri, 3 Nov 2023 at 20:48, Girish Sharma 
> > wrote:
> >
> >> Could you please elaborate more on these details? Here are some
> questions:
> >> > 1. What do you mean that it is too strict?
> >> > - Should the rate limiting allow bursting over the limit for some
> >> time?
> >> >
> >>
> >> That's one of the major use cases, yes.
> >>
> >
> > One possibility would be to improve the existing rate limiter to allow
> > bursting.
> > I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
> > use cases instead of having one implementing their own rate limiter
> > algorithm.
> > The problems you are describing seem to be common to many Pulsar use
> > cases, and therefore, I think they should be handled directly in Pulsar.
> >
> > Optimally, there would be a single solution that abstracts the rate
> > limiting in a way where it does the right thing based on the declarative
> > configuration.
> > I would prefer that over having a pluggable solution for rate limiter
> > implementations.
> >
> > What would help is getting deeper in the design of the rate limiter
> > itself, without limiting ourselves to the existing rate limiter
> > implementation in Pulsar.
> >
> > In textbooks, there are algorithms such as "leaky bucket" [1] and "token
> > bucket" [2]. Both algorithms have several variations and in some ways
> they
> > are very similar algorithms but looking from the different point of view.
> > It would possibly be easier to conceptualize and understand a rate
> limiting
> > algorithm if common algorithm names and implementation choices mentioned
> in
> > textbooks would be referenced in the implementation.
> > It seems that a "token bucket" type of algorithm can be used to implement
> > rate limiting with bursting. In the token bucket algorithm, the size of
> the
> > token bucket defines how large bursts will be allowed. The design could
> > also be something where 2 rate limiters with different type of algorithms
> > and/or configuration parameters are combined to achieve a desired
> behavior.
> > For example, to achieve a rate limiter with bursting and a fixed maximum
> > rate.
> > By default, the token bucket algorithm doesn't enforce a maximum rate for
> > bursts, but that could be achieved by chaining 2 rate limiters if that is
> > really needed.
> >
> > The current Pulsar rate limiter implementation could be implemented in a
> > cleaner way, which would also be more efficient. Instead of having a
> > scheduler call a method once per second, I think that the rate limiter
> > could be implemented in a reactive way where the algorithm is implemented
> > without a scheduler.
> > I wonder if there are others that would be interested in getting down
> into
> > such implementation details?
> >
> > 1 - https://en.wikipedia.org/wiki/Leaky_bucket
> > 2 - https://en.wikipedia.org/wiki/Token_bucket
> >
> >
> > > 2. What type of data loss are you experiencing?
> >> >
> >>
> >> Messages produced by the producers which eventually get timed out due to
> >> rate limiting.
> >>
> >
> > Are you able to slow down producing on the client side? If that is
> > possible, there could be ways to improve ways to do client side back
> > pressure with Pulsar Client. Currently, the client doesn't expose this
> > information until the sending blocks or fails with an exception
> > (ProducerQueueIsFullError). Optimally, the client should slow down the
> rate
> > of producing to the rate that it can actually send to the broker.
> > Just curious if you have considered turning off producing timeouts on the
> > client side completely or making them longer? Would that address the data
> > loss problem?
> > Or is your event/message source "hot" so that you cannot stop or slow it
> > down, and it will just keep on flowing with a certain rate?
> >
> > > I think the core implementation of how the broker 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-06 Thread Lari Hotari
Hi Girish,

Replies inline. We are getting into a very detailed discussion. We
could also discuss this topic in one of the upcoming Pulsar Community
meetings. However, I might miss the next meeting that is scheduled
this Thursday.
Although I am currently opposed to your proposal PIP-310, I do
support solving your problems related to rate limiting. :)
Let's continue the discussion since that is necessary for us to
make progress. I hope this makes sense from your perspective.

On Sat, 4 Nov 2023 at 17:53, Girish Sharma  wrote:
>
> There are challenges in this. As explained in the PIP, there are several
> different usages of rate limiter, stats, unloading, etc. While I am open to
> having a burstable rate limiter in pulsar out of box, it might complicate
> things considering backward compatibility etc. More on this below.
>

I acknowledge that there are different usages, but my assumption is
that we could implement a generic solution that could be configured to
handle each specific use case.
I haven't yet seen any evidence that the requirements in your case are
so special that it justifies adding a pluggable interface for rate
limiters. Exposing yet another pluggable interface in Pulsar will add
complexity without gains. Each supported public interface is a
maintenance burden if we care about the quality of the exposed
interfaces and put effort in ensuring that the interfaces are
supported in future versions. Exposing an interface will also lock
down or slow down some future refactorings.
There will be a need to refactor and improve rate limiters as part of
the flow control and back pressure improvements within the Pulsar
broker. I'd rather keep the rate limiter internal interfaces an
internal implementation detail instead of leaking the details to an
exposed public interface. That's why we should primarily look for a
generic solution. I hope we could put effort in looking into the
characteristics of your requirements and attempt to sketch a design
for a generic solution that could be configured for your purposes.

> > The problems you are describing seem to be common to many Pulsar use cases,
> > and therefore, I think they should be handled directly in Pulsar.
> >
>
> I personally haven't seen many burstability related discussions; so this
> feature might actually not be that useful for all current Pulsar users.

In general there are not many advanced discussions on the mailing list
about the Pulsar internals.
It may also be difficult for others to recognize that specific
behaviors could be resolved by enhancing flow control, back pressure,
and rate limiting/throttling mechanisms.

> I would personally suggest we tackle this problem in parts so that it's
> available incrementally over versions rather than making the scope so big
> that it takes pulsar 4.0 for these features to land.

Sure, quick delivery time is the goal of everyone. Before talking
about schedules, we should be able to discuss the use case and the
design of the desired type of rate limiting and throttling in more
depth.

One concrete example of this is the desired behavior of bursting. In
token bucket rate limiting, bursting is about using the buffered
tokens in the "token bucket" and having a configurable limit for the
buffer (the "bucket"). This buffer will usually only contain tokens
when the actual rate has been lower than the configured maximum rate
for some duration.

However, there could be an expectation for a different type of
bursting, which is more like "auto scaling" of the rate limit, in a way
where the end-to-end latency of the produced messages
is taken into account. The expected behavior might be about scaling
the rate temporarily to a higher rate so that the queues can be
cleared and the latency of the messages being sent stays under a
target latency. The current
org.apache.pulsar.broker.service.PublishRateLimiter interface cannot
control the aspects that would be needed to handle this type of bursting,
where we actually need to scale up the rate limit based on end-to-end
feedback. The PublishRateLimiter interface doesn't have feedback
loops currently.

I don't know what "bursting" means for you. Would it be possible to
provide concrete examples of desired behavior? That would be very
helpful in making progress.

> Yes, we were also thinking on the same terms once this is pluggable. The
> idea was to have some numbers and real world usage backing an
> implementation of rate limiter before merging it back into pulsar. Any
> decision we would take right now would be limited only by theoretical
> discussion of the implementation and our assumption that it covers 99% of
> the use cases, probably just like how the precise and poller ones came into
> being.

This will remain a theoretical discussion without concrete examples of
the desired behavior or the problem we want to solve.
Initially, we need to have a good grasp of the problem we want to
solve. The solution might change while we experiment and iterate. It
will be 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-06 Thread Girish Sharma
Hello Lari, inline once again.


On Mon, Nov 6, 2023 at 5:44 PM Lari Hotari  wrote:

> Hi Girish,
>
> Replies inline. We are getting into a very detailed discussion. We
> could also discuss this topic in one of the upcoming Pulsar Community
> meetings. However, I might miss the next meeting that is scheduled
> this Thursday.
>

Is this every Thursday? I am willing to meet at a separate time as well if
enough folks with a viewpoint on this can meet together. I assume the
community meeting has a much bigger agenda, so detailed discussions aren't
possible there?



> Although I am currently opposing to your proposal PIP-310, I am
> supporting solving your problems related to rate limiting. :)
> Let's continue the discussion since that is necessary so that we could
> make progress. I hope this makes sense from your perspective.
>
>
It is all good, as long as the final goal is met within reasonable
timelines.


>
> I acknowledge that there are different usages, but my assumption is
> that we could implement a generic solution that could be configured to
> handle each specific use case.
> I haven't yet seen any evidence that the requirements in your case are
> so special that it justifies adding a pluggable interface for rate
>

Well, the blacklisting use case is a very specific use case. I am
explaining below why that can't be done using metrics and a separate
blacklisting API.


> limiters. Exposing yet another pluggable interface in Pulsar will add
> complexity without gains. Each supported public interface is a
> maintenance burden if we care about the quality of the exposed
> interfaces and put effort in ensuring that the interfaces are
> supported in future versions. Exposing an interface will also lock
> down or slow down some future refactorings.
>

This actually might be a blessing in disguise, at least for RateLimiter and
PublishRateLimiter.java: being internal interfaces, they have grown out of
hand, unchecked. Explained more below.


> One concrete example of this is the desired behavior of bursting. In
> token bucket rate limiting, bursting is about using the buffered
> tokens in the "token bucket" and having a configurable limit for the
> buffer (the "bucket"). This buffer will usually only contain tokens
> when the actual rate has been lower than the configured maximum rate
> for some duration.
>
> However, there could be an expectation for a different type of
> bursting which is more like "auto scaling" of the rate limit in a way
> where the end-to-end latency of the produced messages
> is taken into account. The expected behavior might be about scaling
> the rate temporarily to a higher rate so that the queues can be
>

I would like to keep auto-scaling out of scope for this discussion. That
opens up another huge can of worms, especially given the gaps in proper
scale-down support in Pulsar.


>
> I don't know what "bursting" means for you. Would it be possible to
> provide concrete examples of desired behavior? That would be very
> helpful in making progress.
>
>
Here are a few different use cases:

   - A producer(s) is producing at a near-constant rate into a topic, with
   equal distribution among partitions. Due to a hiccup in their downstream
   component, the produce rate goes to 0 for a few seconds, and thus, to
   compensate, the produce rate tries to double up over the next few seconds.
   - In a visitor-based produce rate (where the produce rate goes up in the
   day and down at night - think of a popular website's hourly view-count
   pattern), there are cases when, due to certain external/internal
   triggers, the views - and thus the produce rate - spike for a few minutes.
   It is also important to keep this in check so as to not allow bots to
   DDoS your system; while that might be the responsibility of an upstream
   system like an API gateway, we cannot be completely ignorant of it.
   - In streaming systems, where there are micro-batches, there might be
   constant fluctuations in the produce rate from time to time, based on
   batch failures or retries.

In all of these situations, setting the throughput of the topic to be the
absolute maximum of the various spikes observed during the day is very
suboptimal.

Moreover, in each of these situations, once bursting support is present in
the system, it would also need proper checks in place to penalize
producers that try to misuse the system. In a true multi-tenant
platform, this is very critical. Thus, blacklisting actually goes hand in
hand here. Explained more below.



> It's interesting that you mention that you would like to improve the
> PublishRateLimiter interface.
> How would you change it?
>
>
The current interface of PublishRateLimiter has duplicate methods. I am
assuming that after the initial implementation (poller), the next
implementation simply added more methods to the interface rather than
reusing the ones that already existed.
For instance, there are both `tryAcquire` and 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-07 Thread Lari Hotari
Hi Girish,

I think we are starting to get into the concrete details of rate
limiters and how we could start improving the existing feature.
It is very helpful that you are sharing your insight and experience of
operating Pulsar at scale.
Replies inline.

On Mon, 6 Nov 2023 at 15:37, Girish Sharma  wrote:
> Is this every thursday? I am willing to meet at a separate time as well if
> enough folks with a viewpoint on this can meet together. I assume that the
> community meeting has a much bigger agenda with detailed discussions not
> possible?

It is bi-weekly on Thursdays. The meeting calendar, zoom link and
meeting notes can be found at
https://github.com/apache/pulsar/wiki/Community-Meetings .

> It is all good, as long as the final goal is met within reasonable
> timelines.

+1

> Well, the blacklisting use case is a very specific use case. I am
> explaining below why that can't be done using metrics and a separate
> blacklisting API.

ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
metrics via Prometheus. There might be other ways to provide this
information for components that could react to this.
For example, it could be a system topic where these rate limiters emit events.

> This actually might be a blessing in disguise, at least for RateLimiter and
> PublishRateLimiter.java, being an internal interface, it has gone out of
> hand and unchecked. Explained more below.

I agree. It is hard to reason about the existing solution.


> I would like to keep auto-scaling out of scope for this discussion. That
> opens up another huge can of worms, specially given the gaps in proper
> scale down support in pulsar.

I agree. I just brought up this example to ensure that your
expectation about bursting isn't about controlling the rate limits
based on situational information, such as end-to-end latency
information.
Such a feature could be useful, but it does complicate things.
However, I think it's good to keep this on the radar since this might
be needed to solve some advanced use cases.

> > I don't know what "bursting" means for you. Would it be possible to
> > provide concrete examples of desired behavior? That would be very
> > helpful in making progress.
> >
> >
> Here are a few different use cases:

These use cases clarify your requirements a lot. Thanks for sharing.

>- A producer(s) is producing at a near constant rate into a topic, with
>equal distribution among partitions. Due to a hiccup in their downstream
>component, the produce rate goes to 0 for a few seconds, and thus, to
>compensate, in the next few seconds, the produce rate tries to double up.

Could you also elaborate on details such as what is the current
behavior of Pulsar rate limiting / throttling solution and what would
be the desired behavior?
Just guessing that you mean that the desired behavior would be to
allow the produce rate to double up for some time (configurable)?
Compared to what rate is it doubled?
Please explain in detail what the current and desired behaviors would
be so that it's easier to understand the gap.

>- In a visitor based produce rate (where produce rate goes up in the day
>and goes down in the night, think in terms of popular website hourly view
>counts pattern) , there are cases when, due to certain external/internal
>triggers, the views - and thus - the produce rate spikes for a few minutes.

Again, please explain the current behavior and desired behavior.
Explicit example values of number of messages, bandwidth, etc. would
also be helpful details.

>It is also important to keep this in check so as to not allow bots to do
>DDOS into your system, while that might be a responsibility of an upstream
>system like API gateway, but we cannot be ignorant about that completely.

what would be the desired behavior?

>- In streaming systems, where there are micro batches, there might be
>constant fluctuations in produce rate from time to time, based on batch
>failure or retries.

could you share and examples with numbers about this use case too?
explaining current behavior and desired behavior?


>
> In all of these situations, setting the throughput of the topic to be the
> absolute maximum of the various spikes observed during the day is very
> suboptimal.

btw. A plain token bucket algorithm implementation doesn't have an
absolute maximum. The maximum average rate is controlled with the rate
of tokens added to the bucket. The capacity of the bucket controls how
much buffer there is to spend on spikes. If there's a need to also set
an absolute maximum rate, 2 token buckets could be chained to handle
that case. The second rate limiter could have an average rate of the
absolute maximum with a relatively small buffer (token bucket
capacity). There might also be more sophisticated algorithms which
vary the maximum rate in some way to smoothen spikes, but that might
just be completely unnecessary in Pulsar rate limiters.
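A rough sketch of that chaining idea (illustrative Java only; the names are
hypothetical and this is not Pulsar code): the first bucket enforces the
long-term average rate with a large burst buffer, the second enforces the
absolute maximum rate with a small buffer, and a message must obtain tokens
from both.

```java
/**
 * Illustrative "chained" limiter (hypothetical names, not Pulsar code):
 * one token bucket enforces the long-term average rate with a large burst
 * buffer; a second bucket with a higher rate but a small buffer caps the
 * absolute instantaneous rate.
 */
public class ChainedRateLimiter {
    static final class Bucket {
        final double rate;     // tokens per second
        final double capacity; // maximum buffered tokens
        double tokens;
        long last;

        Bucket(double rate, double capacity) {
            this.rate = rate;
            this.capacity = capacity;
            this.tokens = capacity;
            this.last = System.nanoTime();
        }

        boolean tryAcquire(double n, long now) {
            tokens = Math.min(capacity, tokens + (now - last) / 1e9 * rate);
            last = now;
            if (tokens >= n) {
                tokens -= n;
                return true;
            }
            return false;
        }

        void refund(double n) {
            tokens = Math.min(capacity, tokens + n);
        }
    }

    private final Bucket average; // long-term rate, large buffer (bursting)
    private final Bucket maximum; // absolute cap, small buffer

    public ChainedRateLimiter(double avgRate, double burstCapacity,
                              double maxRate, double maxCapacity) {
        this.average = new Bucket(avgRate, burstCapacity);
        this.maximum = new Bucket(maxRate, maxCapacity);
    }

    public synchronized boolean tryAcquire(double n) {
        long now = System.nanoTime();
        if (!average.tryAcquire(n, now)) {
            return false;
        }
        if (!maximum.tryAcquire(n, now)) {
            average.refund(n); // don't burn average-rate tokens on a rejection
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // average 10 msg/s with a burst buffer of 100; hard cap of 50 msg/s
        ChainedRateLimiter limiter = new ChainedRateLimiter(10, 100, 50, 50);
        System.out.println(limiter.tryAcquire(50)); // within both buckets
        System.out.println(limiter.tryAcquire(40)); // average allows, hard cap rejects
    }
}
```

Refunding the first bucket on rejection keeps a hard-cap rejection from
also consuming the average-rate budget.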
Switching to use a 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-07 Thread Girish Sharma
Hello Lari, replies inline.

I will also be going through some textbook rate limiters (the one you
shared, plus others) and propose the one that at least suits our needs in
the next reply.

On Tue, Nov 7, 2023 at 2:49 PM Lari Hotari  wrote:

>
> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> meeting notes can be found at
> https://github.com/apache/pulsar/wiki/Community-Meetings .
>
>
Would it make sense for me to join this time given that you are skipping it?


>
> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> metrics via Prometheus. There might be other ways to provide this
> information for components that could react to this.
> For example, it could a be system topic where these rate limiters emit
> events.
>
>
Are there any other system topics than `tenant/namespace/__change_events`?
While it's an improvement over querying metrics, it would still mean one
consumer per namespace and would form a cyclic dependency - for example, in
case a broker is degrading due to misuse of bursting, it might lead to
delays in the consumption of events from the __change_events topic.

> I agree. I just brought up this example to ensure that your
> expectation about bursting isn't about controlling the rate limits
> based on situational information, such as end-to-end latency
> information.
> Such a feature could be useful, but it does complicate things.
> However, I think it's good to keep this on the radar since this might
> be needed to solve some advanced use cases.
>
>
I still envision auto-scaling to be admin API driven rather than produce
throughput driven. That way, it remains deterministic in nature. But it
probably doesn't make sense to even talk about it until (partition)
scale-down is possible.


>
> >- A producer(s) is producing at a near constant rate into a topic,
> with
> >equal distribution among partitions. Due to a hiccup in their
> downstream
> >component, the produce rate goes to 0 for a few seconds, and thus, to
> >compensate, in the next few seconds, the produce rate tries to double
> up.
>
> Could you also elaborate on details such as what is the current
> behavior of Pulsar rate limiting / throttling solution and what would
> be the desired behavior?
> Just guessing that you mean that the desired behavior would be to
> allow the produce rate to double up for some time (configurable)?
> Compared to what rate is it doubled?
> Please explain in detail what the current and desired behaviors would
> be so that it's easier to understand the gap.
>

In all of the 3 cases that I listed, the current behavior, with precise
rate limiting enabled, is to pause the netty channel in case the throughput
breaches the set limits. This eventually leads to timeouts on the client
side in case the burst lasts significantly longer than the configured
producer-side send timeout.

The desired behavior in all three situations is to have a multiplier-based
bursting capability for a fixed duration. For example, it could be that a
Pulsar topic would be able to support 1.5x of the set quota for a burst
duration of up to 5 minutes. There also needs to be a cooldown period, such
that only one such burst is accepted every X minutes - say, every 1 hour.
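To make that concrete, here is one possible model of such a policy as an
illustrative Java sketch (hypothetical names, not Pulsar code); the
nanosecond clock is injected so the burst and cooldown windows can be
exercised deterministically.

```java
import java.util.function.LongSupplier;

/**
 * Illustrative sketch of the bursting policy described above (hypothetical
 * names, not Pulsar code): up to burstMultiplier times the base rate is
 * allowed for at most burstNanos, and a new burst is only permitted once
 * cooldownNanos have passed since the previous burst window ended.
 */
public class BurstableRateLimiter {
    private final double baseRatePerSec;
    private final double burstMultiplier; // e.g. 1.5
    private final long burstNanos;        // e.g. 5 minutes
    private final long cooldownNanos;     // e.g. 1 hour
    private final LongSupplier clock;     // nanosecond time source

    private long secondStart;
    private double usedThisSecond;
    private long burstStart = Long.MIN_VALUE; // MIN_VALUE = not bursting
    private long lastBurstEnd;

    public BurstableRateLimiter(double baseRatePerSec, double burstMultiplier,
                                long burstNanos, long cooldownNanos,
                                LongSupplier clock) {
        this.baseRatePerSec = baseRatePerSec;
        this.burstMultiplier = burstMultiplier;
        this.burstNanos = burstNanos;
        this.cooldownNanos = cooldownNanos;
        this.clock = clock;
        this.secondStart = clock.getAsLong();
        this.lastBurstEnd = secondStart - cooldownNanos; // permit an initial burst
    }

    public synchronized boolean tryAcquire(double n) {
        long now = clock.getAsLong();
        if (now - secondStart >= 1_000_000_000L) { // start a new 1-second window
            secondStart = now;
            usedThisSecond = 0;
        }
        if (burstStart != Long.MIN_VALUE && now - burstStart > burstNanos) {
            lastBurstEnd = burstStart + burstNanos; // burst window expired
            burstStart = Long.MIN_VALUE;
        }
        double allowed = baseRatePerSec;
        if (usedThisSecond + n > baseRatePerSec) {
            // Over the base quota: only allowed while a burst is active or available.
            if (burstStart == Long.MIN_VALUE) {
                if (now - lastBurstEnd < cooldownNanos) {
                    return false; // still cooling down from the previous burst
                }
                burstStart = now; // open a new burst window
            }
            allowed = baseRatePerSec * burstMultiplier;
        }
        if (usedThisSecond + n > allowed) {
            return false; // above even the burst quota
        }
        usedThisSecond += n;
        return true;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        // 100 msg/s base, 1.5x burst for up to 5 minutes, 1 hour cooldown
        BurstableRateLimiter limiter = new BurstableRateLimiter(
                100, 1.5, 300_000_000_000L, 3_600_000_000_000L, () -> now[0]);
        System.out.println(limiter.tryAcquire(100)); // base quota
        System.out.println(limiter.tryAcquire(40));  // opens a burst, within 1.5x
        System.out.println(limiter.tryAcquire(20));  // above 1.5x, rejected
    }
}
```

A production version would also need per-second token smoothing rather than
fixed windows, but the burst/cooldown state machine is the point here.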


>
> >- In a visitor based produce rate (where produce rate goes up in the
> day
> >and goes down in the night, think in terms of popular website hourly
> view
> >counts pattern) , there are cases when, due to certain
> external/internal
> >triggers, the views - and thus - the produce rate spikes for a few
> minutes.
>
> Again, please explain the current behavior and desired behavior.
> Explicit example values of number of messages, bandwidth, etc. would
> also be helpful details.
>

Adding to what I wrote above, think of this pattern like the following: the
produce rate slowly increases from ~2MBps at around 4 AM to a known peak of
about 30MBps by 4 PM and stays around that peak until 9 PM, after which it
starts decreasing again until it reaches ~2MBps around 2 AM.
Now, due to some external trigger, maybe a scheduled sale event, at 10 PM
the rate may spike up to 40MBps for 4-5 minutes and then go back
down to the usual ~20MBps. Here is a rough image showcasing the trend.
[image: image.png]


>
> >It is also important to keep this in check so as to not allow bots to
> do
> >DDOS into your system, while that might be a responsibility of an
> upstream
> >system like API gateway, but we cannot be ignorant about that
> completely.
>
> what would be the desired behavior?
>

The desired behavior is that the burst support should be short-lived (5-10
minutes) and limited to a fixed number of bursts in a given duration (say,
1 burst per hour). Obviously, all of these should be configurable, maybe at
a broker level rather than a topic level.


>
> >- In streaming systems, where there are micro batches, there might be
> >constant fluctuations in produce rate from time to time, 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-07 Thread Asaf Mesika
I just want to add one thing to the mix here.

You can see, by the number of plugin interfaces Pulsar has, that somebody
"left the door open" for too long.
You will agree with me that the number of those interfaces is not normal for
any open source software. I know HBase, for example, and Kafka - I've never
seen so many in them.

You can also see the lack of attention to code quality and high-level
overview in the poor implementation of the current rate limiter.

The feeling is: "I just need this tiny little thing and I don't have time" -
so over time Pulsar got into this unmaintainable mess of public APIs, and
some parts are simply unreadable - such as the rate limiters. I *still*
don't understand how rate limiting works in Pulsar, even after reading the
background and quickly browsing through the code.

I can see the people on this thread are highly talented - let's use this to
make Pulsar better, both from a bird's-eye view and from your own
personal requirements.


On Tue, Nov 7, 2023 at 3:26 PM Girish Sharma 
wrote:

> Hello Lari, replies inline.
>
> I will also be going through some textbook rate limiters (the one you
> shared, plus others) and propose the one that at least suits our needs in
> the next reply.
>
> On Tue, Nov 7, 2023 at 2:49 PM Lari Hotari  wrote:
>
>>
>> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
>> meeting notes can be found at
>> https://github.com/apache/pulsar/wiki/Community-Meetings .
>>
>>
> Would it make sense for me to join this time given that you are skipping
> it?
>
>
>>
>> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
>> metrics via Prometheus. There might be other ways to provide this
>> information for components that could react to this.
>> For example, it could a be system topic where these rate limiters emit
>> events.
>>
>>
> Are there any other system topics than `tenent/namespace/__change_events`
> . While it's an improvement over querying metrics, it would still mean one
> consumer per namespace and would form a cyclic dependency - for example, in
> case a broker is degrading due to mis-use of bursting, it might lead to
> delays in the consumption of the event from the __change_events topic.
>
> I agree. I just brought up this example to ensure that your
>> expectation about bursting isn't about controlling the rate limits
>> based on situational information, such as end-to-end latency
>> information.
>> Such a feature could be useful, but it does complicate things.
>> However, I think it's good to keep this on the radar since this might
>> be needed to solve some advanced use cases.
>>
>>
> I still envision auto-scaling to be admin API driven rather than produce
> throughput driven. That way, it remains deterministic in nature. But it
> probably doesn't make sense to even talk about it until (partition)
> scale-down is possible.
>
>
>>
>> >- A producer(s) is producing at a near constant rate into a topic,
>> with
>> >equal distribution among partitions. Due to a hiccup in their
>> downstream
>> >component, the produce rate goes to 0 for a few seconds, and thus, to
>> >compensate, in the next few seconds, the produce rate tries to
>> double up.
>>
>> Could you also elaborate on details such as what is the current
>> behavior of Pulsar rate limiting / throttling solution and what would
>> be the desired behavior?
>> Just guessing that you mean that the desired behavior would be to
>> allow the produce rate to double up for some time (configurable)?
>> Compared to what rate is it doubled?
>> Please explain in detail what the current and desired behaviors would
>> be so that it's easier to understand the gap.
>>
>
> In all of the 3 cases that I listed, the current behavior, with precise
> rate limiting enabled, is to pause the netty channel in case the throughput
> breaches the set limits. This eventually leads to timeout at the client
> side in case the burst is significantly greater than the configured timeout
> on the producer side.
>
> The desired behavior in all three situations is to have a multiplier based
> bursting capability for a fixed duration. For example, it could be that a
> pulsar topic would be able to support 1.5x of the set quota for a burst
> duration of up to 5 minutes. There also needs to be a cooldown period in
> such a case that it would only accept one such burst every X minutes, say
> every 1 hour.
>
>
>>
>> >- In a visitor based produce rate (where produce rate goes up in the
>> day
>> >and goes down in the night, think in terms of popular website hourly
>> view
>> >counts pattern) , there are cases when, due to certain
>> external/internal
>> >triggers, the views - and thus - the produce rate spikes for a few
>> minutes.
>>
>> Again, please explain the current behavior and desired behavior.
>> Explicit example values of number of messages, bandwidth, etc. would
>> also be helpful details.
>>
>
> Adding to what I wrote above, think of this pattern like the following:
> the produce rate 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-07 Thread Lari Hotari
Hi Girish,

Replies inline.

On Tue, 7 Nov 2023 at 15:26, Girish Sharma  wrote:
>
> Hello Lari, replies inline.
>
> I will also be going through some textbook rate limiters (the one you shared, 
> plus others) and propose the one that at least suits our needs in the next 
> reply.


sounds good. I've also been trying to find more rate limiter resources
that could be useful for our design.

Bucket4J's documentation gives some good ideas and shows how the
token bucket algorithm could be varied. For example, the "Refill
styles" section [1] is useful to read as inspiration.
In network routers, there's a concept of "dual token bucket"
algorithms and by googling you can find both Cisco and Juniper
documentation referencing this.
I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].

1 - https://bucket4j.com/8.6.0/toc.html#refill-types
2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
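To make the discussion concrete, the classic token bucket those references describe can be sketched roughly like this (a hypothetical illustration for this thread, not Pulsar's actual implementation; names and the nanosecond-based clock parameter are made up):

```java
// Hypothetical sketch of a classic token bucket rate limiter.
// The bucket holds up to `capacity` tokens and refills continuously
// at `refillPerSecond`; a publish consumes tokens or is rejected.
public class TokenBucket {
    private final long capacity;        // max tokens the bucket can hold
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, long refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;         // start full
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if `amount` tokens could be consumed at time `nowNanos`.
    public boolean tryAcquire(long amount, long nowNanos) {
        double refilled = (nowNanos - lastRefillNanos) * refillPerNano;
        tokens = Math.min(capacity, tokens + refilled);
        lastRefillNanos = nowNanos;
        if (tokens >= amount) {
            tokens -= amount;
            return true;
        }
        return false;
    }
}
```

The "refill styles" in [1] are essentially variations of how and when the refill step above is applied (greedy/continuous vs. interval-based).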

>>
>> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
>> meeting notes can be found at
>> https://github.com/apache/pulsar/wiki/Community-Meetings .
>>
>
> Would it make sense for me to join this time given that you are skipping it?

Yes, it's worth joining regularly when one is participating in Pulsar
core development. There's usually a chance to discuss all the topics that
Pulsar community members bring up for discussion. A few times there
haven't been any participants and in that case, it's good to ask on
the #dev channel on Pulsar Slack whether others are joining the
meeting.

>>
>> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
>> metrics via Prometheus. There might be other ways to provide this
>> information for components that could react to this.
>> For example, it could a be system topic where these rate limiters emit 
>> events.
>>
>
> Are there any other system topics than `tenent/namespace/__change_events` . 
> While it's an improvement over querying metrics, it would still mean one 
> consumer per namespace and would form a cyclic dependency - for example, in 
> case a broker is degrading due to mis-use of bursting, it might lead to 
> delays in the consumption of the event from the __change_events topic.

The volume of events is fairly low, so the possible
degradation wouldn't necessarily impact the event communications for
this purpose. I'm not sure if there are currently any public event
topics in Pulsar that are supported in a way that an application could
directly read from the topic. However, since this is an advanced case,
I would assume that we don't need to focus on this in the first phase
of improving rate limiters.

>
>
>> I agree. I just brought up this example to ensure that your
>> expectation about bursting isn't about controlling the rate limits
>> based on situational information, such as end-to-end latency
>> information.
>> Such a feature could be useful, but it does complicate things.
>> However, I think it's good to keep this on the radar since this might
>> be needed to solve some advanced use cases.
>>
>
> I still envision auto-scaling to be admin API driven rather than produce 
> throughput driven. That way, it remains deterministic in nature. But it 
> probably doesn't make sense to even talk about it until (partition) 
> scale-down is possible.


I agree. It's better to exclude this from the discussion at the moment
so that it doesn't cause confusion. Modifying the rate limits up and
down automatically with some component could be considered out of
scope in the first phase. However, that might be controversial since
the externally observable behavior of rate limiting with bursting
support seems to behave in a way where the rate limit changes
automatically. The "auto scaling" aspect of rate limiting, modifying
the rate limits up and down, might be necessary as part of a rate
limiter implementation eventually. More of that later.

>
> In all of the 3 cases that I listed, the current behavior, with precise rate 
> limiting enabled, is to pause the netty channel in case the throughput 
> breaches the set limits. This eventually leads to timeout at the client side 
> in case the burst is significantly greater than the configured timeout on the 
> producer side.


Makes sense.

>
> The desired behavior in all three situations is to have a multiplier based 
> bursting capability for a fixed duration. For example, it could be that a 
> pulsar topic would be able to support 1.5x of the set quota for a burst 
> duration of up to 5 minutes. There also needs to be a cooldown period in such 
> a case that it would only accept one such burst every X minutes, say every 1 
> hour.


Thanks for sharing this concrete example. It will help a lot when
starting to design a solution which achieves this. I think that a
token bucket algorithm based design can achieve something very close
to what you are describing. In the first part I shared the references
to Bucket4J's "refill styles" and the concept of "dual token bucket"
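A rough sketch of such a dual token bucket (hypothetical illustration only, not Pulsar code): a request must pass both a sustained-rate bucket, which enforces the long-term average and accumulates burst credit, and a peak-rate bucket, which caps how fast that credit may be drained. With avg = quota, peak = 1.5x quota, and burstSeconds = 300, this gives roughly the "1.5x for up to 5 minutes" behavior described above.

```java
// Hypothetical "dual token bucket": a request must pass BOTH buckets.
public class DualTokenBucket {
    static final class Bucket {
        final double capacity;
        final double refillPerNano;
        double tokens;
        long last;
        Bucket(double capacity, double refillPerSecond, long now) {
            this.capacity = capacity;
            this.refillPerNano = refillPerSecond / 1e9;
            this.tokens = capacity;
            this.last = now;
        }
        boolean has(long amount, long now) {
            tokens = Math.min(capacity, tokens + (now - last) * refillPerNano);
            last = now;
            return tokens >= amount;
        }
        void take(long amount) { tokens -= amount; }
    }

    private final Bucket sustained;
    private final Bucket peak;

    public DualTokenBucket(double avgRate, double peakRate, long burstSeconds, long now) {
        // Sustained bucket stores up to burstSeconds * (peak - avg) of unused
        // credit, so a full burst at peakRate lasts about burstSeconds.
        this.sustained = new Bucket(burstSeconds * (peakRate - avgRate), avgRate, now);
        // Peak bucket (one second of capacity) caps the instantaneous rate.
        this.peak = new Bucket(peakRate, peakRate, now);
    }

    public boolean tryAcquire(long amount, long now) {
        if (sustained.has(amount, now) && peak.has(amount, now)) {
            sustained.take(amount);
            peak.take(amount);
            return true;
        }
        return false;
    }
}
```

The cooldown between bursts falls out naturally here: after a burst drains the sustained bucket, it takes burstSeconds at the average rate deficit to refill the credit.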

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-07 Thread Rajan Dhabalia
Hi Lari/Girish,

I am sorry for jumping into the discussion late, but I would like to
acknowledge the requirement for a pluggable publish rate limiter; I had
also asked for it during the implementation of the publish rate limiter.
There are trade-offs between different rate limiter implementations based
on accuracy, network usage, and simplicity, and users should be able to
choose one based on their requirements. However, we don't have a correct
and extensible publish rate limiter interface right now, and before making
it pluggable we have to make sure that it can support any type of
implementation, for example: token-based or sliding-window based
throttling, support for various decaying functions (e.g. exponential decay:
https://en.wikipedia.org/wiki/Exponential_decay), etc. I haven't seen such
interface details and design in the PIP:
https://github.com/apache/pulsar/pull/21399/. So, I would encourage working
towards building a pluggable rate limiter, but the current PIP is not ready,
as it doesn't cover generic interfaces that can support different types of
implementation.
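As a tiny illustration of the exponential-decay style mentioned above (a hypothetical sketch, not an existing Pulsar class; the half-life parameterization is just one common way to express the decay constant):

```java
// Hypothetical exponentially decayed event counter: each recorded batch
// of events decays with the configured half-life, so the counter
// approximates recent activity and could be compared against a threshold
// for throttling decisions.
public class DecayingCounter {
    private final double decayPerNano; // ln(2) / halfLifeNanos
    private double count;              // decayed event count
    private long last;

    public DecayingCounter(double halfLifeSeconds, long nowNanos) {
        this.decayPerNano = Math.log(2) / (halfLifeSeconds * 1e9);
        this.count = 0;
        this.last = nowNanos;
    }

    public void record(long events, long nowNanos) {
        // decay the old value to "now", then add the new contribution
        count = count * Math.exp(-(nowNanos - last) * decayPerNano) + events;
        last = nowNanos;
    }

    public double current(long nowNanos) {
        return count * Math.exp(-(nowNanos - last) * decayPerNano);
    }
}
```

After one half-life, a recorded batch contributes half its original weight, which is the "decaying function" behavior Rajan refers to.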

Thanks,
Rajan

On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari  wrote:

> Hi Girish,
>
> Replies inline.
>
> On Tue, 7 Nov 2023 at 15:26, Girish Sharma 
> wrote:
> >
> > Hello Lari, replies inline.
> >
> > I will also be going through some textbook rate limiters (the one you
> shared, plus others) and propose the one that at least suits our needs in
> the next reply.
>
>
> sounds good. I've been also trying to find more rate limiter resources
> that could be useful for our design.
>
> Bucket4J documentation gives some good ideas and it's shows how the
> token bucket algorithm could be varied. For example, the "Refill
> styles" section [1] is useful to read as an inspiration.
> In network routers, there's a concept of "dual token bucket"
> algorithms and by googling you can find both Cisco and Juniper
> documentation referencing this.
> I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
>
> 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
>
> >>
> >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> >> meeting notes can be found at
> >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> >>
> >
> > Would it make sense for me to join this time given that you are skipping
> it?
>
> Yes, it's worth joining regularly when one is participating in Pulsar
> core development. There's usually a chance to discuss all topics that
> Pulsar community members bring up to discussion. A few times there
> haven't been any participants and in that case, it's good to ask on
> the #dev channel on Pulsar Slack whether others are joining the
> meeting.
>
> >>
> >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> >> metrics via Prometheus. There might be other ways to provide this
> >> information for components that could react to this.
> >> For example, it could a be system topic where these rate limiters emit
> events.
> >>
> >
> > Are there any other system topics than
> `tenent/namespace/__change_events` . While it's an improvement over
> querying metrics, it would still mean one consumer per namespace and would
> form a cyclic dependency - for example, in case a broker is degrading due
> to mis-use of bursting, it might lead to delays in the consumption of the
> event from the __change_events topic.
>
> The number of events is a fairly low volume so the possible
> degradation wouldn't necessarily impact the event communications for
> this purpose. I'm not sure if there are currently any public event
> topics in Pulsar that are supported in a way that an application could
> directly read from the topic.  However since this is an advanced case,
> I would assume that we don't need to focus on this in the first phase
> of improving rate limiters.
>
> >
> >
> >> I agree. I just brought up this example to ensure that your
> >> expectation about bursting isn't about controlling the rate limits
> >> based on situational information, such as end-to-end latency
> >> information.
> >> Such a feature could be useful, but it does complicate things.
> >> However, I think it's good to keep this on the radar since this might
> >> be needed to solve some advanced use cases.
> >>
> >
> > I still envision auto-scaling to be admin API driven rather than produce
> throughput driven. That way, it remains deterministic in nature. But it
> probably doesn't make sense to even talk about it until (partition)
> scale-down is possible.
>
>
> I agree. It's better to exclude this from the discussion at the moment
> so that it doesn't cause confusion. Modifying the rate limits up and
> down automatically with some component could be considered out of
> scope in the first phase. However, that might be controversial since
> the externally observable behavior of rate limiting with bursting
> support seems to be behave in a way where the rate limit changes
> 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Girish Sharma
Hello Rajan,
I haven't updated the PIP with a better interface for PublishRateLimiter
yet as the discussion here in this thread went in a different direction.

Personally, I agree with you that even if we choose one algorithm and
improve the built-in rate limiter, it still may not suit all use cases as
you have mentioned.

On Asaf's comment about too many public interfaces in Pulsar and no other
Apache software having so many - I would like to ask, has that actually
brought any downsides, though? For this particular use case, I feel like
having it as a public interface would actually improve the code quality
and design, as the usage would be checked and changes would go through
scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
Asaf - what are your thoughts on this? Are you okay with making the
PublishRateLimiter pluggable with a better interface?
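To make the question concrete, a minimal strawman of such a pluggable contract might look like this (all names and signatures here are hypothetical, not the existing Pulsar API):

```java
// Strawman of a minimal pluggable publish rate limiter contract.
// NOT the actual Pulsar interface -- purely illustrative.
public interface PublishRateLimiterPlugin {

    // Called when a produce request of `messages`/`bytes` arrives.
    // Returning false asks the broker to apply backpressure.
    boolean tryAcquire(int messages, long bytes);

    // Called when the policy changes (e.g. rate or burst settings).
    void update(double ratePerSecond, double burstMultiplier);

    // Release any scheduled tasks or resources.
    void close();
}
```

The broker would stay in charge of *how* backpressure is applied (e.g. pausing the netty channel); the plugin only decides *whether* a request fits the policy.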





On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia  wrote:

> Hi Lari/Girish,
>
> I am sorry for jumping late in the discussion but I would like to
> acknowledge the requirement of pluggable publish rate-limiter and I had
> also asked it during implementation of publish rate limiter as well. There
> are trade-offs between different rate-limiter implementations based on
> accuracy, n/w usage, simplification and user should be able to choose one
> based on the requirement. However, we don't have correct and extensible
> Publish rate limiter interface right now, and before making it pluggable we
> have to make sure that it should support any type of implementation for
> example: token based or sliding-window based throttling, support of various
> decaying functions (eg: exponential decay:
> https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen
> such
> interface details and design in the PIP:
> https://github.com/apache/pulsar/pull/21399/. So, I would encourage to
> work
> towards building pluggable rate-limiter but current PIP is not ready as it
> doesn't cover such generic interfaces that can support different types of
> implementation.
>
> Thanks,
> Rajan
>
> On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari  wrote:
>
> > Hi Girish,
> >
> > Replies inline.
> >
> > On Tue, 7 Nov 2023 at 15:26, Girish Sharma 
> > wrote:
> > >
> > > Hello Lari, replies inline.
> > >
> > > I will also be going through some textbook rate limiters (the one you
> > shared, plus others) and propose the one that at least suits our needs in
> > the next reply.
> >
> >
> > sounds good. I've been also trying to find more rate limiter resources
> > that could be useful for our design.
> >
> > Bucket4J documentation gives some good ideas and it's shows how the
> > token bucket algorithm could be varied. For example, the "Refill
> > styles" section [1] is useful to read as an inspiration.
> > In network routers, there's a concept of "dual token bucket"
> > algorithms and by googling you can find both Cisco and Juniper
> > documentation referencing this.
> > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
> >
> > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
> >
> > >>
> > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> > >> meeting notes can be found at
> > >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> > >>
> > >
> > > Would it make sense for me to join this time given that you are
> skipping
> > it?
> >
> > Yes, it's worth joining regularly when one is participating in Pulsar
> > core development. There's usually a chance to discuss all topics that
> > Pulsar community members bring up to discussion. A few times there
> > haven't been any participants and in that case, it's good to ask on
> > the #dev channel on Pulsar Slack whether others are joining the
> > meeting.
> >
> > >>
> > >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> > >> metrics via Prometheus. There might be other ways to provide this
> > >> information for components that could react to this.
> > >> For example, it could a be system topic where these rate limiters emit
> > events.
> > >>
> > >
> > > Are there any other system topics than
> > `tenent/namespace/__change_events` . While it's an improvement over
> > querying metrics, it would still mean one consumer per namespace and
> would
> > form a cyclic dependency - for example, in case a broker is degrading due
> > to mis-use of bursting, it might lead to delays in the consumption of the
> > event from the __change_events topic.
> >
> > The number of events is a fairly low volume so the possible
> > degradation wouldn't necessarily impact the event communications for
> > this purpose. I'm not sure if there are currently any public event
> > topics in Pulsar that are supported in a way that an application could
> > directly read from the topic.  However since this is an advanced case,
> > I would assume that we don't need to focus on this in the first phase
> > of 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Lari Hotari
Hi Rajan,

Thank you for sharing your opinion. It appears you are in favor of
pluggable interfaces for rate limiters. I would like to offer a
perspective on why we should defer the pluggability aspect of rate
limiters. If there is a real need, it can be considered later.

Currently, there's a pressing need to enhance the built-in rate
limiters in Pulsar. The current state is not good; our default rate
limiter is nearly ineffective and unusable at scale due to its high
CPU usage and overhead. Additionally, while the precise rate limiter
doesn't have CPU issues, it does lack the capability for configuring
an average rate over an extended period.

I propose we eliminate the precise rate limiter as a distinct option
and converge towards a single, configurable rate limiter solution.
Using terms like "precise" in our configuration options exposes
unnecessary implementation details and is a practice we should avoid.

Our abstractions should be designed to allow for underlying
improvements without compromising the established "contract" of the
feature. For rate limiting, we can maintain the current external
behavior while completely overhauling the internal workings.

Introducing "bursting" features, which enable average rate
calculations over longer durations, will necessitate additional
configuration options to define this time window and possibly a
separate maximum rate.
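Purely as an illustration of what those additional options might look like (the key names below are hypothetical and do not exist in Pulsar's configuration today):

```properties
# Hypothetical broker.conf sketch -- illustrative key names only.
# Long-term average publish rate the limiter enforces:
publishRateInMsgPerSecond=1000
# Separate maximum rate allowed while bursting:
publishBurstRateInMsgPerSecond=1500
# Time window over which the average is calculated / a burst may last:
publishBurstDurationSeconds=300
```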

Let's prioritize refining the core rate-limiting capabilities of
Pulsar. Afterwards, we can revisit the idea of pluggable rate limiters
if they still seem necessary.
Concentrating on Pulsar's built-in rate limiters will also help us
identify the interfaces needed for pluggable rate limiters.
Moreover, by focusing on the actual rate limiting behavior and
collating Pulsar user requirements, we can potentially design generic
solutions that address the majority of needs within Pulsar's core.

Hence, our immediate goal should be to advance these improvements and
integrate them into Pulsar's core as soon as possible.

-Lari


On Wed, 8 Nov 2023 at 02:13, Rajan Dhabalia  wrote:
>
> Hi Lari/Girish,
>
> I am sorry for jumping late in the discussion but I would like to
> acknowledge the requirement of pluggable publish rate-limiter and I had
> also asked it during implementation of publish rate limiter as well. There
> are trade-offs between different rate-limiter implementations based on
> accuracy, n/w usage, simplification and user should be able to choose one
> based on the requirement. However, we don't have correct and extensible
> Publish rate limiter interface right now, and before making it pluggable we
> have to make sure that it should support any type of implementation for
> example: token based or sliding-window based throttling, support of various
> decaying functions (eg: exponential decay:
> https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen such
> interface details and design in the PIP:
> https://github.com/apache/pulsar/pull/21399/. So, I would encourage to work
> towards building pluggable rate-limiter but current PIP is not ready as it
> doesn't cover such generic interfaces that can support different types of
> implementation.
>
> Thanks,
> Rajan
>
> On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari  wrote:
>
> > Hi Girish,
> >
> > Replies inline.
> >
> > On Tue, 7 Nov 2023 at 15:26, Girish Sharma 
> > wrote:
> > >
> > > Hello Lari, replies inline.
> > >
> > > I will also be going through some textbook rate limiters (the one you
> > shared, plus others) and propose the one that at least suits our needs in
> > the next reply.
> >
> >
> > sounds good. I've been also trying to find more rate limiter resources
> > that could be useful for our design.
> >
> > Bucket4J documentation gives some good ideas and it's shows how the
> > token bucket algorithm could be varied. For example, the "Refill
> > styles" section [1] is useful to read as an inspiration.
> > In network routers, there's a concept of "dual token bucket"
> > algorithms and by googling you can find both Cisco and Juniper
> > documentation referencing this.
> > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
> >
> > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
> >
> > >>
> > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> > >> meeting notes can be found at
> > >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> > >>
> > >
> > > Would it make sense for me to join this time given that you are skipping
> > it?
> >
> > Yes, it's worth joining regularly when one is participating in Pulsar
> > core development. There's usually a chance to discuss all topics that
> > Pulsar community members bring up to discussion. A few times there
> > haven't been any participants and in that case, it's good to ask on
> > the #dev channel on Pulsar Slack whether others are joining the
> > meeting.
> >
> > >>
> > >> ok. btw. "metrics" doesn't 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Lari Hotari
Hi Girish,

> On Asaf's comment on too many public interfaces in Pulsar and no other
> Apache software having so many public interfaces - I would like to ask, has
> that brought in any con though? For this particular use case, I feel like
> having it has a public interface would actually improve the code quality
> and design as the usage would be checked and changes would go through
> scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> Asaf - what are your thoughts on this? Are you okay with making the
> PublishRateLimiter pluggable with a better interface?

As a reply to this question, I'd like to highlight my previous email
in this thread where I responded to Rajan.
The current state of rate limiting is not acceptable in Pulsar. We
need to fix things in the core.

One of the downsides of adding a lot of pluggability and variation is
the additional fragmentation and complexity that it adds to Pulsar. It
doesn't really make sense if you need to install an external library
to make the rate limiting feature usable in Pulsar. Rate limiting is
only needed when operating Pulsar at scale. The core feature must
support this to make any sense.

A side product of this could be a pluggable interface, if the core
feature cannot be extended to cover the requirements that Pulsar users
really have.
Girish, it has been extremely valuable that you have shared concrete
examples of your rate limiting related requirements. I'm confident
that we will be able to cover the majority of your requirements in
Pulsar core. It might be the case that when you try out the new
features, it might actually cover your needs sufficiently.

Let's keep the focus on improving the rate limiting in Pulsar core.
The possible pluggable interface could follow.


-Lari

On Wed, 8 Nov 2023 at 10:46, Girish Sharma  wrote:
>
> Hello Rajan,
> I haven't updated the PIP with a better interface for PublishRateLimiter
> yet as the discussion here in this thread went in a different direction.
>
> Personally, I agree with you that even if we choose one algorithm and
> improve the built-in rate limiter, it still may not suit all use cases as
> you have mentioned.
>
> On Asaf's comment on too many public interfaces in Pulsar and no other
> Apache software having so many public interfaces - I would like to ask, has
> that brought in any con though? For this particular use case, I feel like
> having it has a public interface would actually improve the code quality
> and design as the usage would be checked and changes would go through
> scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> Asaf - what are your thoughts on this? Are you okay with making the
> PublishRateLimiter pluggable with a better interface?
>
>
>
>
>
> On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia  wrote:
>
> > Hi Lari/Girish,
> >
> > I am sorry for jumping late in the discussion but I would like to
> > acknowledge the requirement of pluggable publish rate-limiter and I had
> > also asked it during implementation of publish rate limiter as well. There
> > are trade-offs between different rate-limiter implementations based on
> > accuracy, n/w usage, simplification and user should be able to choose one
> > based on the requirement. However, we don't have correct and extensible
> > Publish rate limiter interface right now, and before making it pluggable we
> > have to make sure that it should support any type of implementation for
> > example: token based or sliding-window based throttling, support of various
> > decaying functions (eg: exponential decay:
> > https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen
> > such
> > interface details and design in the PIP:
> > https://github.com/apache/pulsar/pull/21399/. So, I would encourage to
> > work
> > towards building pluggable rate-limiter but current PIP is not ready as it
> > doesn't cover such generic interfaces that can support different types of
> > implementation.
> >
> > Thanks,
> > Rajan
> >
> > On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari  wrote:
> >
> > > Hi Girish,
> > >
> > > Replies inline.
> > >
> > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma 
> > > wrote:
> > > >
> > > > Hello Lari, replies inline.
> > > >
> > > > I will also be going through some textbook rate limiters (the one you
> > > shared, plus others) and propose the one that at least suits our needs in
> > > the next reply.
> > >
> > >
> > > sounds good. I've been also trying to find more rate limiter resources
> > > that could be useful for our design.
> > >
> > > Bucket4J documentation gives some good ideas and it's shows how the
> > > token bucket algorithm could be varied. For example, the "Refill
> > > styles" section [1] is useful to read as an inspiration.
> > > In network routers, there's a concept of "dual token bucket"
> > > algorithms and by googling you can find both Cisco and Juniper
> > > documentation referencing this.
> > > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Girish Sharma
Hello Lari, while I am yet to reply to your yesterday's email, I am trying
to wrap up this discussion about the need for a pluggable rate limiter, so
I'm replying to this first.

Comments inline


On Wed, Nov 8, 2023 at 5:35 PM Lari Hotari  wrote:

> Hi Girish,
>
> > On Asaf's comment on too many public interfaces in Pulsar and no other
> > Apache software having so many public interfaces - I would like to ask,
> has
> > that brought in any con though? For this particular use case, I feel like
> > having it has a public interface would actually improve the code quality
> > and design as the usage would be checked and changes would go through
> > scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> > Asaf - what are your thoughts on this? Are you okay with making the
> > PublishRateLimiter pluggable with a better interface?
>
> As a reply to this question, I'd like to highlight my previous email
> in this thread where I responded to Rajan.
> The current state of rate limiting is not acceptable in Pulsar. We
> need to fix things in the core.
>

I wouldn't say that it's not acceptable. The precise one works as expected
as a basic rate limiter. It's only when there are complex requirements that
the current rate limiters fail.
The key here is "complex requirements". I am with Rajan here that
whatever improvements we do to the core built-in rate limiter would always
miss one or another complex requirement.

I feel like we need both things here. We can work on improving the
built-in rate limiter, which would not try to solve all of my needs (like
blacklisting support, limiting the number of bursts in a window, etc.). The
built-in one can be improved with respect to code refactoring,
extensibility, modularization, etc., along with implementing a more
well-known rate limiting algorithm, like token bucket.
Along with this, the newly improved interface can then make way for
pluggable support.


>
> One of the downsides of adding a lot of pluggability and variation is
> the additional fragmentation and complexity that it adds to Pulsar. It
> doesn't really make sense if you need to install an external library
> to make the rate limiting feature usable in Pulsar. Rate limiting is
>

This is assuming that we do not improve the default built-in rate limiter
at all. Think of this like AuthorizationProvider - there is a built-in one;
it's good enough, but each organization has its own requirements for how it
handles authorization, and thus, most likely, any organization with
well-defined AuthN/AuthZ constructs would plug in its own providers
there.

The key thing here would be to make the public interface as minimal as
possible while allowing for custom complex rate limiters as well. I believe
this is easily doable without actually making the internal code more
complex.
In fact, the way I envision the pluggable proposal, things become simpler
with respect to code flow, code ownership and custom if/else.



> only needed when operating Pulsar at scale. The core feature must
> support this to make any sense.
>

I will give another example here. In big organizations, where Pulsar is
actually being used at scale - not just in terms of QPS/MBps, but also in
terms of the number of teams, tenants, namespaces, and unique features
being used - there would always be an in-house schema registry. Thus,
while Pulsar already has a built-in schema service and registry, it is
important that it also supports custom ones. This does not speak badly
about the base Pulsar package, but actually speaks to the adaptability of
the product.


>
> The side product of this could be the pluggable interface if the core
> feature cannot be extended to cover the requirements that Pulsar users
> really have.
> Girish, it has been extremely valuable that you have shared concrete
> examples of your rate limiting related requirements. I'm confident
> that we will be able to cover the majority of your requirements in
> Pulsar core. It might be the case that when you try out the new
> features, it might actually cover your needs sufficiently.
>

I really respect and appreciate the discussions we have had. One of the
problems I've had in the Pulsar community earlier is the lack of
participation. But I am getting a lot of participation this time, so it's
really good.

I am willing to take this forward to improve the default rate limiters,
but since we would _have_ to meet somewhere in the middle, at the end of
all this our organizational requirements would still remain unfulfilled
until we build _all_ of the things that I have spoken about.

Just as Rajan was late to the discussion and pointed out that they also
needed a custom rate limiter a while back, there may be others - either
not on this mailing list, or yet to look into rate limiting altogether -
who may find that whatever we have built/modified/improved is still
lacking.


>
> Let's keep the focus on improving the rate limiting in Pulsar core.
> 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Girish Sharma
Hello Lari,
I've now gone through a bunch of rate limiting algorithms, along with dual
rate, dual bucket algorithm.
Reply inline.

On Tue, Nov 7, 2023 at 11:32 PM Lari Hotari  wrote:

>
> Bucket4J documentation gives some good ideas and it's shows how the
> token bucket algorithm could be varied. For example, the "Refill
> styles" section [1] is useful to read as an inspiration.
> In network routers, there's a concept of "dual token bucket"
> algorithms and by googling you can find both Cisco and Juniper
> documentation referencing this.
> I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
>
> 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
>
>
While the dual-rate dual token bucket looks promising, there is still a
challenge with respect to allowing a certain peak burst for up to a longer
duration. I explain it below:

Assume a 10MBps topic, with bursting support of 1.5x for up to 2 minutes,
once every 10-minute interval.

The first bucket has the capacity of the consistent, long-term peak that
the topic would observe, so basically -
`limit.capacity(10_000_000).refillGreedy(1_000_000, ofMillis(100))`. Here,
I am not refilling every millisecond because the topic will never receive
uniform, homogenous traffic at a millisecond level. The pattern would
always be a batch of one or more messages in an instant, then nothing for
the next few instants, then another batch of one or more messages, and so
on.
This first bucket is good enough to handle rate limiting in normal cases
without bursting support. It is also better than the current precise rate
limiter in Pulsar, which allows produce traffic to hotspot within a
second, whereas this one spreads the quota over 100ms time windows.

The second bucket would have to have a slower and less granular refill
rate - think of something like
`limit.capacity(5_000_000).refillGreedy(5_000_000, ofMinutes(10))`.
Now the problem with this approach is that it would allow only 1 second's
worth of additional 5MBps on the topic every 10 minutes.

Suppose I change the refill rate to
`limit.capacity(5_000_000).refillGreedy(5_000_000, ofSeconds(1))` - then it
does allow an additional 5MBps burst every second, but now there is no cap
(the 2-minute cap).

I will have to think this through and do some more math to figure out if
this is possible at all using the dual-rate dual token bucket method.
Inputs welcome.
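To make the challenge concrete, here is a rough Python sketch (hypothetical
names, not Bucket4J) of a dual-rate dual token bucket: a request draws from
the long-term bucket first and spills any excess into the burst bucket, so
the burst bucket's capacity and refill are exactly the knobs being debated
above:

```python
import time

class Bucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def refill(self, now):
        # Lazy refill from elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

class DualRateLimiter:
    """The long bucket enforces the 10MBps average; the burst bucket
    holds an extra 5MB budget refilled over 10 minutes (600s)."""

    def __init__(self):
        self.long = Bucket(rate=10_000_000, capacity=10_000_000)
        self.burst = Bucket(rate=5_000_000 / 600, capacity=5_000_000)

    def try_acquire(self, nbytes):
        now = time.monotonic()
        self.long.refill(now)
        self.burst.refill(now)
        if self.long.tokens >= nbytes:
            self.long.tokens -= nbytes
            return True
        # Spill the overflow into the burst bucket.
        overflow = nbytes - self.long.tokens
        if self.burst.tokens >= overflow:
            self.long.tokens = 0
            self.burst.tokens -= overflow
            return True
        return False
```

With the refill spread over 10 minutes the burst budget is 5MB per 10
minutes; with a 1-second refill the burst has no duration cap - which is
the trade-off discussed above.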

>
> >>
> >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> >> metrics via Prometheus. There might be other ways to provide this
> >> information for components that could react to this.
> >> For example, it could a be system topic where these rate limiters emit
> events.
> >>
> >
> > Are there any other system topics than
> `tenent/namespace/__change_events` . While it's an improvement over
> querying metrics, it would still mean one consumer per namespace and would
> form a cyclic dependency - for example, in case a broker is degrading due
> to mis-use of bursting, it might lead to delays in the consumption of the
> event from the __change_events topic.
>
> The number of events is a fairly low volume so the possible
> degradation wouldn't necessarily impact the event communications for
> this purpose. I'm not sure if there are currently any public event
> topics in Pulsar that are supported in a way that an application could
> directly read from the topic.  However since this is an advanced case,
> I would assume that we don't need to focus on this in the first phase
> of improving rate limiters.
>
>
While the number of events in the system topic would be fairly low per
namespace, the issue is that this system topic lies on the same broker
where the actual topic/partitions exist and those partitions are leading to
degradation of this particular broker.



> I agree. It's better to exclude this from the discussion at the moment
> so that it doesn't cause confusion. Modifying the rate limits up and
> down automatically with some component could be considered out of
> scope in the first phase. However, that might be controversial since
> the externally observable behavior of rate limiting with bursting
> support seems to be behave in a way where the rate limit changes
> automatically. The "auto scaling" aspect of rate limiting, modifying
> the rate limits up and down, might be necessary as part of a rate
> limiter implementation eventually. More of that later.
>
Not to drag this topic out in this discussion, but doesn't this basically
mean that there is no "rate limiting" at all? Basically as good as setting
it to `-1` and just letting the hardware handle it as best it can.


> >
> > The desired behavior in all three situations is to have a multiplier
> based bursting capability for a fixed duration. For example, it could be
> that a pulsar topic would be able to support 1.5x of the set quota for a
> burst duration of up to 5 minutes. There also needs to be a cooldown period
> 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Lari Hotari
Hi Girish,

Replies inline.

> > The current state of rate limiting is not acceptable in Pulsar. We
> > need to fix things in the core.
> >
>
> I wouldn't say that it's not acceptable. The precise one works as expected
> as a basic rate limiter. Its only when there are complex requirements, the
> current rate limiters fail.

Just clarifying: I consider the situation with the default rate limiter
to be suboptimal. The CPU overhead is significant. You must explicitly
enable the "precise" rate limiters to resolve this issue, which is not
very obvious to any Pulsar user. I don't think that this situation
makes sense. If there's an abstraction of a rate limiter, it should be
efficient and usable in the default configuration of Pulsar.

>  The key here being "complex requirement". I am with Rajan here that
> whatever improvements we do to the core built-in rate limiter, would always
> miss one or the other complex requirement.

I haven't yet seen very complex requirements that relate directly to
the rate limiter. The scope could expand to the area of capacity
management, and I'm pretty sure that it gets there when we go further.
Capacity management is a broader concern than rate limiting. We all
know that capacity management is necessary in multi-tenant systems.
Rate limiting and throttling is one way to handle that. When going to
more complex requirements, it might be useful to go beyond rate limiting also in
the conceptual design.

For example DynamoDB has the concept of capacity units (CU).
DynamoDB's conceptual design for capacity management is well described
in a paper "Amazon DynamoDB: A Scalable, Predictably Performant, and
Fully Managed NoSQL Database Service" and a related presentation [1].
There are also other related blog posts such as "Surprising Scalability
of Multitenancy" [2] and "The Road To Serverless: Multi-tenancy" [3]
which have been inspirational to me. The paper "Kora: A Cloud-Native
Event Streaming Platform For Kafka" [4] is also a very useful read to
learn about serverless capacity management.

Capacity management goes beyond rate limiting since it has a tight
relation to end-to-end flow control, load balancing, service levels and
possible auto-scaling solutions. One of the goals of capacity
management in a multi-tenant
system is to address the dreaded "noisy neighbor" problem in a cost
optimal and efficient way.

1 - https://www.usenix.org/conference/atc22/presentation/elhemali
2 - https://brooker.co.za/blog/2023/03/23/economics.html
3 - https://me.0x.me/dbaas3.html
4 - https://www.vldb.org/pvldb/vol16/p3822-povzner.pdf

>
> I feel like we need both the things here. We can work on improving the
> built in rate limiter which does not try to solve all of my needs like
> blacklisting support, limiting the number of bursts in a window etc. The
> built in one can be improved with respect to code refactoring,
> extensibility, modularization etc. along with implementing a more well
> known rate limiting algorithm, like token bucket.
> Along with this, the newly improved interface can now make way for
> pluggable support.

Yes, I agree. Improving the rate limiter and exposing an interface
aren't exclusive.


> This is assuming that we do not improve the default build in rate limiter
> at all. Think of this like AuthorizationProvider - there is a built in one.
> it's good enough, but each organization would have its own requirements wrt
> how they handle authorization, and thus, most likely, any organization with
> well defined AuthN/AuthZ constructs would be plugging their own providers
> there.

I don't think that this is a valid comparison. The current rate
limiter has an explicit "contract". The user sets the maximum rate in
bytes and/or messages and the rate limiter takes care of enforcing
that limit. It's hard to see why that "contract" would have too many
interpretations of what it means.
Another reason is that I haven't seen any other messaging product
where there would be a need to add support for a user-provided rate
limiter algorithm. What makes Pulsar such a special case that it would
need one?
For authentication and authorization, it's a completely different
story. The abstractions require that you pick a specific
implementation for your way of doing authentication and authorization.
Many other systems out there do it in a somewhat similar way as Pulsar.

> The key thing here would be to make the public interface as minimal as
> possible while allowing for custom complex rate limiters as well. I believe
> this is easily doable without actually making the internal code more
> complex.

It's doable, but a different question is whether this is necessary in
the end. We'll see over time how we can improve the Pulsar core rate
limiter and whether there's a need to override it.
The current interfaces will change when the Pulsar core rate limiter
is improved. This work won't easily meet in the middle unless we start
by improving the core rate limiter.

What Pulsar does right now might have to 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-08 Thread Lari Hotari
Hi Girish,

replies inline.

On Thu, 9 Nov 2023 at 00:29, Girish Sharma  wrote:
> While dual-rate dual token bucket looks promising, there is still some
> challenge with respect to allowing a certain peak burst for/up to a bigger
> duration. I am explaining it below:

> Assume a 10MBps topic. Bursting support of 1.5x upto 2 minutes, once every
> 10 minute interval.

It's possible to have many ways to model a dual token buckets.
When there are tokens in the bucket, they are consumed as fast as
possible. This is why there is a need for the second token bucket
which is used to rate limit the traffic to the absolute maximum rate.
Technically the second bucket rate limits the average rate for a short
time window.

I'd pick the first bucket for handling the 10MB rate.
The capacity of the first bucket would be 15MB * 120 = 1800MB. The fill
would happen in a special way. I'm not sure if Bucket4J has this at all.
So, describing the way of adding tokens to the bucket: the tokens in
the bucket would remain the same when the rate is <10MB. As many
tokens would be added to the bucket as are consumed by the actual
traffic. The leftover tokens (10MB minus the actual rate) would go to a
separate filling bucket that gets poured into the actual bucket every
10 minutes.
This first bucket with this separate "filling bucket" would handle the
bursting up to 1800MB.
The second bucket would solely enforce the 1.5x limit of 15MB rate
with a small capacity bucket which enforces the average rate for a
short time window.
There's one nuance here. Bursting will only be allowed if the average
rate has been lower than 10MBps, so that there are tokens available to
use for the burst.
It would be possible, for example, that 50% of the tokens would be
immediately available and 50% of the tokens are made available in the
"filling bucket" that gets poured into the actual bucket every 10
minutes. Without having some way to earn the burst, I don't think that
there's a reasonable way to make things usable. The 10MB limit
wouldn't have an actual meaning unless it is used to "earn" the
tokens to be used for the burst.
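The "earning" model could be sketched like this (a Python sketch with
hypothetical names; the per-second tick granularity is chosen purely for
illustration): surplus below the configured rate accrues in a pending
bucket and is poured into the burstable balance at a fixed interval:

```python
class EarnedBurstBucket:
    """Unused quota (configured rate minus actual traffic) accrues in a
    pending bucket that is poured into the main bucket at a fixed
    interval, funding later bursts above the configured rate."""

    def __init__(self, rate_per_sec, burst_capacity, pour_interval_sec):
        self.rate = rate_per_sec
        self.capacity = burst_capacity
        self.pour_interval = pour_interval_sec
        self.main = 0        # burstable tokens, earned over time
        self.pending = 0     # surplus waiting for the next pour
        self.ticks = 0

    def tick(self, consumed):
        """Advance one second with `consumed` bytes of actual traffic."""
        self.ticks += 1
        if consumed <= self.rate:
            # Below the configured rate: the surplus is "earned".
            self.pending += self.rate - consumed
        else:
            # Bursting: the excess is paid for with earned tokens.
            self.main -= consumed - self.rate
        if self.ticks % self.pour_interval == 0:
            self.main = min(self.capacity, self.main + self.pending)
            self.pending = 0
```

For example, a topic configured at 10 B/s that runs at 5 B/s for 5 seconds
earns 25 tokens at the next pour, which then pay for a later burst above
the configured rate.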

One other detail of topic publishing throttling in Pulsar is that the
actual throttling happens after the limit has been exceeded.
This is due to the fact that Pulsar's network handling uses Netty
where you cannot block. When using the token bucket concepts,
the tokens are always first consumed and after that there's a chance
to pause message publishing.
In code, you can find this at
https://github.com/apache/pulsar/blob/c0eec1e46edeb46c888fa28f27b199ea7e7a1574/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1794
and digging down from there.

In the current rate limiters in Pulsar, the implementation is not
optimized to how Pulsar uses rate limiting. There's no need to use a
scheduler for adding "permits" as it's called in the current rate
limiter. The new tokens to add can be calculated based on the elapsed
time. Resuming from the blocking state (auto read disabled -> enabled)
requires a scheduler. The duration of the pause should be calculated
based on the rate and the average message size. Because of the nature
of asynchronous rate limiting in Pulsar topic publish throttling, the
token count can go negative, and in practice it does. The calculation of
the pause could also take this into account. The rate limiting will be
very accurate and efficient in this way since the scheduler will only
be needed when the token bucket runs out of tokens and there's really
a need to throttle. This is the change that I would do to the current
implementation in an experiment and see how things behave with the
revisited solution. Eliminating the precise rate limiter and having
just a single rate limiter would be part of this.
I think that the code base has multiple usages of the rate limiter.
The dispatching rate limiting might require some variation. IIRC, Rate
limiting is also used for some other reasons in the code base in a
blocking manner. For example, unloading of bundles is rate limited.
Working on the code base will reveal this.
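The revisited approach above could be sketched roughly as follows
(Python, hypothetical names): tokens are computed lazily from elapsed
time, the balance may go negative because a message is accepted before
throttling kicks in, and the pause duration is derived from the deficit:

```python
import time

class AsyncTokenBucket:
    def __init__(self, rate_per_sec):
        self.rate = rate_per_sec
        self.tokens = rate_per_sec   # allow up to one second's burst
        self.last = time.monotonic()

    def consume(self, n):
        """Consume unconditionally (Netty handlers can't block);
        return True if the channel's auto-read should be disabled."""
        now = time.monotonic()
        # New tokens are calculated from elapsed time - no scheduler.
        self.tokens = min(self.rate,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= n             # the balance may go negative
        return self.tokens < 0

    def pause_seconds(self):
        """How long to keep auto-read disabled to repay the deficit."""
        return max(0.0, -self.tokens / self.rate)
```

A scheduler is then only needed to re-enable auto-read after
`pause_seconds()`, i.e. only when the bucket has actually run out of
tokens.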

> While the number of events in the system topic would be fairly low per
> namespace, the issue is that this system topic lies on the same broker
> where the actual topic/partitions exist and those partitions are leading to
> degradation of this particular broker.

Yes, that could be possible but perhaps unlikely.

> Agreed, there is a challenge, as it's not as straightforward as I've
> demonstrated in the example above.

Yes, it might require some rounds of experimentation. Although in
general, I think it's a fairly straightforward problem to solve, as
long as the requirements can be adapted to some small details that
make sense for bursting in the context of rate limiting. The detail is
that the long-term average rate shouldn't go over the configured rate
even with the bursts. That's why the tokens usable for the burst
should be "earned". I'm not sure if it even is necessary to enforce

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-09 Thread Girish Sharma
Hello Lari, replies inline

On Thu, Nov 9, 2023 at 6:50 AM Lari Hotari  wrote:

> Hi Girish,
>
> replies inline.
>
> On Thu, 9 Nov 2023 at 00:29, Girish Sharma 
> wrote:
> > While dual-rate dual token bucket looks promising, there is still some
> > challenge with respect to allowing a certain peak burst for/up to a
> bigger
> > duration. I am explaining it below:
>
> > Assume a 10MBps topic. Bursting support of 1.5x upto 2 minutes, once
> every
> > 10 minute interval.
>
> It's possible to have many ways to model a dual token buckets.
> When there are tokens in the bucket, they are consumed as fast as
> possible. This is why there is a need for the second token bucket
> which is used to rate limit the traffic to the absolute maximum rate.
> Technically the second bucket rate limits the average rate for a short
> time window.
>
> I'd pick the first bucket for handling the 10MB rate.
> The capacity of the first bucket would be 15MB * 120=1800MB. The fill
> would happen in special way. I'm not sure if Bucket4J has this at all.
> So describing the way of adding tokens to the bucket: the tokens in
> the bucket would remain the same when the rate is <10MB. As many
>

How is this special behavior (tokens in the bucket remaining the same when
the rate is <10MB) achieved? I would assume that, to even figure out that
the rate is less than 10MB, there has to be some counter running?


> tokens would be added to the bucket as are consumed by the actual
> traffic. The left over tokens 10MB - actual rate would go to a
> separate filling bucket that gets poured into the actual bucket every
> 10 minutes.
> This first bucket with this separate "filling bucket" would handle the
> bursting up to 1800MB.
>

But this isn't the requirement? Let's assume that the actual traffic has
been 5MB for a while and this 1800MB-capacity bucket is all filled up now.
What's the real use of that at all?


> The second bucket would solely enforce the 1.5x limit of 15MB rate
> with a small capacity bucket which enforces the average rate for a
> short time window.
> There's one nuance here. The bursting support will only allow bursting
> if the average rate has been lower than 10MBps for the tokens to use
> for the bursting to be usable.
> It would be possible that for example 50% of the tokens would be
> immediately available and 50% of the tokens are made available in the
> "filling bucket" that gets poured into the actual bucket every 10
> minutes. Without having some way to earn the burst, I don't think that
> there's a reasonable way to make things usable. The 10MB limit
>
wouldn't have an actual meaning unless that is used to "earn" the
> tokens to be used for the burst.
>
>
I think this approach to the rate limiter - "earning the right to burst
by letting tokens remain in the bucket (by doing lower than 10MB for a
while)" - doesn't fit well in a messaging use case, whether real-world or
theoretical.
For a 10MB topic, if the actual produce rate has been, say, 5MB for a long
while, this shouldn't give that topic the right to burst to 15MB for as
long as tokens are present. This is purely due to the fact that this will
then start stressing the network and bookie disks.
Imagine 100 such topics with a similar configuration of fixed+burst
limits, all doing way less than the fixed rate for the past couple of
hours. Now that they've earned enough tokens, if they all start bursting,
this will bring down the system, which is probably not capable of
supporting simultaneous peaks of all possible topics.

Now of course we can utilize a broker-level fixed rate limiter to not
allow the overall throughput of the system to go beyond a number, but at
that point all the earning semantics go for a toss anyway, since the
behavior would be unknown with respect to which topics get to go through
with bursting and which are being blocked by the broker-level fixed rate
limiting.

As such, letting topics loose would not sit well with any sort of SLA
guarantees to the end user.

Moreover, contrary to the earned-tokens logic, in reality a topic _should_
be allowed to burst up to the SOP/SLA as soon as produce starts on the
topic. It shouldn't _have_ to wait for tokens to fill up by staying below
the fixed rate for a while before it is allowed to burst. This is because
there is no real benefit or reason not to let the topic do so, as the
hardware is already present and the topic is already provisioned
(partitions, broker spread) accordingly, assuming the burst.

In an algorithmic/academic/literature setting, the token bucket sounds
really promising, but a platform with SLAs to its users would not run
like that.



> In the current rate limiters in Pulsar, the implementation is not
> optimized to how Pulsar uses rate limiting. There's no need to use a
> scheduler for adding "permits" as it's called in the current rate
> limiter. The new tokens to add can be calculated based on the elapsed
> time. Resuming from the blocking state (auto read disabled 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-10 Thread Lari Hotari
Hi Girish,

replies inline.

> > I'd pick the first bucket for handling the 10MB rate.
> > The capacity of the first bucket would be 15MB * 120=1800MB. The fill
> > would happen in special way. I'm not sure if Bucket4J has this at all.
> > So describing the way of adding tokens to the bucket: the tokens in
> > the bucket would remain the same when the rate is <10MB. As many
> >
>
> How is this special behavior (tokens in bucket remaining the same when rate
> is <10MB) achieved? I would assume that to even figure out that the rate is
> less than 10MB, there is some counter going around?

It's possible to be creative in implementing the token bucket algorithm.
New tokens will be added at the configured rate. When the actual
traffic rate is less than the token bucket's token rate,
there will be leftover tokens. One way to implement this is to
calculate the amount of new tokens
before using the tokens. The immediately used tokens are subtracted
from the new tokens, and the leftover tokens are added to the
separate "filling bucket". I explained earlier that this filling
bucket would be poured into the actual bucket every 10 minutes in the
example scenario.

>
>
> > tokens would be added to the bucket as are consumed by the actual
> > traffic. The left over tokens 10MB - actual rate would go to a
> > separate filling bucket that gets poured into the actual bucket every
> > 10 minutes.
> > This first bucket with this separate "filling bucket" would handle the
> > bursting up to 1800MB.
> >
>
> But this isn't the requirement? Let's assume that the actual traffic has
> been 5MB for a while and this 1800MB capacity bucket is all filled up now..
> What's the real use here for that at all?

How would the rate limiter know if 5MB traffic is degraded traffic
which would need to be allowed to burst?

> I think this approach of thinking about rate limiter - "earning the right
> to burst by letting tokens remain into the bucket, (by doing lower than
> 10MB for a while)" doesn't not fit well in a messaging use case in real
> world, or theoretic.

It might feel like it doesn't fit well, but this is how most rate
limiters work. The model works very well in practice.
Without "earning the right to burst", there would have to be some
other way to detect whether there's a need to burst.
The need to burst could be detected by calculating the time that the
message to be sent has been in queues until it's about to be sent.
In other words, from the end-to-end latency.

> For a 10MB topic, if the actual produce has been , say, 5MB for a long
> while, this shouldn't give the right to that topic to burst to 15MB for as
> much as tokens are present.. This is purely due to the fact that this will
> then start stressing the network and bookie disks.

Well, there's always the option to not configure bursting this way, or to
limit the maximum rate of bursting, as is possible in a dual token
bucket implementation.

> Imagine a 100 of such topics going around with similar configuration of
> fixed+burst limits and were doing way lower than the fixed rate for the
> past couple of hours. Now that they've earned enough tokens, if they all
> start bursting, this will bring down the system, which is probably not
> capable of supporting simultaneous peaks of all possible topics at all.

This is the challenge of capacity management and end-to-end flow
control and backpressure.
With proper system wide capacity management and end-to-end back
pressure, the system won't collapse.

As reference, let's take a look at the Confluent Kora paper [1]:
In 5.2.1 Back Pressure and Auto-Tuning:
"Backpressure is achieved via auto-tuning tenant quotas on the broker
such that the combined tenant usage remains below the broker-wide
limit. The tenant quotas are auto-tuned proportionally to their total
quota allocation on the broker. This mechanism ensures fair sharing of
resources among tenants during temporary overload and re-uses the
quota enforcement mechanism for backpressure."
"The broker-wide limits are generally defined by benchmarking brokers
across clouds. The CPU-related limit is unique because there is no
easy way to measure and attribute CPU usage to a tenant. Instead, the
quota is defined as the clock time the broker spends processing
requests and connections, and the safe limit is variable. So to
protect CPU, request backpressure is triggered when request queues
reach a certain threshold."
In 5.2.2 Dynamic Quota Management"
"A straightforward method for distributing tenant-level quotas among
the brokers hosting the tenant is to statically divide the quota
evenly across the brokers. This static approach, which was deployed
initially, worked reasonably well on lower subscribed clusters."...
"Kora addresses this issue by using a dynamic quota mechanism that
adjusts bandwidth distribution based on a tenant’s bandwidth
consumption. This is achieved through the use of a shared quota
service to manage quota distribution, a design similar to that used by
other 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-10 Thread Girish Sharma
Hello Lari, replies inline. It's festive season here so I might be late in
the next reply.


On Fri, Nov 10, 2023 at 4:51 PM Lari Hotari  wrote:

> Hi Girish,
>
> >
> >
> > > tokens would be added to the bucket as are consumed by the actual
> > > traffic. The left over tokens 10MB - actual rate would go to a
> > > separate filling bucket that gets poured into the actual bucket every
> > > 10 minutes.
> > > This first bucket with this separate "filling bucket" would handle the
> > > bursting up to 1800MB.
> > >
> >
> > But this isn't the requirement? Let's assume that the actual traffic has
> > been 5MB for a while and this 1800MB capacity bucket is all filled up
> now..
> > What's the real use here for that at all?
>
> How would the rate limiter know if 5MB traffic is degraded traffic
> which would need to be allowed to burst?
>

That's not what I was implying. I was trying to question the need for
1800MB worth of capacity. I am assuming this is to allow a 2-minute burst
at 15MBps? But isn't this bucket only taking care of the delta beyond
10MBps? Moreover, once the 2 minutes have elapsed, which bucket ensures
that the rate is only allowed to go up to 10MBps?


> > I think this approach of thinking about rate limiter - "earning the right
> > to burst by letting tokens remain into the bucket, (by doing lower than
> > 10MB for a while)" doesn't not fit well in a messaging use case in real
> > world, or theoretic.
>
> It might feel like it doesn't fit well, but this is how most rate
> limiters work. The model works very well in practice.
> Without "earning the right to burst", there would have to be some
> other way to detect whether there's a need to burst.
>

Why does the rate limiter need to decide if there is a need to burst? That
is dependent on the incoming message rate. Since the rate limiter has no
knowledge of what is going to happen in the future, it cannot assume that
the messages beyond a fixed rate (10MB in our example) can be held/paused
until the next second - thus deciding that there is no need to burst
right now.

While I understand that the earning-the-right-to-burst model works well in
practice, another way to think about this is that the bucket, initially,
starts filled. Moreover, the tokens here are being filled into the bucket
by the available disk, CPU and network bandwidth. The general approach
where the tokens initially have to be earned might be helpful to tackle
cold starts, but beyond that, a topic doing 5MBps to accumulate enough
tokens to burst for a few minutes in the future doesn't really translate
to the physical world, does it? The 1:1 translation here is to the SSD
where the topic's data is actually being written - so my bursting up to
15MBps doesn't really depend on the fact that I was only doing 5MBps in
the last few minutes (and thus accumulating the remaining 5MBps worth of
tokens towards the burst), now does it? The SSD won't gain the ability to
allow a burst just because there was low throughput in the last few
minutes. Not even from a space POV.



> The need to burst could be detected by calculating the time that the
> message to be sent has been in queues until it's about to be sent.
> In other words, from the end-to-end latency.
>

In practice, this level of cross-component coordination would never result
in a responsive and spontaneous system, unless each message comes along
with an "I waited in queue this long" timestamp, and on the server side we
now read the netty channel, parse the message and figure this out before
checking for rate limiting - at which point rate limiting isn't really
doing anything, since the broker is reading every message anyway. This may
also require prioritization of messages, as some messages from a different
producer to the same partition may have waited longer than others. This is
out of scope for a rate limiter at this point.

> Imagine a 100 of such topics going around with similar configuration of
> > fixed+burst limits and were doing way lower than the fixed rate for the
> > past couple of hours. Now that they've earned enough tokens, if they all
> > start bursting, this will bring down the system, which is probably not
> > capable of supporting simultaneous peaks of all possible topics at all.
>
> This is the challenge of capacity management and end-to-end flow
> control and backpressure.
>

A rate limiter is a major facilitator and guardian of capacity planning, so
it does relate here.


> With proper system wide capacity management and end-to-end back
> pressure, the system won't collapse.
>

At no point am I trying to go beyond the purview of a single broker here.
Unless, by system, you meant a single broker itself, for which I talk
about a broker-level rate limiter further below in the example.


> In Pulsar, we have "PIP 82: Tenant and namespace level rate limiting"
> [4] which introduced the "resource group" concept. There is the
> resourcegroup resource in the Admin REST API [5]. There's also a
> resource 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-11 Thread Lari Hotari
Hi Girish,

replies inline.

> Hello Lari, replies inline. It's festive season here so I might be late in
> the next reply.

I'll have limited availability next week so possibly not replying
until the following week. We have the next Pulsar community meeting on
November 23rd, so let's wrap up all of this preparation by then.

> > How would the rate limiter know if 5MB traffic is degraded traffic
> > which would need to be allowed to burst?
> >
>
> That's not what I was implying. I was trying to question the need for
> 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst
> of 15MBps? But isn't this bucket only taking care of the delta beyond 10MBps?
> Moreover, once 2 minutes are elapsed, which bucket is ensuring that the
> rate is only allowed to go upto 10MBps?

I guess there are multiple ways to model things. I was thinking of a
model where the first bucket (and the "filling bucket" to add "earned"
tokens with some interval) handles all logic related to the average
rate limiting of 10MBps, including the bursting capacity which a token
bucket inherently has. As we know, the token bucket doesn't have a
rate limit when there are available tokens in the bucket. That's the
reason why there's the separate independent bucket with a relatively
short token capacity to enforce the maximum rate limit of 15MBps.
There need to be tokens in both buckets for traffic to flow freely.
It's like an AND and not an OR (in literature there are also examples
of this where tokens are taken from another bucket when the main one
runs out).
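To make the AND semantics concrete, here is a minimal illustrative sketch in Python (not Pulsar code; all class and parameter names are invented for this example, and the rates mirror the 10MBps average / 15MBps burst numbers from this thread):

```python
import time

class TokenBucket:
    """Basic token bucket: refills continuously at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so there is no cold-start penalty
        self.last = now if now is not None else time.monotonic()

    def refill(self, now):
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

class DualTokenBucket:
    """AND rule: a message is admitted only if BOTH buckets hold enough tokens.

    - `average`: large capacity, low rate -> enforces the long-term average
      and lets "earned" tokens accumulate for bursting.
    - `burst`: small capacity, higher rate -> caps the instantaneous rate
      even when the average bucket has plenty of saved-up tokens.
    """

    def __init__(self, average, burst):
        self.average = average
        self.burst = burst

    def try_acquire(self, n, now=None):
        now = now if now is not None else time.monotonic()
        self.average.refill(now)
        self.burst.refill(now)
        # Only consume when both buckets would admit the message (the AND).
        if self.average.tokens >= n and self.burst.tokens >= n:
            self.average.tokens -= n
            self.burst.tokens -= n
            return True
        return False
```

With, say, `average = TokenBucket(rate=10, capacity=100)` and `burst = TokenBucket(rate=15, capacity=15)` (units in MB and seconds), traffic can exceed 10MBps only while the average bucket still holds accumulated tokens, and can never sustain more than roughly 15MBps because the burst bucket runs dry.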

> > Without "earning the right to burst", there would have to be some
> > other way to detect whether there's a need to burst.
>
> Why does the rate limiter need to decide if there is a need to burst? It is
> dependent on the incoming message rate. Since the rate limiter has no
> knowledge of what is going to happen in future, it cannot assume that the
> messages beyond a fixed rate (10MB in our example) can be held/paused until
> next second - thus deciding that there is no need to burst right now.

To explain this, I'll first take a quote from your email about the
bursting use cases, sent on Nov 7th:
> Adding to what I wrote above, think of this pattern like the following: the 
> produce rate slowly increases from ~2MBps at around 4 AM to a known peak of 
> about 30MBps by 4 PM and
> stays around that peak until 9 PM after which is again starts decreasing 
> until it reaches ~2MBps around 2 AM.
> Now, due to some external triggers, maybe a scheduled sale event, at 10PM, 
> the quota may spike up to 40MBps for 4-5 minutes and then again go back down 
> to the usual ~20MBps .

Let's observe these cases.
(When we get further, it will be helpful to have a set of well defined
representative concrete use case / scenario descriptions with concrete
examples of expected behavior. Failure / chaos scenarios and
operations (such as software upgrade) scenarios should also be
covered.)

> Why does the rate limiter need to decide if there is a need to burst? It is
> dependent on the incoming message rate. Since the rate limiter has no
This is a very good question. When we look at these examples, it feels
that we'd need to define what the actual rules for the bursting are
and how bursting is decided.
It seems that there are two type of ways: "earning the right to burst"
with the token bucket so that bursting could happen whenever there are
extra tokens in the bucket; or having a way to detect that there's a
need to burst. That relates directly to SLAs. If there's an SLA where
message processing end-to-end latencies must be low, that usually
would be violated if large backlogs pile up. I think you said
something about avoiding getting into this at all. I agree that this
would add complexity, but for the sake of explaining the possible
cases for the need for bursting, I think that it's necessary to cover
the end-to-end aspects.


> While I understand that earning the right to burst model works well in
> practice, another approach to think about this is that the bucket,
> initially, starts filled. Moreover, the tokens here are being filled into
> the bucket due to the available disk, cpu and network bandwidth.. The

I completely agree. If it's fine to rely on the token bucket for
bursting, this simplifies a lot of things. As you are mentioning,
there could be multiple ways to fill/add tokens to the bucket.

> general approach of where the tokens are initially required to be earned
> might be helpful to tackle cold starts, but beyond that, a topic doing
> 5MBps to accumulate enough tokens to burst for a few minutes in the future
> doesn't really translate to the physical world, does it? The 1:1
> translation here is to an SSD where the topic's data is actually being
> written to - So my bursting upto 15MBps doesn't really depend on the fact
> that I was only doing 5 MBps in the last few minutes (and thus,
> accumulating the remaining 5MBps worth tokens towards the burst) - now does
> it? 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-11 Thread Girish Sharma
One final reply from me before the holidays :)


On Sat, Nov 11, 2023 at 4:00 PM Lari Hotari  wrote:

> Hi Girish,
>
> replies inline.
>
> > Hello Lari, replies inline. It's festive season here so I might be late
> in
> > the next reply.
>
> I'll have limited availability next week so possibly not replying
> until the following week. We have the next Pulsar community meeting on
> November 23rd, so let's wrap up all of this preparation by then.
>

Sounds good.


>
> > > How would the rate limiter know if 5MB traffic is degraded traffic
> > > which would need to be allowed to burst?
> > >
> >
> > That's not what I was implying. I was trying to question the need for
> > 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst
> > of 15MBps? But isn't this bucket only taking care of the delta beyond
> 10MBps?
> > Moreover, once 2 minutes are elapsed, which bucket is ensuring that the
> > rate is only allowed to go upto 10MBps?
>
> I guess there are multiple ways to model things. I was thinking of a
> model where the first bucket (and the "filling bucket" to add "earned"
> tokens with some interval) handles all logic related to the average
> rate limiting of 10MBps, including the bursting capacity which a token
> bucket inherently has. As we know, the token bucket doesn't have a
> rate limit when there are available tokens in the bucket. That's the
>

Actually, the capacity is meant to simulate that particular rate limit. If
we have 2 buckets anyway, the one managing the fixed rate limit part
shouldn't generally have a capacity more than the fixed rate, right?



> reason why there's the separate independent bucket with a relatively
> short token capacity to enforce the maximum rate limit of 15MBps.
> There need to be tokens in both buckets for traffic to flow freely.
> It's like an AND and not an OR (in literature there are also examples
> of this where tokens are taken from another bucket when the main one
> runs out).
>

I think it can be done, especially with that one thing you mentioned about
holding off filling the second bucket for 10 minutes.. but it does become
quite complicated in terms of managing the flow of the tokens.. because
while we only fill the second bucket once every 10 minutes, after the 10th
minute, it needs to be filled continuously for a while (the duration we
want to support the bursting for).. and the capacity of this second bucket
also is governed by and exactly matches the burst value.
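The token flow being debated here could be pictured roughly like this (an illustrative sketch only, with invented names; a burst bucket whose capacity equals the burst volume and which is topped up in discrete chunks once per `interval` rather than continuously, using the 10-minute figure from this thread):

```python
class PeriodicFillBucket:
    """Burst bucket refilled in discrete chunks once per `interval` seconds.

    Capacity equals the allowed burst volume, so at most one burst window's
    worth of tokens can ever be saved up, no matter how long the topic
    stays below its fixed rate.
    """

    def __init__(self, fill_amount, interval, capacity, now=0.0):
        self.fill_amount = fill_amount
        self.interval = interval
        self.capacity = capacity
        self.tokens = capacity
        self.last_fill = now

    def refill(self, now):
        # Add one chunk per fully elapsed interval (discrete, not continuous).
        elapsed = int((now - self.last_fill) // self.interval)
        if elapsed > 0:
            self.tokens = min(self.capacity, self.tokens + elapsed * self.fill_amount)
            self.last_fill += elapsed * self.interval

    def try_acquire(self, n, now):
        self.refill(now)
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

With `fill_amount == capacity` and a 600-second interval, once a burst drains the bucket no further bursting is possible until the next 10-minute boundary, which is the behavior under discussion.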


> > general approach of where the tokens are initially required to be earned
> > might be helpful to tackle cold starts, but beyond that, a topic doing
> > 5MBps to accumulate enough tokens to burst for a few minutes in the
> future
> > doesn't really translate to the physical world, does it? The 1:1
> > translation here is to an SSD where the topic's data is actually being
> > written to - So my bursting upto 15MBps doesn't really depend on the fact
> > that I was only doing 5 MBps in the last few minutes (and thus,
> > accumulating the remaining 5MBps worth tokens towards the burst) - now
> does
> > it? The SSD won't really gain the ability to allow for burst just because
> > there was low throughput in the last few minutes. Not even from a space
> POV
> > either.
>
> This example of the SSD isn't concrete in terms of Pulsar and
> Bookkeeper. The capacity of a single SSD should and must be
> significantly higher than the maximum write throughput of a single
> topic. Therefore a single SSD shouldn't be a bottleneck. There will
>

Agreed that it is much higher than a single topic's max throughput, but
the context of my example had multiple topics lying on the same
broker/bookie ensemble bursting together at the same time because they had
been saving up on tokens in the bucket.

> always be a need to overprovision resources. You usually don't want to
> go beyond 60% or 70% utilization on disk, cpu or network resources so
> that queues in the system don't start to increase and impacting
> latencies. In Pulsar/Bookkeeper, the storage solution has a very
> effective load balancing, especially for writing. In Bookkeeper each
> ledger (the segment) of a topic selects the "ensemble" and the "write
> quorum", the set of bookies to write to, when the ledger is opened.
> The bookkeeper client could also change the ensemble in the middle of
> a ledger due to some event like a bookie becoming read-only or
>

While it does do that on complete failure of bookie or a bookie disk, or
broker going down, degradations aren't handled this well. So if all topics
in a bookie are bursting due to the fact that they had accumulated tokens,
then all it will lead to is breach of write latency SLA because at one
point, the disks/cpu/network etc will start choking. (even after
considering the 70% utilization i.e. 30% buffer)


> > now read the netty channel, parse the message and figure this out before
> checking for rate limiting. At which point, rate limiting isn't really
> > doing anything if the broker 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-20 Thread Lari Hotari
Hi Girish,

replies inline and after that there are some updates about my
preparation for the community meeting on Thursday. (there's
https://github.com/lhotari/async-tokenbucket with a PoC for a
low-level high performance token bucket implementation)

On Sat, 11 Nov 2023 at 17:25, Girish Sharma  wrote:
> Actually, the capacity is meant to simulate that particular rate limit. if
> we have 2 buckets anyways, the one managing the fixed rate limit part
> shouldn't generally have a capacity more than the fixed rate, right?

There are multiple ways to model and understand a dual token bucket
implementation.
I view the 2 buckets in a dual token bucket implementation as separate
buckets. They are like an AND rule, so if either bucket is empty,
there will be a need to pause to wait for new tokens.
Since we aren't working with code yet, these comments could be out of context.

> I think it can be done, especially with that one thing you mentioned about
> holding off filling the second bucket for 10 minutes.. but it does become
> quite complicated in terms of managing the flow of the tokens.. because
> while we only fill the second bucket once every 10 minutes, after the 10th
> minute, it needs to be filled continuously for a while (the duration we
> want to support the bursting for).. and the capacity of this second bucket
> also is governed by and exactly matches the burst value.

There might not be a need for this complexity of the "filling bucket"
in the first place. It was more of a demonstration that it's possible
to implement the desired behavior of limited bursting by tweaking the
basic token bucket algorithm slightly.
I'd rather avoid this additional complexity.

> Agreed that it is much higher than a single topic's max throughput.. but
> the context of my example had multiple topics lying on the same
> broker/bookie ensemble bursting together at the same time because they had
> been saving up on tokens in the bucket.

Yes, that makes sense.

> > always be a need to overprovision resources. You usually don't want to
> > go beyond 60% or 70% utilization on disk, cpu or network resources so
> > that queues in the system don't start to increase and impacting
> > latencies. In Pulsar/Bookkeeper, the storage solution has a very
> > effective load balancing, especially for writing. In Bookkeeper each
> > ledger (the segment) of a topic selects the "ensemble" and the "write
> > quorum", the set of bookies to write to, when the ledger is opened.
> > The bookkeeper client could also change the ensemble in the middle of
> > a ledger due to some event like a bookie becoming read-only or
> >
>
> While it does do that on complete failure of bookie or a bookie disk, or
> broker going down, degradations aren't handled this well. So if all topics
> in a bookie are bursting due to the fact that they had accumulated tokens,
> then all it will lead to is breach of write latency SLA because at one
> point, the disks/cpu/network etc will start choking. (even after
> considering the 70% utilization i.e. 30% buffer)

Yes.

> That's only in the case of the default rate limiter where the tryAcquire
> isn't even implemented.. since the default rate limiter checks for breach
> only at a fixed rate rather than before every produce call. But in case of
> precise rate limiter, the response of `tryAcquire` is respected.

This is one of many reasons why I think it's better to improve the
maintainability of the current solution and remove the unnecessary
options between "precise" and the default one.
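For readers following along, the behavioral difference referred to here can be sketched as follows (a simplified illustration with invented names, not the actual Pulsar classes): the polling-based limiter admits everything and only reacts when a periodic task notices the budget was exceeded, while the precise limiter rejects inline as soon as the period's budget is spent:

```python
class PollingRateLimiter:
    """Counts usage but never rejects inline; a periodic task (simulated by
    `tick`) notices the breach after the fact and pauses further reads."""

    def __init__(self, permits_per_period):
        self.limit = permits_per_period
        self.acquired = 0
        self.paused = False

    def try_acquire(self, n=1):
        self.acquired += n  # always admitted; overshoot within a period is possible
        return True

    def tick(self):  # runs once per period in the real system
        self.paused = self.acquired > self.limit
        self.acquired = 0

class PreciseRateLimiter:
    """Gates every call: rejects as soon as the period's budget is spent."""

    def __init__(self, permits_per_period):
        self.limit = permits_per_period
        self.acquired = 0

    def try_acquire(self, n=1):
        if self.acquired + n > self.limit:
            return False
        self.acquired += n
        return True

    def tick(self):
        self.acquired = 0
```

The polling variant lets a full period's worth of messages through before anything reacts; the precise variant's `try_acquire` return value is what the produce path actually respects.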

True, and actually, due to the fact that Pulsar auto-distributes topics
based on load shedding parameters, we can focus on a single broker or a
single bookie ensemble and assume that it works as we scale it. Of course,
this means putting reasonable cpu/network/partition/throughput limits at
each broker level, and Pulsar provides ways to do that automatically.

We do have plans to improve the Pulsar so that things would simply
work properly under heavy load. Optimally, things would work without
the need to tune and tweak the system rigorously. These improvements
go beyond rate limiting.

> While I have shared the core requirements over these threads (fixed rate +
> burst multiplier for upto X duration every Y minutes).. We are finalizing
> the detailed requirements internally to present. As I replied in my previous
> mail, one outcome of detailed internal discussion was the discovery of
> throughput contention.

Sounds good. Sharing your experiences is extremely valuable.

> We do use resource groups for certain namespace level quotas, but even in
> our use case, rate limiter and resource groups are two separate tangents.
> At least for foreseeable future.

At this point they are separate, but there is a need to improve. Jack
Vanlightly's blog post series "The Architecture Of Serverless Data
Systems" [1] explains the competition in this area very well.
Multi-tenant capacity management and SLAs are at 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-22 Thread Lari Hotari
I have written a long blog post that contains the context, the summary
of my view point about PIP-310 and the proposal for proceeding:
https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html

Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's
coordinate on Pulsar Slack's #dev channel if there are issues in joining
the meeting.
See you tomorrow!

-Lari

1 - https://github.com/apache/pulsar/wiki/Community-Meetings


Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-11-23 Thread Girish Sharma
I've captured our requirements in detail in this document -
https://docs.google.com/document/d/1-y5nBaC9QuAUHKUGMVVe4By-SmMZIL4w09U1byJBbMc/edit
Added it to agenda document as well. Will join the meeting and discuss.

Regards

On Wed, Nov 22, 2023 at 10:49 PM Lari Hotari  wrote:

> I have written a long blog post that contains the context, the summary
> of my view point about PIP-310 and the proposal for proceeding:
>
> https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html
>
> Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's
> coordinate on Pulsar Slack's #dev channel if there are issues in joining
> the meeting.
> See you tomorrow!
>
> -Lari
>
> 1 - https://github.com/apache/pulsar/wiki/Community-Meetings

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

2023-12-15 Thread Girish Sharma
Closing this discussion thread and the PIP. Apart from the discussion
present in this thread, I presented the detailed requirements in a dev
meeting on 23rd November, and the conclusion was that we will go ahead and
implement the requirements in Pulsar itself.
There was a prerequisite of refactoring the rate limiter codebase, which is
already covered by Lari in PIP-322.

I will be creating a new parent PIP soon about the high level requirements.

Thank you everyone who participated in the thread and the discussion on
23rd dev meeting.

Regards

On Thu, Nov 23, 2023 at 8:26 PM Girish Sharma 
wrote:

> I've captured our requirements in detail in this document -
> https://docs.google.com/document/d/1-y5nBaC9QuAUHKUGMVVe4By-SmMZIL4w09U1byJBbMc/edit
> Added it to agenda document as well. Will join the meeting and discuss.
>
> Regards
>
> On Wed, Nov 22, 2023 at 10:49 PM Lari Hotari  wrote:
>
>> I have written a long blog post that contains the context, the summary
>> of my view point about PIP-310 and the proposal for proceeding:
>>
>> https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html
>>
>> Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's
>> coordinate on Pulsar Slack's #dev channel if there are issues in joining
>> the meeting.
>> See you tomorrow!
>>
>> -Lari
>>
>> 1 - https://github.com/apache/pulsar/wiki/Community-Meetings