One final reply from me before the holidays :)

On Sat, Nov 11, 2023 at 4:00 PM Lari Hotari <l...@hotari.net> wrote:

> Hi Girish,
>
> replies inline.
>
> > Hello Lari, replies inline. It's the festive season here so I might be
> > late in the next reply.
>
> I'll have limited availability next week so possibly not replying
> until the following week. We have the next Pulsar community meeting on
> November 23rd, so let's wrap up all of this preparation by then.
>

Sounds good.


>
> > > How would the rate limiter know if 5MB traffic is degraded traffic
> > > which would need to be allowed to burst?
> > >
> >
> > That's not what I was implying. I was trying to question the need for
> > 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst
> > of 15MBps? But isn't this bucket only taking care of the delta beyond
> > 10MBps? Moreover, once the 2 minutes have elapsed, which bucket is
> > ensuring that the rate is only allowed to go up to 10MBps?
>
> I guess there are multiple ways to model things. I was thinking of a
> model where the first bucket (and the "filling bucket" to add "earned"
> tokens with some interval) handles all logic related to the average
> rate limiting of 10MBps, including the bursting capacity which a token
> bucket inherently has. As we know, the token bucket doesn't have a
> rate limit when there are available tokens in the bucket. That's the
>

Actually, the capacity is meant to simulate that particular rate limit. If
we have 2 buckets anyway, the one managing the fixed rate limit part
shouldn't generally have a capacity greater than the fixed rate, right?



> reason why there's the separate independent bucket with a relatively
> short token capacity to enforce the maximum rate limit of 15MBps.
> There need to be tokens in both buckets for traffic to flow freely.
> It's like an AND and not an OR (in the literature there are also examples
> of this where tokens are taken from another bucket when the main one
> runs out).
>

I think it can be done, especially with the approach you mentioned of
holding off on filling the second bucket for 10 minutes. But it does become
quite complicated in terms of managing the flow of tokens: while we only
fill the second bucket once every 10 minutes, after the 10th minute it
needs to be filled continuously for a while (the duration we want to
support the bursting for), and the capacity of this second bucket is also
governed by, and exactly matches, the burst value.
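To make sure we're talking about the same model, here's a minimal sketch of
the dual-bucket idea as I currently understand it. This is plain Java; all
names and numbers are mine and purely illustrative, not Pulsar internals.
The first bucket handles the 10MBps average rate (with a large capacity
that enables bursting), the second enforces the 15MBps ceiling with roughly
one second's worth of capacity, and a produce only passes when BOTH buckets
have tokens (the AND semantics you described):

    // Hypothetical sketch of the dual token bucket model; not Pulsar code.
    public class DualTokenBucket {
        // Average-rate bucket: 10 MB/s refill, large capacity for bursting.
        private final TokenBucket avgBucket =
                new TokenBucket(10L * 1024 * 1024, 1800L * 1024 * 1024);
        // Ceiling bucket: 15 MB/s refill, ~1 second's worth of capacity.
        private final TokenBucket maxBucket =
                new TokenBucket(15L * 1024 * 1024, 15L * 1024 * 1024);

        public synchronized boolean tryAcquire(long bytes) {
            long now = System.nanoTime();
            avgBucket.refill(now);
            maxBucket.refill(now);
            // AND semantics: tokens must be available in both buckets,
            // otherwise either the average or the ceiling could be exceeded.
            if (avgBucket.tokens >= bytes && maxBucket.tokens >= bytes) {
                avgBucket.tokens -= bytes;
                maxBucket.tokens -= bytes;
                return true;
            }
            return false;
        }

        static class TokenBucket {
            final long ratePerSecond;
            final long capacity;
            long tokens;
            long lastRefillNanos = System.nanoTime();

            TokenBucket(long ratePerSecond, long capacity) {
                this.ratePerSecond = ratePerSecond;
                this.capacity = capacity;
                this.tokens = capacity; // starts full; a cold-start variant starts at 0
            }

            void refill(long nowNanos) {
                long elapsedMillis = (nowNanos - lastRefillNanos) / 1_000_000L;
                long earned = elapsedMillis * ratePerSecond / 1000L;
                if (earned > 0) {
                    // Sub-millisecond remainders are dropped for simplicity.
                    tokens = Math.min(capacity, tokens + earned);
                    lastRefillNanos = nowNanos;
                }
            }
        }
    }

The variant you described - holding off on refilling the second bucket for
10 minutes and then refilling it continuously for the burst duration -
would replace maxBucket's simple refill() with a scheduled task, which is
exactly where I see the token-flow management getting complicated.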


> > general approach where the tokens are initially required to be earned
> > might be helpful to tackle cold starts, but beyond that, a topic doing
> > 5MBps to accumulate enough tokens to burst for a few minutes in the
> > future doesn't really translate to the physical world, does it? The 1:1
> > translation here is to the SSD to which the topic's data is actually
> > being written. My bursting up to 15MBps doesn't really depend on the fact
> > that I was only doing 5MBps in the last few minutes (and thus
> > accumulating the remaining 5MBps worth of tokens towards the burst), does
> > it? The SSD won't really gain the ability to allow a burst just because
> > there was low throughput in the last few minutes. Not even from a space
> > POV.
>
> This example of the SSD isn't concrete in terms of Pulsar and
> Bookkeeper. The capacity of a single SSD should and must be
> significantly higher than the maximum write throughput of a single
> topic. Therefore a single SSD shouldn't be a bottleneck. There will
>

Agreed that it is much higher than a single topic's max throughput, but the
context of my example had multiple topics residing on the same
broker/bookie ensemble bursting together at the same time because they had
been saving up tokens in the bucket.

> always be a need to overprovision resources. You usually don't want to
> go beyond 60% or 70% utilization on disk, CPU or network resources so
> that queues in the system don't start to grow and impact
> latencies. In Pulsar/Bookkeeper, the storage solution has a very
> effective load balancing, especially for writing. In Bookkeeper each
> ledger (the segment) of a topic selects the "ensemble" and the "write
> quorum", the set of bookies to write to, when the ledger is opened.
> The bookkeeper client could also change the ensemble in the middle of
> a ledger due to some event like a bookie becoming read-only or
>

While it does do that on a complete failure of a bookie, a bookie disk, or
a broker going down, degradations aren't handled as well. So if all topics
on a bookie are bursting because they had accumulated tokens, all it will
lead to is a breach of the write latency SLA, because at some point the
disks/CPU/network etc. will start choking (even after considering the 70%
utilization, i.e. the 30% buffer).


> > now read the netty channel, parse the message and figure this out before
> > checking for rate limiting. At which point, rate limiting isn't really
> > doing anything if the broker is reading every message anyway. This may
> > also
>
> This is how it works currently: the message is read, the rate limiter
> is updated and the message is sent regardless of the result of the
> rate limiter "tryAcquire" call. In asynchronous flows this makes
>

That's only the case for the default rate limiter, where tryAcquire isn't
even implemented, since the default rate limiter checks for a breach only
at a fixed interval rather than before every produce call. But in the case
of the precise rate limiter, the response of `tryAcquire` is respected.
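To spell out the difference in pseudo-Java (a hypothetical sketch of the
flow, reusing the DualTokenBucket sketch from earlier; the names are mine,
not the actual Pulsar code paths):

    // Hypothetical sketch contrasting the two produce paths; not Pulsar code.
    public class ProducePathSketch {
        private final DualTokenBucket rateLimiter = new DualTokenBucket();
        private final boolean precise;
        private long pendingUsage; // tallied for the periodic breach check

        public ProducePathSketch(boolean precise) {
            this.precise = precise;
        }

        // Returns true if the produce may proceed.
        public boolean onProduceRequest(long msgSizeBytes) {
            if (precise) {
                // Precise limiter: tryAcquire gates every single produce call.
                return rateLimiter.tryAcquire(msgSizeBytes);
            }
            // Default limiter: usage is only recorded; a periodic task (not
            // shown) later compares pendingUsage against the limit and pauses
            // the channel, so this particular call always succeeds.
            pendingUsage += msgSizeBytes;
            return true;
        }
    }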


>
> > At no point am I trying to go beyond the purview of a single broker here.
> > Unless, by system, you meant a single broker itself. For that, I talk
> > about a broker level rate limiter further below in the example.
>
> My point was that there's a need to take a holistic approach to
> capacity management. The Pulsar system is a distributed system and the
> storage system is separate, unlike storage on Kafka. There isn't an
> equivalent of the Pulsar load balancing in Kafka. In the Kafka world,
> you can calculate the capacity of a single broker. In Pulsar, the
> model is very different.
> If a broker is overloaded, the Pulsar load balancer can quickly shed
> load from that broker. I feel that focusing on a single broker and
>

True, and actually, because Pulsar auto-distributes topics based on load
shedding parameters, we can focus on a single broker or a single bookie
ensemble and assume that the approach holds as we scale it out. Of course,
this means putting reasonable limits on CPU/network/partition
count/throughput at each broker level, and Pulsar provides ways to do that
automatically.


> protecting the resources on it won't be helpful in the end. That's not
> to say that things cannot be improved from a single broker
> perspective. One essential role of rate limiters in capacity management is
> to prevent a single resource from becoming overloaded. The Pulsar load
> balancer / load manager plays a key role in Pulsar.
>
> The other important role of rate limiters is in managing end
> applications' expectations of the service level. I'm not sure if someone
> remembers the example from the Google SRE book about a too-good
> service level which caused a lot of problems and outages when there
> finally was a service level degradation. The problem was that
> applications got coupled to the behavior that there aren't any
> disruptions in the service. IIRC, the solution was to inject failures
> into production periodically to make the service consumers get used to
> service disruptions and design their applications to have the required
> resilience. I think it was Netflix that introduced Chaos Monkey
> and also ran it in production to ensure that service consumers are
> sufficiently resilient to service disruptions.
>
> In the context of Pulsar, it is possible that messaging applications
> get coupled to and depend on a service level that isn't well defined.
> From this perspective, the key role of rate limiters is to ensure that
> the service provider and service consumer have a way to express the
> service level objective (SLO) in the system. If the rate isn't capped,
> there's the risk that service consumers become dependent on very high
> rates which might only be achievable in a highly overprovisioned
> system.
>

+1 All platforms should induce chaos from time to time and have SLAs that
make sense rather than trying to be too good to be true.


> I wonder if you would be interested in sharing your requirements or
> context around SLOs and how you see the relationship to rate limiters?
>

While I have shared the core requirements over these threads (fixed rate +
burst multiplier for up to X duration every Y minutes), we are finalizing
the detailed requirements internally to present. As I replied in my
previous mail, one outcome of the detailed internal discussion was the
discovery of the throughput contention issue.
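For concreteness, the shape of the policy we have in mind is roughly the
following (a sketch; the field names are mine and purely illustrative):

    import java.time.Duration;

    // Illustrative shape of the bursting policy discussed above.
    public record BurstPolicy(
            long fixedRateBytesPerSecond,  // the guaranteed fixed rate, e.g. 10 MB/s
            double burstMultiplier,        // e.g. 1.5x -> a 15 MB/s ceiling while bursting
            Duration maxBurstDuration,     // X: how long a single burst may last
            Duration minTimeBetweenBursts  // Y: cool-down before the next burst is allowed
    ) {}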


> > Just as a side note, in practice, resource group level rate limiting is
> > very imprecise due to the inherent nature that it has to sync broker
> > level data between multiple brokers at a frequency; thus the data is
> > always stale and the produce goes way beyond the set limit, even when
> > resource groups use the precise rate limiter today.
>
> I think that it's a good starting point for further improvements based
> on feedback. It's currently in early stages.
> Based on what you have shared about your use cases for rate limiting,
> I'd assume that you would go beyond a single broker in the future.
>

We do use resource groups for certain namespace level quotas, but even in
our use case, rate limiters and resource groups are on two separate
tangents. At least for the foreseeable future.


> It would be interesting to hear more of the practical details of what
> makes your use case so special that you need a custom rate limiter and
> it couldn't be covered by a generic rate limiter which is configured
> for your use case and requirements. I guess I just didn't want to keep
> on repeating the same things again and again; instead, I've wanted
> to learn more about your use cases and work together with you to solve
> your requirements and attempt to generalize them in a way that we could
> have the implementation directly in Pulsar core.
> As we discussed, these two attempts aren't conflicting. We might end up
> having a generic rate limiter that suits most of your requirements,
> and for some requirements you might need that pluggable rate limiter.
> However, across all these message threads, there haven't yet been any
> concrete examples that make your requirements so special that they
> couldn't be covered directly in Pulsar core. Perhaps that's something
> that you are also iterating on, and you might be learning new ways to
> handle your use cases and requirements while we make progress and
> small steps on improving the current rate limiter and its
> maintainability in preparation for further improvements.
>

Will close this before the 23rd.


> > I did not want to confuse you here by bringing more than one broker into
> > the picture. In any case, resource groups are not a solution here. What
> > you are suggesting is that we put a check on a namespace or tenant by
> > putting in resource group level rate limiting - all that does is restrict
> > the scope of contention for capacity within that tenant or namespace.
> > Moreover, resource group level rate limiting is not precise. In another
> > internal discussion about finalizing the requirements from our side, we
> > found that in the example I provided, where multiple topics are bursting
> > at the same time and contending for capacity (globally or within a
> > resource group), there will be situations where the contention leads to
> > throttling of a topic which is not even trying to burst - since the
> > broker level/resource group level rate limiter will just block any and
> > all produce requests after the permits have been exhausted. The bursting
> > topics could very well have exhausted those in the first 800ms of a
> > second, while a non-bursting topic actually should have had the right to
> > produce in the remaining 200ms of that second.
>
> Thanks for sharing this feedback about resource groups. The problem
>

This is actually not a resource group limitation, but a general broker
level example. Even if resource groups weren't in the picture, this issue
would remain. The fact is that since we need to have reasonable broker
level limits (as a result of our NFR testing etc.), there will be clashes
where some topics trying to burst take up the broker's capacity for
serving the fixed rate of other topics. This needs special handling even
with the dual token bucket approach. I will have detailed requirements
with examples by the 23rd.

> > That is one of the reasons for putting a strict check on the amount,
> > duration and frequency of bursting.
>
> Please explain more about this. It would be helpful for me in trying
> to understand your requirements.
> One question about the solution of "strict check on amount, duration
> and frequency of bursting".
> Do you already have a PoC where you have validated that the solution
> actually solves the problem?
>

No, this is still theoretical, based on our understanding of rate limiters,
dual tokens, the current Pulsar code, etc. If topics are allowed to burst
without a real duration based limitation, then the chances of more and more
topics contending for the broker's actual capacity are high, and thus it
hinders/contends with (a) a new topic trying to burst and make use of the
bursting feature with its SLA and (b) another topic, not bursting, well
within its fixed rate, still being rate limited due to a lack of capacity
(which is taken up by bursting topics) at the broker level.
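As a toy calculation of scenario (b), with all numbers hypothetical:

    // Toy illustration of scenario (b) above: two bursting topics drain the
    // shared broker-level budget early in the second, so a topic well within
    // its fixed rate still gets throttled. All numbers are hypothetical.
    public class BrokerContentionSketch {
        public static void main(String[] args) {
            long brokerBudgetPerSecond = 100L * 1024 * 1024; // 100 MB/s limit
            long remaining = brokerBudgetPerSecond;

            // Two topics bursting at 45 MB/s each send their second's worth
            // up front (say, the first ~800 ms), consuming 90 MB of budget.
            remaining -= 2 * 45L * 1024 * 1024;

            // A non-bursting topic then asks for its fixed 15 MB/s in the
            // remaining ~200 ms, but only 10 MB of budget is left.
            long fixedRateTopic = 15L * 1024 * 1024;
            System.out.println("non-bursting topic throttled: "
                    + (fixedRateTopic > remaining)); // prints: true
        }
    }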



> Could you simply write a PoC by forking apache/pulsar and changing the
> code directly without having it pluggable initially?
>
>
This would be a last resort :). At most, if things don't go as desired, we
would end up doing this and adding the plugin logic in a Pulsar fork.
From a PoC perspective, we may try it out soon.


> > Would love to hear a plan from your point of view and what you envision.
>
> Thanks, I hope I was able to express most of it in these long emails.
> I'm having a break next week and after that I was thinking of
> summarizing this discussion from my viewpoint and then meeting in the
> Pulsar community meeting on November 23rd to discuss the summary and
> conclusions and the path forward. Perhaps you could also prepare in a
>

Sounds like a plan!


> similar way where you summarize your viewpoints and we discuss this on
> Nov 23rd in the Pulsar community meeting together with everyone who is
> interested to participate. If we have completed the preparation before
> the meeting, we could possibly already exchange our summaries
> asynchronously before the meeting. Girish, would this work for you?
>
Yes, we can exchange it before the 23rd. I can come back with my final
requirements and plan by the end of next week.

Regards
-- 
Girish Sharma
