If we’re debating the overall approach, I think we need to define what we want 
to achieve before we pursue any specific design.

I think rate limiting is simply a proxy for cluster stability, and implicitly 
we all also want to achieve client fairness. Rate limiting is one proposal for 
achieving only the first - a poor one IMO, driven I think by a lack of 
guarantees currently provided by the rest of the system.

Through this lens, load balancing is IMO a performance improvement - an 
important one, but orthogonal to achieving stability itself. My personal view 
is that the best next step for achieving both is improved load shedding, 
building on Alex’s earlier work, which I think is broadly what Jon is saying.

The best improvement IMO is ensuring load shedding is fair. That is, we should 
preferentially shed load destined for replicas that already have the most work 
sent to them, and for clients with the most work already queued. Concretely, we 
should partition the client work queue by client, and as we receive 
back-pressure signals from overloaded replicas we should begin materialising a 
view of the client work queue attributing queued work to that replica. We can 
then expire work based on the worst queue depth of any queue. We should also 
use this to override the behaviour of the dynamic snitch; we should not send 
work to replicas with local queues that cannot flush due to back-pressure.
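
As a very rough sketch of the shape I have in mind - all of the names here 
(ClientWorkQueues, Task, etc.) are invented for illustration and none of this 
is existing Cassandra code:

```java
// Illustrative sketch only: ClientWorkQueues, Task, etc. are hypothetical names,
// not existing Cassandra classes.
import java.net.InetSocketAddress;
import java.util.Deque;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentMap;

final class ClientWorkQueues
{
    static final class Task
    {
        final String clientId;
        final InetSocketAddress replica;   // target replica for this unit of work
        final long enqueuedAtNanos;        // arrival time, used for oldest-first expiry

        Task(String clientId, InetSocketAddress replica, long enqueuedAtNanos)
        {
            this.clientId = clientId;
            this.replica = replica;
            this.enqueuedAtNanos = enqueuedAtNanos;
        }
    }

    // The client work queue, partitioned by client.
    private final ConcurrentMap<String, Deque<Task>> byClient = new ConcurrentHashMap<>();

    void submit(Task task)
    {
        byClient.computeIfAbsent(task.clientId, k -> new ConcurrentLinkedDeque<>()).add(task);
    }

    // On a back-pressure signal from an overloaded replica, materialise a view of the
    // queued work targeting that replica, attributed per client, and expire from the
    // client with the deepest such queue first.
    void onBackPressure(InetSocketAddress overloadedReplica)
    {
        String worstClient = null;
        int worstDepth = 0;
        for (Map.Entry<String, Deque<Task>> e : byClient.entrySet())
        {
            int depth = 0;
            for (Task t : e.getValue())
                if (t.replica.equals(overloadedReplica))
                    depth++;
            if (depth > worstDepth)
            {
                worstDepth = depth;
                worstClient = e.getKey();
            }
        }
        if (worstClient != null)
            expireOldest(worstClient, overloadedReplica);
    }

    private void expireOldest(String clientId, InetSocketAddress replica)
    {
        Deque<Task> queue = byClient.get(clientId);
        if (queue == null)
            return;
        // Remove the oldest queued task destined for the overloaded replica; notifying
        // the client of the shed work can be deferred (see further down in this mail).
        for (Iterator<Task> it = queue.iterator(); it.hasNext(); )
        {
            if (it.next().replica.equals(replica))
            {
                it.remove();
                return;
            }
        }
    }
}
```

Whether the per-replica view is recomputed on each signal, as above, or 
maintained incrementally is an implementation detail; the important property is 
that expiry targets the client/replica pair contributing most to the overload.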

I think, once this is in place, the internal queue depths Jon discusses would 
be useful signals to the internode inbound message handler to propagate back to 
peers’ outbound message queues (and, transitively, the client message queue 
management).
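
A minimal sketch of what that propagation could look like - again purely 
invented names; neither QueueDepthSignal nor a piggy-backed depth header exist 
today:

```java
// Hypothetical sketch: the receiving node advertises its stage queue depth back to the
// sending peer, which folds it into its outbound queue decisions.
import java.util.OptionalInt;
import java.util.concurrent.atomic.AtomicInteger;

final class QueueDepthSignal
{
    private final int softLimit;                              // depth above which we start advertising
    private final AtomicInteger peerAdvertisedDepth = new AtomicInteger();

    QueueDepthSignal(int softLimit)
    {
        this.softLimit = softLimit;
    }

    // Receiving side: called by the inbound message handler after enqueueing work for a
    // local stage; the returned depth would be piggy-backed on the response to the peer.
    OptionalInt depthToAdvertise(int localStageQueueDepth)
    {
        return localStageQueueDepth > softLimit ? OptionalInt.of(localStageQueueDepth)
                                                : OptionalInt.empty();
    }

    // Sending side: called when a response carrying an advertised depth arrives, so the
    // outbound message queue (and, transitively, client queue management) can react.
    void onAdvertisedDepth(int depth)
    {
        peerAdvertisedDepth.set(depth);
    }

    // Sending side: the outbound queue consults this before dispatching more work to the peer.
    boolean peerLooksOverloaded()
    {
        return peerAdvertisedDepth.get() > softLimit;
    }
}
```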

I think there are also improvements to be made regarding how we select work to 
be expired, such as partitioning by workload type, predicting whether there is 
enough time for the system to process a query (e.g. not submitting work that is 
close to its expiration), considering the size of the client payload, etc. 
There is also the question of when to notify a client of timeout/overload - it 
may be preferable to discard the work but delay notifying the client, to avoid 
the client retrying too quickly.
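
For instance, a deliberately simple (and entirely hypothetical) policy of that 
shape - the cost model and names below are invented - might look like:

```java
// Hypothetical sketch: illustrates deadline-aware shedding plus deferred client
// notification; the linear payload cost model is a placeholder, not a proposal.
final class ExpiryPolicy
{
    private final long estimatedServiceNanos;   // rough per-request cost for this workload type
    private final long nanosPerPayloadByte;     // crude scaling for larger client payloads

    ExpiryPolicy(long estimatedServiceNanos, long nanosPerPayloadByte)
    {
        this.estimatedServiceNanos = estimatedServiceNanos;
        this.nanosPerPayloadByte = nanosPerPayloadByte;
    }

    // Don't submit work that is unlikely to complete before its client-visible deadline.
    boolean shouldShed(long nowNanos, long deadlineNanos, long payloadBytes)
    {
        long remaining = deadlineNanos - nowNanos;
        long predicted = estimatedServiceNanos + payloadBytes * nanosPerPayloadByte;
        return remaining < predicted;
    }

    // Even if we discard the work now, we may prefer to hold the timeout/overload response
    // until around the original deadline, so the client does not immediately retry.
    long notifyClientAtNanos(long nowNanos, long deadlineNanos)
    {
        return Math.max(nowNanos, deadlineNanos);
    }
}
```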



> On 23 Sep 2024, at 08:39, Alex Petrov <al...@coffeenco.de> wrote:
> 
> > Are those three sufficient to protect against a client that unexpectedly 
> > comes up with 100x a previous provisioned-for workload? Or 100 clients at 
> > 100x concurrently? Given that can be 100x in terms of quantity (helped by 
> > queueing and shedding), but also 100x in terms of computational and disk 
> > implications? We don't have control over what users do, and while in an 
> > ideal world client applications have fairly predictable workloads over 
> > time, in practice things spike around various events for different 
> > applications.
> 
> To fully answer your question I would probably have to flesh out a CEP (which 
> I hope to get to ASAP), and I will elaborate on all of the points at 
> Community over Code this year (and will do my best to put it in writing for 
> the ones who will not attend). But to briefly answer, one of the ideas is 
> exactly the fact that one thing in a queue is not the same as another thing 
> in a queue. 
> 
> > I see these 2 things as complementary, not as interdependent. Is there 
> > something I'm missing?
> 
> I think if we start working on rate limiting before we implement good load 
> balancing, we risk shedding load that could otherwise have been handled by 
> the cluster. I think you even said it yourself with "over time it would raise 
> the ceiling at which rate limiting kicked in".
> 
> Besides, in CASSANDRA-19534 
> <https://issues.apache.org/jira/browse/CASSANDRA-19534> I was attempting to 
> show that we need to find the maximum natural throughput that the cluster can 
> handle without tipping over, and maintain it. And the easiest way to handle 
> this was, naturally, through user-set timeouts. We can work in resource 
> limits, but as of now I see only marginal improvement over what we can do 
> just with timeouts.
> 
> One of the risks of a misconfigured rate-limiter is avoidable throttling / 
> shedding. For example, the CEP mentions a trigger at 80% CPU utilization. But 
> while testing 19534, we have seen that we can easily burst into high CPU 
> utilization, let the requests that do not satisfy client timeout boundaries 
> be shed, and continue operating without any additional rate-limiting. My 
> guess is that the existing rate-limiter sees less usage than we wish it to 
> primarily because it is hard to say where to set the limits for throwing 
> OverloadedException, and TCP throttling/backoff does not work because we lose 
> the queue arrival timestamp and start triggering client timeouts that are 
> invisible to the server (i.e. the client retries while the request is still 
> in the queue). While the latter problem has a trivial solution (i.e. 
> client-set deadlines), the former probably requires some auto-tuning or 
> guidance. 
> 
> Another example is "culprit Keyspaces" from the CEP. If we introduce fairness 
> into our load-balancing, a single keyspace or replica set (partition) will 
> not be able to dominate the cluster workload, causing across-the-board 
> timeouts. Which means that by simply giving preference to other requests we 
> have naturally shed an imbalance without introducing any rate-limiting.
> 
> Maybe the problem is in the terminology, but I think we should choose 
> "prioritize read/write coordinator workload" over "block any {read/write} 
> {coordinator/replication} traffic for a table". Let me give an example. Let's 
> say we implement some algorithm for replenishing the request allowance for 
> write replication for a table, and this table runs out of tokens. If we 
> decide to shed this request instead of waiting until the last moment at which 
> it could potentially have been processed, we risk shedding a request that 
> could have been served. But if we gave such a request a lower priority, we 
> can get to it when we get to it, given current resources and queues. If we 
> can still process it, we will, even if it is at the very end of our timeout 
> guarantee, and I think it does not need to be shed.
> 
> I think what I have in mind seems to jibe very well with the points you have 
> brought up:
>   * How can nodes protect themselves from variable and changing user behavior 
> in a way that's minimally disruptive to the user and requires as little 
> configuration as possible for operators
>   * How do we keep the limits of node performance from leaking into the scope 
> of user awareness and responsibility outside simply pushing various 
> exceptions to the client to indicate what's going on 
> 
> To summarize, I think good load balancing and workload prioritization 
> combined with just "give up on a request if we know for a fact we can not 
> process it" feels like a simpler way to solve the two problems you mentioned, 
> and it will help us maximize cluster utilization while not shedding the load 
> that could otherwise have been served, _while_ also reducing latencies across 
> the board.
> 
> 
> On Sat, Sep 21, 2024, at 1:35 PM, Josh McKenzie wrote:
>> Are those three sufficient to protect against a client that unexpectedly 
>> comes up with 100x a previous provisioned-for workload? Or 100 clients at 
>> 100x concurrently? Given that can be 100x in terms of quantity (helped by 
>> queueing and shedding), but also 100x in terms of computational and disk 
>> implications? We don't have control over what users do, and while in an 
>> ideal world client applications have fairly predictable workloads over time, 
>> in practice things spike around various events for different applications.
>> 
>> i.e. 1 thing in a queue could produce orders of magnitude more work than 1 
>> other thing in a queue. Especially as we move into a world with SAI and 
>> Accord.
>> 
>>> we need to solve load balancing (summarized by the above three points) 
>>> before we start working on the rate limiter
>> I see these 2 things as complementary, not as interdependent. Is there 
>> something I'm missing?
>> 
>> At least for me, mentally I frame this as "How can nodes protect themselves 
>> from variable and changing user behavior in a way that's minimally 
>> disruptive to the user and requires as little configuration as possible for 
>> operators?". Basically, how do we keep the limits of node performance from 
>> leaking into the scope of user awareness and responsibility outside simply 
>> pushing various exceptions to the client to indicate what's going on 
>> (OverloadedException to the client, etc).
>> 
>> It seems to me both rate limiting and resource balancing are integral parts 
>> of this, but also parts that could be worked on independently. Were we to 
>> wave a magic wand tomorrow and have robust rate limiting, as we improved 
>> load balancing over time it would raise the ceiling at which rate limiting 
>> kicked in.
>> 
>> So concretely to the thread, I think I agree with Jon:
>>> * use the rate of timeouts to limit the depth of the queues for each of the 
>>> thread pools
>>> * reject requests when the queue is full with an OverloadedException.
>> followed by:
>>> If you want to follow this up with the ability to dynamically resize thread 
>>> pools that could be interesting.
>> 
>> Simple is very much a feature here.
>> 
>> On Sat, Sep 21, 2024, at 5:20 AM, Alex Petrov wrote:
>>> > Personally, I’m a bit skeptical that we will come up with a metric based 
>>> > heuristic that works well in most scenarios and doesn’t require 
>>> > significant knowledge and tuning. I think past implementations of the 
>>> > dynamic snitch are good evidence of that.
>>> 
>>> I am more optimistic on that front. I think we can achieve a lot. However, 
>>> in my opinion, we need to focus on balancing the load rather than rate 
>>> limiting. Rate limiting is going to be important if/when we decide to 
>>> implement workload isolation. Until then, I think we should focus on three 
>>> things:
>>> 
>>>   * Node health (Nodes should produce useful work and should be stable and 
>>> not overloaded)
>>>   * Latency (we always need to find an optimal way to process requests and 
>>> minimize overall queueing time)
>>>   * Fairness (avoid workload and utilization imbalances)
>>> 
>>> All three points are achievable with very straightforward approaches that 
>>> will not require much operator involvement.
>>> 
>>> I guess my main point is we need to solve load balancing (summarized by the 
>>> above three points) before we start working on the rate limiter, but 
>>> there's a good chance we may not need one apart from use cases that require 
>>> workload isolation. 
>>> 
>>> 
>>> On Fri, Sep 20, 2024, at 8:14 PM, Jordan West wrote:
>>>> +1 to Benedict’s (and others) comments on plugability and low overhead 
>>>> when disabled. The latter I think needs little justification. The reason I 
>>>> am big on the former is, in my opinion: decisions on approach need to be 
>>>> settled with numbers not anecdotes or past experience (including my own). 
>>>> So I would like to see us compare different approaches (what metrics to 
>>>> use, etc). 
>>>> 
>>>> Personally, I’m a bit skeptical that we will come up with a metric based 
>>>> heuristic that works well in most scenarios and doesn’t require 
>>>> significant knowledge and tuning. I think past implementations of the 
>>>> dynamic snitch are good evidence of that. However, I expressed the same 
>>>> concerns internally for a client level project where we exposed metrics to 
>>>> induce back pressure and early experiments are encouraging / contrary to 
>>>> my expectations. At different layers different approaches can work better 
>>>> or worse. Same with different workloads. I don’t think we should dismiss 
>>>> approaches outright in this thread without hard numbers. 
>>>> 
>>>> In short, I think the testing and evaluation of this CEP is as important 
>>>> as its design and implementation. We will need to test a wide variety of 
>>>> workloads and potentially implementations and that’s where pluggability 
>>>> will be a huge benefit. I would go as far as saying the CEP should focus 
>>>> more on a framework for pluggable implementations that has low to zero 
>>>> cost when disabled than a specific set of metrics to use or specific 
>>>> approach. 
>>>> 
>>>> Jordan 
>>>> 
>>>> On Thu, Sep 19, 2024 at 14:38 Benedict Elliott Smith <bened...@apache.org 
>>>> <mailto:bened...@apache.org>> wrote:
>>>> I just want to flag here that this is a topic I have strong opinions on, 
>>>> but the CEP is not really specific or detailed enough to understand 
>>>> precisely how it will be implemented. So, if a patch is already being 
>>>> produced, most of my feedback is likely to be provided some time after a 
>>>> patch appears, through the normal review process. I want to flag this now 
>>>> to avoid any surprise.
>>>> 
>>>> I will say upfront that, ideally, this system should be designed to 
>>>> have ~zero overhead when disabled, and with minimal coupling (between its 
>>>> own components and C* itself), so that entirely orthogonal approaches can 
>>>> be integrated in future without polluting the codebase.
>>>> 
>>>> 
>>>>> On 19 Sep 2024, at 19:14, Patrick McFadin <pmcfa...@gmail.com 
>>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>> 
>>>>> The work has begun but we don't have a VOTE thread for this CEP. Can one 
>>>>> get started?
>>>>> 
>>>>> On Mon, May 6, 2024 at 9:24 PM Jaydeep Chovatia 
>>>>> <chovatia.jayd...@gmail.com <mailto:chovatia.jayd...@gmail.com>> wrote:
>>>>> Sure, Caleb. I will include the work as part of CASSANDRA-19534 
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-19534> in the CEP-41.
>>>>> 
>>>>> Jaydeep
>>>>> 
>>>>> On Fri, May 3, 2024 at 7:48 AM Caleb Rackliffe <calebrackli...@gmail.com 
>>>>> <mailto:calebrackli...@gmail.com>> wrote:
>>>>> FYI, there is some ongoing sort-of-related work going on in 
>>>>> CASSANDRA-19534 <https://issues.apache.org/jira/browse/CASSANDRA-19534>
>>>>> 
>>>>> On Wed, Apr 10, 2024 at 6:35 PM Jaydeep Chovatia 
>>>>> <chovatia.jayd...@gmail.com <mailto:chovatia.jayd...@gmail.com>> wrote:
>>>>> Just created an official CEP-41 
>>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-41+%28DRAFT%29+Apache+Cassandra+Unified+Rate+Limiter>
>>>>>  incorporating the feedback from this discussion. Feel free to let me 
>>>>> know if I may have missed some important feedback in this thread that is 
>>>>> not captured in the CEP-41.
>>>>> 
>>>>> Jaydeep
>>>>> 
>>>>> On Thu, Feb 22, 2024 at 11:36 AM Jaydeep Chovatia 
>>>>> <chovatia.jayd...@gmail.com <mailto:chovatia.jayd...@gmail.com>> wrote:
>>>>> Thanks, Josh. I will file an official CEP with all the details in a few 
>>>>> days and update this thread with that CEP number.
>>>>> Thanks a lot everyone for providing valuable insights!
>>>>> 
>>>>> Jaydeep
>>>>> 
>>>>> On Thu, Feb 22, 2024 at 9:24 AM Josh McKenzie <jmcken...@apache.org 
>>>>> <mailto:jmcken...@apache.org>> wrote:
>>>>> 
>>>>>> Do folks think we should file an official CEP and take it there?
>>>>> +1 here.
>>>>> 
>>>>> Synthesizing your gdoc, Caleb's work, and the feedback from this thread 
>>>>> into a draft seems like a solid next step.
>>>>> 
>>>>> On Wed, Feb 7, 2024, at 12:31 PM, Jaydeep Chovatia wrote:
>>>>>> I see a lot of great ideas being discussed or proposed in the past to 
>>>>>> cover the most common rate limiter candidate use cases. Do folks think 
>>>>>> we should file an official CEP and take it there?
>>>>>> 
>>>>>> Jaydeep
>>>>>> 
>>>>>> On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe <calebrackli...@gmail.com 
>>>>>> <mailto:calebrackli...@gmail.com>> wrote:
>>>>>> I just remembered the other day that I had done a quick writeup on the 
>>>>>> state of compaction stress-related throttling in the project:
>>>>>> 
>>>>>> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>>>>>> 
>>>>>> I'm sure most of it is old news to the people on this thread, but I 
>>>>>> figured I'd post it just in case :)
>>>>>> 
>>>>>> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie <jmcken...@apache.org 
>>>>>> <mailto:jmcken...@apache.org>> wrote:
>>>>>> 
>>>>>>> 2.) We should make sure the links between the "known" root causes of 
>>>>>>> cascading failures and the mechanisms we introduce to avoid them remain 
>>>>>>> very strong.
>>>>>> Seems to me that our historical strategy was to address individual known 
>>>>>> cases one-by-one rather than looking for a more holistic load-balancing 
>>>>>> and load-shedding solution. While the engineer in me likes the elegance 
>>>>>> of a broad, more-inclusive actual SEDA-like approach, the pragmatist in 
>>>>>> me wonders how far we think we are today from a stable set-point.
>>>>>> 
>>>>>> i.e. are we facing a handful of cases where nodes can still get pushed 
>>>>>> over and then cascade that we can surgically address, or are we facing a 
>>>>>> broader lack of back-pressure that rears its head in different domains 
>>>>>> (client -> coordinator, coordinator -> replica, internode with other 
>>>>>> operations, etc) at surprising times and should be considered more 
>>>>>> holistically?
>>>>>> 
>>>>>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>>>>>> I almost forgot CASSANDRA-15817, which introduced 
>>>>>>> reject_repair_compaction_threshold, which provides a mechanism to stop 
>>>>>>> repairs while compaction is underwater.
>>>>>>> 
>>>>>>>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe <calebrackli...@gmail.com 
>>>>>>>> <mailto:calebrackli...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Hey all,
>>>>>>>> 
>>>>>>>> I'm a bit late to the discussion. I see that we've already discussed 
>>>>>>>> CASSANDRA-15013 
>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-15013> and 
>>>>>>>> CASSANDRA-16663 
>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-16663> at least in 
>>>>>>>> passing. Having written the latter, I'd be the first to admit it's a 
>>>>>>>> crude tool, although it's been useful here and there, and provides a 
>>>>>>>> couple primitives that may be useful for future work. As Scott 
>>>>>>>> mentions, while it is configurable at runtime, it is not adaptive, 
>>>>>>>> although we did make configuration easier in CASSANDRA-17423 
>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-17423>. It also is 
>>>>>>>> global to the node, although we've lightly discussed some ideas around 
>>>>>>>> making it more granular. (For example, keyspace-based limiting, or 
>>>>>>>> limiting "domains" tagged by the client in requests, could be 
>>>>>>>> interesting.) It also does not deal with inter-node traffic, of course.
>>>>>>>> 
>>>>>>>> Something we've not yet mentioned (that does address internode 
>>>>>>>> traffic) is CASSANDRA-17324 
>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-17324>, which I 
>>>>>>>> proposed shortly after working on the native request limiter (and have 
>>>>>>>> just not had much time to return to). The basic idea is this:
>>>>>>>> 
>>>>>>>> When a node is struggling under the weight of a compaction backlog and 
>>>>>>>> becomes a cause of increased read latency for clients, we have two 
>>>>>>>> safety valves:
>>>>>>>> 
>>>>>>>> 1.) Disabling the native protocol server, which stops the node from 
>>>>>>>> coordinating reads and writes.
>>>>>>>> 2.) Jacking up the severity on the node, which tells the dynamic 
>>>>>>>> snitch to avoid the node for reads from other coordinators.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> These are useful, but we don’t appear to have any mechanism that would 
>>>>>>>> allow us to temporarily reject internode hint, batch, and mutation 
>>>>>>>> messages that could further delay resolution of the compaction backlog.
>>>>>>>> 
>>>>>>>> Whether it's done as part of a larger framework or on its own, it 
>>>>>>>> still feels like a good idea.
>>>>>>>> 
>>>>>>>> Thinking in terms of opportunity costs here (i.e. where we spend our 
>>>>>>>> finite engineering time to holistically improve the experience of 
>>>>>>>> operating this database) is healthy, but we probably haven't reached 
>>>>>>>> the point of diminishing returns on nodes being able to protect 
>>>>>>>> themselves from clients and from other nodes. I would just keep in 
>>>>>>>> mind two things:
>>>>>>>> 
>>>>>>>> 1.) The effectiveness of rate-limiting in the system (which includes 
>>>>>>>> the database and all clients) as a whole necessarily decreases as we 
>>>>>>>> move from the application to the lowest-level database internals. 
>>>>>>>> Limiting correctly at the client will save more resources than 
>>>>>>>> limiting at the native protocol server, and limiting correctly at the 
>>>>>>>> native protocol server will save more resources than limiting after 
>>>>>>>> we've dispatched requests to some thread pool for processing.
>>>>>>>> 2.) We should make sure the links between the "known" root causes of 
>>>>>>>> cascading failures and the mechanisms we introduce to avoid them 
>>>>>>>> remain very strong.
>>>>>>>> 
>>>>>>>> In any case, I'd be happy to help out in any way I can as this moves 
>>>>>>>> forward (especially as it relates to our past/current attempts to 
>>>>>>>> address this problem space).
