Re: [DISCUSS] Donating easy-cass-stress to the project

2024-05-03 Thread Alexander DEJANOVSKI
Hi folks,

I'm familiar with the codebase and can help with the maintenance and
evolution.
I already have some additional profiles that I can push there, which were
never merged into the main branch of tlp-cluster.

I love this tool (I know I'm biased) and hope it gets the attention it
deserves.

On Tue, Apr 30, 2024, 23:17 Jordan West wrote:

> I would likely commit to it as well
>
> Jordan
>
> On Mon, Apr 29, 2024 at 10:55 David Capwell  wrote:
>
>> So: besides Jon, who in the community expects/desires to maintain this
>> going forward?
>>
>>
>> I have been maintaining a fork for years, so don’t mind helping maintain
>> this project.
>>
>> On Apr 28, 2024, at 4:08 AM, Mick Semb Wever  wrote:
>>
>> A separate subproject like dtest and the Java driver would maybe help
>>> address concerns with introducing a gradle build system and Kotlin.
>>>
>>
>>
>> Nit: dtest is a separate repository, not a subproject.  The Java driver
>> is one of the repositories in the Drivers subproject.  Esoteric maybe, but
>> it's ASF terminology we need to get right :-)
>>
>> To your actual point (IIUC), it can be a separate repository without being
>> a separate subproject.  This permits it to be Kotlin+Gradle while avoiding
>> the formal subproject procedures.  It still needs three responsible
>> committers from the get-go to show sustainability.  Would easy-cass-stress
>> have releases, or would it always be a codebase users work with directly?
>>
>> Can/Should we first demote cassandra-stress by moving it out to a
>> separate repo?
>> (Can its imports work off non-snapshot dependencies?)
>> It might feel like an extra prerequisite step to introduce, but maybe it
>> helps move the needle forward and make this conversation a bit
>> easier/obvious.


Re: [EXTERNAL] Re: [Discuss] Generic Purpose Rate Limiter in Cassandra

2024-05-03 Thread Caleb Rackliffe
FYI, there is some ongoing, sort-of-related work in CASSANDRA-19534


On Wed, Apr 10, 2024 at 6:35 PM Jaydeep Chovatia 
wrote:

> Just created an official CEP-41, incorporating the feedback from this
> discussion. Feel free to let me know if I missed any important feedback in
> this thread that is not captured in CEP-41.
>
> Jaydeep
>
> On Thu, Feb 22, 2024 at 11:36 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Thanks, Josh. I will file an official CEP with all the details in a few
>> days and update this thread with that CEP number.
>> Thanks a lot everyone for providing valuable insights!
>>
>> Jaydeep
>>
>> On Thu, Feb 22, 2024 at 9:24 AM Josh McKenzie 
>> wrote:
>>
>>> Do folks think we should file an official CEP and take it there?
>>>
>>> +1 here.
>>>
>>> Synthesizing your gdoc, Caleb's work, and the feedback from this thread
>>> into a draft seems like a solid next step.
>>>
>>> On Wed, Feb 7, 2024, at 12:31 PM, Jaydeep Chovatia wrote:
>>>
>>> I see a lot of great ideas that have been discussed or proposed in the
>>> past to cover the most common candidate use cases for a rate limiter. Do
>>> folks think we should file an official CEP and take it there?
>>>
>>> Jaydeep
>>>
>>> On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe 
>>> wrote:
>>>
>>> I just remembered the other day that I had done a quick writeup on the
>>> state of compaction stress-related throttling in the project:
>>>
>>>
>>> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>>>
>>> I'm sure most of it is old news to the people on this thread, but I
>>> figured I'd post it just in case :)
>>>
>>> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie 
>>> wrote:
>>>
>>>
>>> 2.) We should make sure the links between the "known" root causes of
>>> cascading failures and the mechanisms we introduce to avoid them remain
>>> very strong.
>>>
>>> Seems to me that our historical strategy was to address individual known
>>> cases one-by-one rather than looking for a more holistic load-balancing and
>>> load-shedding solution. While the engineer in me likes the elegance of a
>>> broad, more-inclusive *actual SEDA-like* approach, the pragmatist in me
>>> wonders how far we think we are today from a stable set-point.
>>>
>>> i.e. are we facing a handful of cases where nodes can still get pushed
>>> over and then cascade that we can surgically address, or are we facing a
>>> broader lack of back-pressure that rears its head in different domains
>>> (client -> coordinator, coordinator -> replica, internode with other
>>> operations, etc) at surprising times and should be considered more
>>> holistically?
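To make the "holistic" option concrete: a generic back-pressure point would gate admission on an aggregate signal (in-flight work, queue depth) rather than on per-cause thresholds. A minimal sketch of such a gate, purely illustrative and not tied to any existing Cassandra class:

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative load-shedding gate: admit new work only while the number
    // of in-flight requests stays under a configured ceiling; otherwise shed
    // and let the client back off.
    public class AdmissionGate {
        private final int maxInFlight;
        private final AtomicInteger inFlight = new AtomicInteger();

        public AdmissionGate(int maxInFlight) {
            this.maxInFlight = maxInFlight;
        }

        public boolean tryEnter() {
            while (true) {
                int current = inFlight.get();
                if (current >= maxInFlight) {
                    return false; // shed / apply back-pressure
                }
                if (inFlight.compareAndSet(current, current + 1)) {
                    return true;
                }
            }
        }

        public void exit() {
            inFlight.decrementAndGet();
        }
    }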
>>>
>>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>>
>>> I almost forgot CASSANDRA-15817, which introduced
>>> reject_repair_compaction_threshold, a mechanism to stop repairs while
>>> compaction is underwater.
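In spirit, that valve is just a threshold check before admitting repair work; roughly along these lines (a paraphrase of the idea with made-up names, not the actual CASSANDRA-15817 code):

    // Conceptual check only: refuse new repair sessions while the compaction
    // backlog exceeds a configured pending-task threshold (0 disables it).
    public final class RepairAdmission {
        private RepairAdmission() {}

        public static boolean shouldRejectRepair(int pendingCompactions, int rejectThreshold) {
            return rejectThreshold > 0 && pendingCompactions > rejectThreshold;
        }
    }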
>>>
>>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe 
>>> wrote:
>>>
>>>
>>> Hey all,
>>>
>>> I'm a bit late to the discussion. I see that we've already discussed
>>> CASSANDRA-15013 and CASSANDRA-16663 at least in passing. Having written
>>> the latter, I'd be the first to admit it's a crude tool, although it's
>>> been useful here and there, and provides a couple of primitives that may
>>> be useful for future work. As Scott mentions, while it is configurable at
>>> runtime, it is not adaptive, although we did make configuration easier in
>>> CASSANDRA-17423. It is also global to the node, although we've lightly
>>> discussed some ideas around making it more granular. (For example,
>>> keyspace-based limiting, or limiting "domains" tagged by the client in
>>> requests, could be interesting.) It also does not deal with inter-node
>>> traffic, of course.
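For illustration, keyspace-scoped limiting could be as simple as one token bucket per keyspace in front of the coordinator path; a sketch using Guava's RateLimiter (names here are made up for the example, and this is not code from any of the tickets mentioned above):

    import com.google.common.util.concurrent.RateLimiter;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: each keyspace gets its own token bucket; requests that exceed
    // the keyspace's budget are shed rather than queued.
    public class KeyspaceRateLimiter {
        private final ConcurrentHashMap<String, RateLimiter> limiters = new ConcurrentHashMap<>();
        private final double permitsPerSecond;

        public KeyspaceRateLimiter(double permitsPerSecond) {
            this.permitsPerSecond = permitsPerSecond;
        }

        /** Returns true if the request may proceed, false if it should be rejected. */
        public boolean tryAcquire(String keyspace) {
            return limiters
                .computeIfAbsent(keyspace, ks -> RateLimiter.create(permitsPerSecond))
                .tryAcquire();
        }
    }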
>>>
>>> Something we've not yet mentioned (that does address internode traffic)
>>> is CASSANDRA-17324, which I proposed shortly after working on the native
>>> request limiter (and have just not had much time to return to). The basic
>>> idea is this:
>>>
>>> When a node is struggling under the weight of a compaction backlog and
>>> becomes a cause of increased read latency for clients, we have two safety
>>> valves:
>>>
>>>
>>> 1.) Disabling the native protocol server, which stops the node from
>>> coordinating reads and writes.
>>> 2.) Jacking up the severity on the node, which tells the dynamic snitch
>>> to avoid the node for reads from other coordinators.
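Operationally, both valves can already be pulled over JMX; a hedged sketch follows (the MBean and attribute names are from memory and may differ across versions, so treat them as assumptions to verify; valve 1 has the same effect as `nodetool disablebinary`):

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import java.util.Collections;

    public class SafetyValves {
        public static void main(String[] args) throws Exception {
            // Connect to the node's JMX endpoint (7199 is the usual default).
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url, Collections.emptyMap())) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Valve 1: stop the native protocol server so the node no longer
                // coordinates client reads and writes.
                ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
                mbs.invoke(storageService, "stopNativeTransport", new Object[0], new String[0]);

                // Valve 2: raise the node's severity so the dynamic snitch steers
                // reads from other coordinators away from it.
                ObjectName snitch =
                    new ObjectName("org.apache.cassandra.db:type=DynamicEndpointSnitch");
                mbs.setAttribute(snitch, new Attribute("Severity", 5.0));
            }
        }
    }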
>>>
>>>
>>> These are useful, but we don’t appear to have any mechanism that would
>>> allow us to temporarily reject internode hint, ba