This isn't as helpful as I would like, but in case it helps: the
description of this problem sounds similar to an incident we had at Datadog
at some point in the past year. I can't remember the details and I can't
find it quickly right now, so it might not be identical. IIRC we observed a
Cassandra cluster using ~100% of its time preparing statements according to
our Java continuous profiling. We weren't sure if the bug was in Cassandra
or in the gocql driver we use, which auto-prepares statements. IIRC we
ended up scaling the cluster and/or turning off the source application and
ramping it back up slowly, and we weren't able to reproduce the issue
afterwards.

Evan Jones




On Mon, Dec 15, 2025 at 1:33 PM Jaydeep Chovatia <[email protected]>
wrote:

> No problem, Alex. I'm also sorry for not pinging you a couple more times,
> as I assumed this was a corner case only I was seeing. It is now clear that
> a few other folks in the industry have faced it as well.
> Please let Runtian or me know if you need any additional information on
> our end. Thank you!
>
> Jaydeep
>
> On Mon, Dec 15, 2025 at 9:47 AM Alex Petrov <[email protected]> wrote:
>
>> Thank you for explaining. I'll dig through the code to try to remember
>> why we introduced eviction, just to make sure we aren't going to introduce
>> a correctness issue in place of a perf/operational issue (which I am not
>> claiming is the case btw, just not fully certain yet).
>>
>> Also, Jaydeep, sorry for dropping the ball on this: I was under the
>> impression this had lost importance and hadn't realized it was pending all
>> that time.
>>
>> On Mon, Dec 15, 2025, at 6:41 PM, Runtian Liu wrote:
>>
>> Alex, you're absolutely right that this isn’t a correctness issue—the
>> system will eventually re-prepare the statement. The problem, however,
>> shows up in real production environments under high QPS.
>>
>> When a node is serving a heavy workload, the race condition described in
>> the ticket causes repeated evictions followed by repeated re-prepare
>> attempts. Instead of a single re-prepare, we see a *storm* of re-prepare
>> requests hitting the coordinator. This quickly becomes expensive: it
>> increases CPU usage, adds latency, and in our case escalated into a
>> cluster-wide performance degradation. We actually experienced an outage
>> triggered by this behavior.
>>
>> So while correctness is preserved, the operational impact is severe.
>> Preventing the unnecessary eviction avoids the re-prepare storm entirely,
>> which is why we believe this patch is important for stability in real
>> clusters.
>>
>> On Mon, Dec 15, 2025 at 8:00 AM Paulo Motta <[email protected]> wrote:
>>
>> I wanted to note I recently faced the issue described in this ticket in a
>> real cluster. I'm not familiar enough with this area to understand if there
>> are any negative implications of this patch.
>>
>> So even if it's not a correctness issue per se, if it fixes a practical
>> issue faced by users without negative consequences, I don't see why this
>> should not be accepted, especially since it has been validated in production.
>>
>> On Mon, 15 Dec 2025 at 07:28 Alex Petrov <[email protected]> wrote:
>>
>>
>> iirc I reviewed it and mentioned this is not a correctness issue since we
>> would simply re-prepare. I can't recall why we needed to evict, but I think
>> this was for correctness reasons.
>>
>> Would you mind elaborating on why simply letting it get re-prepared is
>> harmful behavior? Or am I missing something and this has larger
>> implications?
>>
>> To be clear, I am not opposed to this patch, I just want to understand the
>> implications better.
>>
>> On Sun, Dec 14, 2025, at 9:03 PM, Jaydeep Chovatia wrote:
>>
>> Hi
>>
>> I had reported this bug (CASSANDRA-17401
>> <https://issues.apache.org/jira/browse/CASSANDRA-17401>) in 2022 along
>> with the fix (PR#3059 <https://github.com/apache/cassandra/pull/3059>)
>> and a reproducer (PR#3058
>> <https://github.com/apache/cassandra/pull/3058>). I already applied this
>> fix internally, and it has been working fine for many years. Now we can see
>> one of the Cassandra users has been facing the exact same problem. I have
>> told them to go with the private fix for now.
>> Paulo and Alex reviewed it partially; could you (or someone) please
>> complete the review so I can land it in the official repo?
>>
>> Jaydeep
>>
>>
>>
>>
