Thank you for bringing this up. There was also a re-prepare storm issue (CASSANDRA-15252) that we have since fixed. I think in 17401 the re-prepares are transient, while in 15252 they'd be permanent, as you're describing.
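To make the failure mode concrete for anyone skimming the thread below, here is a minimal, hypothetical Java sketch (made-up names, not Cassandra's actual cache code) of the dynamic Runtian describes: once a hot statement is evicted, every in-flight execution misses at once, so a single eviction fans out into many concurrent re-prepares.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch, not Cassandra's actual code: a stand-in for the
    // coordinator's prepared-statement cache, plus callers that re-prepare
    // whenever they miss.
    public class RePrepareStormSketch {
        private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        private final AtomicLong prepares = new AtomicLong();

        String execute(String id) {
            String stmt = cache.get(id);
            if (stmt == null) {
                prepares.incrementAndGet();          // one prepare round-trip
                stmt = cache.computeIfAbsent(id, k -> "PREPARED:" + k);
            }
            return stmt;
        }

        public static void main(String[] args) throws Exception {
            RePrepareStormSketch node = new RePrepareStormSketch();
            node.execute("q1");                      // initial prepare
            ExecutorService pool = Executors.newFixedThreadPool(64);
            for (int i = 0; i < 100_000; i++) {
                pool.submit(() -> node.execute("q1"));
                if (i % 1_000 == 0)
                    node.cache.remove("q1");         // spurious eviction (the race)
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            // Only ~100 evictions, yet many more prepares are possible: each
            // eviction can let many concurrent callers race past the null check.
            System.out.println("prepares = " + node.prepares.get());
        }
    }

Per Runtian's description below, the patch removes the spurious eviction itself, so the storm never starts.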
On Mon, Dec 15, 2025, at 9:27 PM, Evan Jones via dev wrote:
> This isn't as helpful as I would like, but in case it helps: the description
> of this problem sounds similar to an incident we had at Datadog at some point
> in the past year. I can't remember the details and I can't find them quickly
> right now, so it might not be identical. IIRC we observed a Cassandra cluster
> spending ~100% of its time preparing statements according to our Java
> continuous profiling. We weren't sure whether the bug was in Cassandra or in
> the gocql driver we use, which auto-prepares statements. IIRC we ended up
> scaling the cluster and/or turning off the source application and ramping it
> back up slowly, and we weren't able to reproduce the issue afterwards.
>
> Evan Jones
>
> On Mon, Dec 15, 2025 at 1:33 PM Jaydeep Chovatia <[email protected]>
> wrote:
>> No problem, Alex. I'm also sorry for not pinging you a couple more times, as
>> I assumed this was a corner case only I was seeing. It is now clear that a
>> few other folks in the industry have faced it as well.
>> Please let Runtian or me know if you need any additional information on our
>> end. Thank you!
>>
>> Jaydeep
>>
>> On Mon, Dec 15, 2025 at 9:47 AM Alex Petrov <[email protected]> wrote:
>>> Thank you for explaining. I'll dig through the code to try to remember why
>>> we introduced eviction, just to make sure we aren't going to introduce a
>>> correctness issue in place of a perf/operational one (which I am not
>>> claiming is the case, btw; I'm just not fully certain yet).
>>>
>>> Also, Jaydeep, sorry for dropping the ball on this: I was under the
>>> impression it had lost importance and hadn't realized it was pending all
>>> that time.
>>>
>>> On Mon, Dec 15, 2025, at 6:41 PM, Runtian Liu wrote:
>>>> Alex, you're absolutely right that this isn't a correctness issue: the
>>>> system will eventually re-prepare the statement. The problem, however,
>>>> shows up in real production environments under high QPS.
>>>>
>>>> When a node is serving a heavy workload, the race condition described in
>>>> the ticket causes repeated evictions followed by repeated re-prepare
>>>> attempts. Instead of a single re-prepare, we see a *storm* of re-prepare
>>>> requests hitting the coordinator. This quickly becomes expensive: it
>>>> increases CPU usage, adds latency, and in our case escalated into
>>>> cluster-wide performance degradation. We actually experienced an outage
>>>> triggered by this behavior.
>>>>
>>>> So while correctness is preserved, the operational impact is severe.
>>>> Preventing the unnecessary eviction avoids the re-prepare storm entirely,
>>>> which is why we believe this patch is important for stability in real
>>>> clusters.
>>>>
>>>> On Mon, Dec 15, 2025 at 8:00 AM Paulo Motta <[email protected]> wrote:
>>>>> I wanted to note that I recently faced the issue described in this
>>>>> ticket in a real cluster. I'm not familiar enough with this area to know
>>>>> whether there are any negative implications of this patch.
>>>>>
>>>>> So even if it's not a correctness issue per se, if it fixes a practical
>>>>> issue faced by users without negative consequences, I don't see why it
>>>>> should not be accepted, especially since it has been validated in
>>>>> production.
>>>>>
>>>>> On Mon, 15 Dec 2025 at 07:28 Alex Petrov <[email protected]> wrote:
>>>>>> IIRC I reviewed it and mentioned this is not a correctness issue, since
>>>>>> we would simply re-prepare. I can't recall why we needed to evict, but I
>>>>>> think that was for correctness reasons.
>>>>>>
>>>>>> Would you mind elaborating on why simply letting it get re-prepared is
>>>>>> harmful behavior? Or am I missing something, and this has larger
>>>>>> implications?
>>>>>>
>>>>>> To be clear, I am not opposed to this patch; I just want to understand
>>>>>> the implications better.
>>>>>>
>>>>>> On Sun, Dec 14, 2025, at 9:03 PM, Jaydeep Chovatia wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I reported this bug (CASSANDRA-17401
>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-17401>) in 2022, along
>>>>>>> with a fix (PR#3059 <https://github.com/apache/cassandra/pull/3059>)
>>>>>>> and a reproduction (PR#3058
>>>>>>> <https://github.com/apache/cassandra/pull/3058>). I have been running
>>>>>>> this fix internally, and it has worked fine for many years. Now one of
>>>>>>> the Cassandra users is facing the exact same problem; I have told them
>>>>>>> to go with the private fix for now.
>>>>>>> Paulo and Alex have partially reviewed it. Could you (or someone)
>>>>>>> please complete the review so I can land it in the official repo?
>>>>>>>
>>>>>>> Jaydeep
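One more piece of context for readers outside the driver world: clients such as gocql (mentioned above) auto-prepare, so a coordinator-side eviction surfaces to the application as an UNPREPARED response that the driver transparently re-prepares and retries. A rough sketch with made-up types (not any real driver's API):

    // Hypothetical types standing in for a driver; no real API is referenced.
    interface Session {
        Response execute(String queryId, Object... binds);
        void prepare(String queryId);
    }

    final class Response {
        enum Kind { ROWS, UNPREPARED }
        final Kind kind;
        Response(Kind kind) { this.kind = kind; }
    }

    final class AutoReprepare {
        // If evictions are transient (the 17401 case), this loop exits after
        // one extra round-trip. If the statement can never stay prepared (the
        // permanent case described above for 15252), it re-prepares forever.
        static Response execute(Session session, String queryId, Object... binds) {
            while (true) {
                Response r = session.execute(queryId, binds);
                if (r.kind != Response.Kind.UNPREPARED)
                    return r;
                session.prepare(queryId);  // transparent re-prepare, then retry
            }
        }
    }

Under high QPS, that retry loop is exactly what multiplies a single bad eviction into the storm described earlier in the thread.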
