No problem, Alex. I'm also sorry for not pinging you a couple more times, as I assumed this was a corner case only I was seeing. It is now clear that a few other folks in the industry have faced it as well. Please let Runtian or me know if you need any additional information on our end. Thank you!
Jaydeep

On Mon, Dec 15, 2025 at 9:47 AM Alex Petrov <[email protected]> wrote:

> Thank you for explaining. I'll dig through the code to try to remember why we introduced eviction, just to make sure we aren't going to introduce a correctness issue in place of a perf/operational issue (which I am not claiming is the case, btw; I'm just not fully certain yet).
>
> Also, Jaydeep, sorry for dropping the ball on this: I was under the impression this had lost importance and hadn't realized it was pending all that time.
>
> On Mon, Dec 15, 2025, at 6:41 PM, Runtian Liu wrote:
>
> Alex, you're absolutely right that this isn't a correctness issue; the system will eventually re-prepare the statement. The problem, however, shows up in real production environments under high QPS.
>
> When a node is serving a heavy workload, the race condition described in the ticket causes repeated evictions followed by repeated re-prepare attempts. Instead of a single re-prepare, we see a *storm* of re-prepare requests hitting the coordinator. This quickly becomes expensive: it increases CPU usage, adds latency, and in our case escalated into cluster-wide performance degradation. We actually experienced an outage triggered by this behavior.
>
> So while correctness is preserved, the operational impact is severe. Preventing the unnecessary eviction avoids the re-prepare storm entirely, which is why we believe this patch is important for stability in real clusters.
>
> On Mon, Dec 15, 2025 at 8:00 AM Paulo Motta <[email protected]> wrote:
>
> I wanted to note that I recently faced the issue described in this ticket in a real cluster. I'm not familiar enough with this area to understand whether there are any negative implications of this patch.
>
> So even if it's not a correctness issue per se, if it fixes a practical issue faced by users without negative consequences, I don't see why it should not be accepted, especially since it has been validated in production.
>
> On Mon, 15 Dec 2025 at 07:28 Alex Petrov <[email protected]> wrote:
>
> IIRC I reviewed it and mentioned this is not a correctness issue since we would simply re-prepare. I can't recall why we needed to evict, but I think this was for correctness reasons.
>
> Would you mind elaborating on why simply letting it get re-prepared is harmful behavior? Or am I missing something and this has larger implications?
>
> To be clear, I am not opposed to this patch; I just want to understand the implications better.
>
> On Sun, Dec 14, 2025, at 9:03 PM, Jaydeep Chovatia wrote:
>
> Hi
>
> I had reported this bug (CASSANDRA-17401 <https://issues.apache.org/jira/browse/CASSANDRA-17401>) in 2022 along with the fix (PR#3059 <https://github.com/apache/cassandra/pull/3059>) and a reproducible test case (PR#3058 <https://github.com/apache/cassandra/pull/3058>). I applied this fix internally, and it has been working fine for many years. Now one of the Cassandra users has been facing the exact same problem, and I have told them to go with the private fix for now.
>
> Paulo and Alex had partially reviewed it; could you (or someone) please complete the review so I can land it in the official repo?
>
> Jaydeep
