Ariel Weisberg created CASSANDRA-19332:
------------------------------------------
             Summary: Dropwizard Meter causes timeouts when infrequently used
                 Key: CASSANDRA-19332
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
             Project: Cassandra
          Issue Type: Bug
          Components: Observability/Metrics
            Reporter: Ariel Weisberg


Observed instances of timeouts on clusters with long uptime and infrequently used tables, and possibly just infrequently used request paths, such as not issuing CAS operations for large fractions of a year. CAS seems to be more severely impacted because it has more metrics in the request path, such as latency measurements for prepare, propose, and the read from the underlying table.

Tracing showed ~600-800 milliseconds for these operations between the “appending to memtable” and “sending a response” events. Reads had a similar delay between finishing the construction of the iterator and sending the read response.

Stack traces dumped every 100 milliseconds using {{sjk}} show that in prepare and propose a lot of time was being spent in {{Meter.tickIfNecessary}}.
{code:java}
Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
com.codahale.metrics.Meter.mark(Meter.java:55)
com.codahale.metrics.Meter.mark(Meter.java:46)
com.codahale.metrics.Timer.update(Timer.java:150)
com.codahale.metrics.Timer.update(Timer.java:86)
org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)

Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
com.codahale.metrics.Meter.mark(Meter.java:55)
com.codahale.metrics.Meter.mark(Meter.java:46)
com.codahale.metrics.Timer.update(Timer.java:150)
com.codahale.metrics.Timer.update(Timer.java:86)
org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127)
{code}
{{tickIfNecessary}} does an amount of work linear in the time since the meter was last updated/read/created, and this can take a measurable amount of time even in a tight loop. On my M2 MBP, catching up one day of ticks took 1.5 milliseconds and ~200 days took ~74 milliseconds; before the JIT warmed up it was 140 milliseconds. (A paraphrase of the catch-up loop and a micro-benchmark sketch are included below.)

A quick fix is to schedule a task to read all the meters once a day, so the catch-up isn’t done in the request path and there is a more incremental amount of work to process at a time (see the sketch at the end).

Also observed that {{tickIfNecessary}} is not 100% thread safe: if running the catch-up loop takes longer than 5 seconds, multiple threads can end up attempting to run the loop at once and will then concurrently run {{EWMA.tick}}, which probably results in some ticks not being performed. This issue is still present in the latest version of {{Metrics}} when using {{EWMA}}, but {{SlidingTimeWindowMovingAverages}} looks like it has a bounded amount of work required to tick. Switching would change how our metrics behave since the two don't have the same semantics.
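
For context, the catch-up loop in question looks roughly like this (paraphrased from the Dropwizard Metrics source, not a verbatim copy). The CAS on {{lastTick}} elects one thread to do the catch-up, but nothing stops a second thread from winning the *next* 5-second interval while the first is still looping:
{code:java}
// Paraphrased from com.codahale.metrics.Meter; details approximate.
private static final long TICK_INTERVAL = TimeUnit.SECONDS.toNanos(5);

private void tickIfNecessary() {
    final long oldTick = lastTick.get();
    final long newTick = clock.getTick();
    final long age = newTick - oldTick;
    if (age > TICK_INTERVAL) {
        final long newIntervalStartTick = newTick - age % TICK_INTERVAL;
        if (lastTick.compareAndSet(oldTick, newIntervalStartTick)) {
            // One iteration per elapsed 5-second interval: an idle meter
            // accumulates 17,280 iterations per day, each ticking three EWMAs.
            final long requiredTicks = age / TICK_INTERVAL;
            for (long i = 0; i < requiredTicks; i++) {
                m1Rate.tick();
                m5Rate.tick();
                m15Rate.tick();
            }
            // If this loop itself runs longer than TICK_INTERVAL, another
            // thread's CAS against the updated lastTick can also succeed, so
            // two threads may call EWMA.tick() concurrently and lose ticks.
        }
    }
}
{code}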
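The catch-up cost is easy to reproduce with a standalone harness along these lines ({{IdleMeterCost}} and {{FakeClock}} are made-up names, and this is not the exact harness behind the numbers above):
{code:java}
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Clock;
import com.codahale.metrics.Meter;

public class IdleMeterCost
{
    // A manually advanced Clock so long idle periods can be simulated.
    static class FakeClock extends Clock
    {
        long nanos = 0;

        @Override
        public long getTick()
        {
            return nanos;
        }
    }

    public static void main(String[] args)
    {
        for (long idleDays : new long[] { 1, 200 })
        {
            FakeClock clock = new FakeClock();
            Meter meter = new Meter(clock);
            meter.mark(); // initial mark at t=0
            clock.nanos += TimeUnit.DAYS.toNanos(idleDays);

            long start = System.nanoTime();
            meter.mark(); // pays for idleDays worth of 5-second ticks
            long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println(idleDays + " idle day(s): " + elapsedMillis + " ms");
        }
    }
}
{code}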
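And a minimal sketch of the quick fix, assuming a plain Dropwizard {{MetricRegistry}} ({{PeriodicMeterTicker}} is a hypothetical name; the real patch would hook into Cassandra's own registry and scheduled executors):
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Metered;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricRegistry;

public final class PeriodicMeterTicker
{
    public static ScheduledExecutorService start(MetricRegistry registry)
    {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(() -> {
            for (Metric metric : registry.getMetrics().values())
            {
                // getOneMinuteRate() calls tickIfNecessary() internally, so
                // each Meter and Timer catches up at most one day of ticks
                // here instead of on a request thread.
                if (metric instanceof Metered)
                    ((Metered) metric).getOneMinuteRate();
            }
        }, 1, 1, TimeUnit.DAYS);
        return executor;
    }
}
{code}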