[ 
https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830554#comment-17830554
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19332 at 3/25/24 4:09 PM:
------------------------------------------------------------------------

Too bad the PR for Dropwizard was not merged and released yet. Do I understand 
it correctly that we do not need Dropwizard's PR merged because we work around 
the issue here, but this change will no longer be necessary once that PR is 
merged and we upgrade the library?


was (Author: smiklosovic):
Too bad the PR for Dropwizard was not merge and released yet. Do I understand 
it correctly that we do not need to have Dropwizard's PR merge because we 
workaround it but this change will become not necessary anymore if PR is merged 
and we upgrade the library?

> Dropwizard Meter causes timeouts when infrequently used
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19332
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0, 5.1
>
>         Attachments: ci_summary_4.1.html, ci_summary_5.0.html, 
> ci_summary_trunk.html, result_details_4.1.tar.gz, result_details_5.0.tar.gz, 
> result_details_trunk.tar.gz
>
>
> Observed instances of timeouts on clusters with long uptime and infrequently 
> used tables, and possibly just infrequently used request paths, such as not 
> issuing CAS operations for large fractions of a year.
> CAS seems to be more severely impacted because it has more metrics in the 
> request path such as latency measurements for prepare, propose, and the read 
> from the underlying table.
> Tracing showed ~600-800 milliseconds for these operations in between the 
> “appending to memtable” and “sending a response” events. Reads had a delay 
> between finishing the construction of the iterator and sending the read 
> response.
> Stack traces dumped every 100 milliseconds using {{sjk}} showed that in 
> prepare and propose a lot of time was being spent in 
> {{Meter.tickIfNecessary}}.
> {code:java}
> Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)
> Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127){code}
> {{tickIfNecessary}} does a linear amount of work proportional to the time 
> since the last time the metric was updated/read/created and this can actually 
> take a measurable amount of time even in a tight loop. On my M2 MBP it was 
> 1.5 milliseconds for a day, ~200 days took ~74 milliseconds. Before it warmed 
> up it was 140 milliseconds.
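> The catch-up cost can be sketched as follows (a simplified model, not the 
> actual Dropwizard code: {{tickIfNecessary}} computes how many whole 5-second 
> intervals have elapsed since the last tick and loops over them, so the work 
> grows linearly with idle time):

```java
public class TickCost {
    // Dropwizard's Meter ticks its EWMAs once per 5-second interval.
    static final long TICK_INTERVAL_SECONDS = 5;

    // Catch-up ticks tickIfNecessary must perform after idleSeconds of
    // inactivity: one per elapsed 5-second interval, i.e. linear in idle time.
    static long requiredTicks(long idleSeconds) {
        return idleSeconds / TICK_INTERVAL_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(requiredTicks(86_400L));       // 1 idle day   -> 17280 ticks
        System.out.println(requiredTicks(200 * 86_400L)); // 200 idle days -> 3456000 ticks
    }
}
```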
> A quick fix is to schedule a task to read all the meters once a day so the 
> catch-up isn’t done in the request path and each tick only has an 
> incremental, bounded amount of work to process.
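> A minimal sketch of that scheduled read, using a hypothetical {{RateMetric}} 
> stand-in for Dropwizard's {{Meter}} (in the real fix the task would read 
> every meter in the registry, since reading any rate runs 
> {{tickIfNecessary}}):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MeterTickTask {
    // Hypothetical stand-in for com.codahale.metrics.Meter: reading any
    // rate triggers tickIfNecessary and catches up on elapsed 5s intervals.
    interface RateMetric {
        double getOneMinuteRate();
    }

    // Read every meter once per period so no meter ever accumulates more
    // than ~one period's worth of 5-second intervals to catch up on.
    static ScheduledExecutorService scheduleReads(List<RateMetric> meters,
                                                  long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> meters.forEach(RateMetric::getOneMinuteRate),
                period, period, unit);
        return scheduler;
    }
}
```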
> Also observed that {{tickIfNecessary}} is not fully thread safe: if the 
> catch-up loop takes longer than 5 seconds to run, multiple threads can end 
> up attempting to run it at once and then concurrently run {{EWMA.tick}}, 
> which probably results in some ticks not being performed.
> This issue is still present in the latest version of {{Metrics}} when using 
> {{EWMA}}, but {{SlidingTimeWindowMovingAverages}} looks like it has a 
> bounded amount of work required to tick. Switching would change how our 
> metrics work, since the two implementations don't have the same behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
