[ https://issues.apache.org/jira/browse/CASSANDRA-11853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297538#comment-15297538 ]
T Jake Luciani edited comment on CASSANDRA-11853 at 5/24/16 2:15 AM:
---------------------------------------------------------------------

I'm re-running with multiple settings to see how it changes.

Looking at the code, my main concern is the UniformRateLimiter.

If I understand the code correctly, UniformRateLimiter linearly scales the ops/sec from the time the limiter was constructed, so when an operation is ready to run it gets its expected start time based on its absolute operation number. I see two problems with this:

* The rate limiter is created at startup and doesn't account for warmup/hotspot etc. So once warmed up, the ops are behind. This explains the [initial latency spike|http://cstar.datastax.com/graph?command=one_job&stats=022678d8-2123-11e6-bcd7-0256e416528f&metric=99th_latency&operation=3_read&smoothing=1&show_aggregates=true&xmin=0&xmax=549.67&ymin=0&ymax=318.01] in the run, which skews the overall results. The limiter start time should only be set once the actual measured ops are ready to start.
* If the rate limit is set too high, such that stress can't keep up with the expected rate, the results will make no sense. The actual start time will be way after the limiter's calculated start time.

It would be very good if we could add some way of detecting local GC pauses like we do in [GCInspector|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/GCInspector.java], otherwise users have no way of knowing whether the latency is due to local pauses or server pauses.

General comments/nits on the branch:

* The [code style|https://wiki.apache.org/cassandra/CodeStyle] needs to be fixed (break on bracket, etc.)
* HdrHistogram also needs to be added to the build.xml maven/pom dependencies
* Comments on top-level classes like UniformRateLimiter would be helpful for future readers.
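To make the two rate limiter problems concrete, here is a minimal sketch of a uniform schedule whose start time is set only when the measured ops begin, and which reports how far an op has fallen behind its intended start. This is not the UniformRateLimiter from the branch: the class name UniformScheduleSketch and all of its methods are hypothetical, and it only illustrates the arithmetic described above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the UniformRateLimiter from the co-correction branch. */
final class UniformScheduleSketch
{
    private final long intervalNanos;
    private long startNanos;
    private boolean started;

    UniformScheduleSketch(double opsPerSecond)
    {
        // Fixed gap between the intended starts of consecutive ops.
        this.intervalNanos = (long) (TimeUnit.SECONDS.toNanos(1) / opsPerSecond);
    }

    /** Call after warmup, right before the first measured op, not at construction. */
    void start(long nowNanos)
    {
        startNanos = nowNanos;
        started = true;
    }

    /** Intended start of the opIndex-th measured op (0-based) on the uniform schedule. */
    long intendedStartNanos(long opIndex)
    {
        if (!started)
            throw new IllegalStateException("start() has not been called");
        return startNanos + opIndex * intervalNanos;
    }

    /** How far behind schedule an op actually started; 0 if it was on time or early. */
    long lagNanos(long opIndex, long actualStartNanos)
    {
        return Math.max(0L, actualStartNanos - intendedStartNanos(opIndex));
    }

    /** Service time: measured from the moment the op was actually issued. */
    static long serviceTimeNanos(long actualStartNanos, long endNanos)
    {
        return endNanos - actualStartNanos;
    }

    /** Response time: measured from the intended start, so schedule lag is included. */
    long responseTimeNanos(long opIndex, long endNanos)
    {
        return endNanos - intendedStartNanos(opIndex);
    }
}
```

In a sketch like this, a lag that keeps growing across ops would suggest the requested rate is higher than stress can sustain, which is the case where the reported numbers stop being meaningful.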
> Improve Cassandra-Stress latency measurement
> --------------------------------------------
>
> Key: CASSANDRA-11853
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11853
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Nitsan Wakart
> Assignee: Nitsan Wakart
> Fix For: 3.x
>
>
> Currently CS reports latency using a sampling latency container and reporting
> service time (as opposed to response time from intended schedule) leading to
> coordinated omission.
> Fixed here:
> https://github.com/nitsanw/cassandra/tree/co-correction

--
This message was sent by Atlassian JIRA (v6.3.4#6332)