[ 
https://issues.apache.org/jira/browse/CASSANDRA-11853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297538#comment-15297538
 ] 

T Jake Luciani edited comment on CASSANDRA-11853 at 5/24/16 2:15 AM:
---------------------------------------------------------------------

I'm re-running with multiple settings to see how it changes.

Looking at the code, my main concern is the UniformRateLimiter.
If I understand the code correctly, UniformRateLimiter linearly scales the 
ops/sec from the time the limiter was constructed, so when an operation is 
ready to run it gets its expected start time based on its absolute operation 
number.
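
To make that reading concrete, here is a minimal sketch of the scheduling as I understand it. This is not the code from the branch: the class name, fields and method are hypothetical and only illustrate "expected start = construction time + operation number * interval".

```java
// Hypothetical illustration only; not the UniformRateLimiter from the branch.
final class UniformScheduleSketch
{
    private final long startNanos;    // fixed when the limiter is constructed
    private final long intervalNanos; // time between two intended starts
    private long opIndex = 0;         // absolute operation number

    UniformScheduleSketch(long startNanos, double opsPerSecond)
    {
        this.startNanos = startNanos;
        this.intervalNanos = (long) (1_000_000_000L / opsPerSecond);
    }

    /** Intended start of the next operation: start + opIndex * interval. */
    long nextIntendedStartNanos()
    {
        return startNanos + (opIndex++) * intervalNanos;
    }
}
```

With a schedule like this, every operation's intended start is pinned to the construction time, whenever the operation actually gets to run.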

I see two problems with this:
   * The rate limiter is created at startup and doesn't account for 
warmup/hotspot etc., so once warmed up the ops are already behind schedule.  This explains the 
[initial latency 
spike|http://cstar.datastax.com/graph?command=one_job&stats=022678d8-2123-11e6-bcd7-0256e416528f&metric=99th_latency&operation=3_read&smoothing=1&show_aggregates=true&xmin=0&xmax=549.67&ymin=0&ymax=318.01]
 in the run, which skews the overall results.   The limiter start time should 
only be set once the actual measured ops are ready to start.
   * If the rate limit is set too high, such that stress can't keep up with the 
expected rate, the results will make no sense. The actual start time will be 
way after the limiter's calculated start time.
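
A rough sketch of what addressing both points could look like follows. All names are hypothetical and this is only one possible shape, not a proposal for the exact API: the schedule origin is set explicitly once warmup is done, and the lag between intended and actual start is exposed so that a run that cannot keep up can be flagged.

```java
// Hypothetical illustration only; names and structure are assumptions.
final class RestartableScheduleSketch
{
    private final long intervalNanos;
    private long startNanos;
    private long opIndex;

    RestartableScheduleSketch(double opsPerSecond)
    {
        this.intervalNanos = (long) (1_000_000_000L / opsPerSecond);
    }

    /** Call once warmup has finished, right before the measured ops begin. */
    void start(long nowNanos)
    {
        startNanos = nowNanos;
        opIndex = 0;
    }

    long nextIntendedStartNanos()
    {
        return startNanos + (opIndex++) * intervalNanos;
    }

    /** How far the actual start trails the intended start (0 if on time or early). */
    static long lagNanos(long intendedStartNanos, long actualStartNanos)
    {
        return Math.max(0L, actualStartNanos - intendedStartNanos);
    }
}
```

If the lag keeps growing over the run, the requested rate is higher than stress can sustain, and the report could say so instead of printing latencies that make no sense.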
   
It would be very good if we could add some way of detecting local GC pauses, 
like we do in 
[GCInspector|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/GCInspector.java].
Otherwise, users have no way of knowing if the latency is due to local pauses 
or server pauses.
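
As a starting point, something as simple as polling the standard GarbageCollectorMXBean counters once per reporting interval might be enough. The sketch below is an assumption about how that could look (the class and method names are made up, and it does not reuse GCInspector). Note that getCollectionTime() is an approximate accumulated collection time and, depending on the collector, may include concurrent work, so it is an upper-bound hint rather than an exact pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical illustration only; not based on the GCInspector implementation.
final class LocalGcProbeSketch
{
    private long lastTotalMillis = totalGcMillis();

    /** Sum of the accumulated collection time over all collectors in this JVM. */
    static long totalGcMillis()
    {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
        {
            long millis = gc.getCollectionTime(); // -1 if undefined for this collector
            if (millis > 0)
                total += millis;
        }
        return total;
    }

    /** GC time accumulated in the stress JVM since the previous call. */
    long gcMillisSinceLastCall()
    {
        long now = totalGcMillis();
        long delta = now - lastTotalMillis;
        lastTotalMillis = now;
        return delta;
    }
}
```

Printing this value next to each interval's latency numbers would at least let users see when a spike coincides with GC activity in the stress process itself.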

General comments/nits on the branch:
   * The [code style|https://wiki.apache.org/cassandra/CodeStyle] needs to be 
fixed (break on bracket, etc.).
   * HdrHistogram also needs to be added to the maven/pom dependencies in build.xml.
   * Comments on top-level classes like UniformRateLimiter would be helpful for 
future readers.
   



> Improve Cassandra-Stress latency measurement
> --------------------------------------------
>
>                 Key: CASSANDRA-11853
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11853
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Nitsan Wakart
>            Assignee: Nitsan Wakart
>             Fix For: 3.x
>
>
> Currently CS reports latency using a sampling latency container and reports 
> service time (as opposed to response time from the intended schedule), which 
> leads to coordinated omission.
> Fixed here:
> https://github.com/nitsanw/cassandra/tree/co-correction



