[ 
https://issues.apache.org/jira/browse/CASSANDRA-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McGuire reassigned CASSANDRA-9870:
---------------------------------------

    Assignee: Shawn Kumar

> Improve cassandra-stress graphing
> ---------------------------------
>
>                 Key: CASSANDRA-9870
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9870
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Benedict
>            Assignee: Shawn Kumar
>         Attachments: reads.svg
>
>
> CASSANDRA-7918 introduces graph output from a stress run, but these graphs 
> are a little limited. Attached to the ticket is an example of some improved 
> graphs which can serve as the *basis* for some improvements, which I will 
> briefly describe. They should not be taken as the exact end goal, but we 
> should aim for at least their functionality, preferably with some JavaScript 
> niceties thrown in, such as the ability to hide datasets/graphs for clarity. 
> Any ideas for improvements are *definitely* encouraged.
> Some overarching design principles:
> * Display _on *one* screen_ all of the information necessary to get a good 
> idea of how two or more branches compare to each other. Ideally we will 
> achieve this by painting multiple graphs onto one screen, stretched to fit.
> * Axes must be truncated to only the interesting dimensions, to ensure there 
> is no wasted space.
> * Each graph displaying multiple kinds of data should use colour _and shape_ 
> to help easily distinguish the different datasets.
> * Each graph should be tailored to the data it is representing, and we should 
> have multiple views of each kind of data.
> The data can roughly be partitioned into three kinds:
> * throughput
> * latency
> * gc
> These can each be viewed in different ways:
> * as a continuous plot of:
> ** raw data
> ** scaled/compared to a "base" branch, or other metric
> ** cumulatively
> * as box plots
> ** ideally, these will plot median, outer quartiles, outer deciles and 
> absolute limits of the distribution, so the shape of the data can be best 
> understood
> Each view compresses the information differently, losing different details, 
> so that collectively they help us understand the data.
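> As a sketch of the box-plot summary described above (the percentile choices 
> and function names here are my own illustration, not from the ticket):

```python
# Sketch: summarise a latency sample into the five-band box plot described
# above -- median, outer quartiles, outer deciles, and absolute limits.
# Uses a simple nearest-rank percentile; the real tooling may differ.
import statistics

def box_summary(samples):
    s = sorted(samples)
    n = len(s)
    def pct(p):
        # nearest-rank percentile, deliberately simple
        return s[min(n - 1, max(0, int(round(p * (n - 1)))))]
    return {
        "min": s[0],
        "p10": pct(0.10),
        "p25": pct(0.25),
        "median": statistics.median(s),
        "p75": pct(0.75),
        "p90": pct(0.90),
        "max": s[-1],
    }

print(box_summary(list(range(1, 11))))
```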
> Some basic rules for presentation that work well:
> * Latency information should be plotted to a logarithmic scale, to avoid high 
> latencies drowning out low ones
> * GC information should be plotted cumulatively, to avoid differing 
> throughputs giving the impression of worse GC. It should also have a line 
> that is rescaled by the amount of work (number of operations) completed
> * Throughput should be plotted as the actual numbers
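> The cumulative-plus-rescaled GC rule above might look like this in outline 
> (assuming per-interval samples of operations completed and GC time; the 
> function and shapes are my own sketch):

```python
# Sketch: build the cumulative GC series, plus the same series rescaled by
# work done, so a slower branch does not falsely appear to have worse GC.
from itertools import accumulate

def gc_series(ops, gc_ms):
    cum_ops = list(accumulate(ops))
    cum_gc = list(accumulate(gc_ms))
    # GC time per operation completed so far (the work-scaled line)
    per_op = [g / o if o else 0.0 for g, o in zip(cum_gc, cum_ops)]
    return cum_gc, per_op

cum, scaled = gc_series([100, 200, 200], [5, 5, 20])
print(cum)     # cumulative GC time
print(scaled)  # cumulative GC ms per op
```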
> To walk the graphs top-left to bottom-right, we have:
> * Spot throughput comparison of branches to the baseline branch, as an 
> improvement ratio (which can of course be negative, but is not in this 
> example)
> * Raw throughput of all branches (no baseline)
> * Raw throughput as a box plot
> * Latency percentiles, compared to baseline. The percentage improvement at 
> any point in time vs baseline is calculated, and then multiplied by the 
> overall median for the entire run. This simply permits the non-baseline 
> branches to scatter their wins/loss around a relatively clustered line for 
> each percentile. It's probably the most "dishonest" graph but comparing 
> something like latency where each data point can have very high variance is 
> difficult, and this gives you an idea of clustering of improvements/losses.
> * Latency percentiles, raw, each with a different shape; lowest percentiles 
> plotted as a solid line as they vary least, with higher percentiles each 
> getting their own subtly different shape to scatter.
> * Latency box plots
> * GC time, plotted cumulatively and also scaled by work done
> * GC MB, plotted cumulatively and also scaled by work done
> * GC time, raw
> * GC time as a box plot
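> One plausible reading of the percentile-comparison transform described above, 
> as a sketch (the exact formula is my assumption, not confirmed by the 
> ticket): compute each paired sample's latency relative to baseline, then 
> rescale by the run-wide median so wins/losses scatter around one clustered 
> line.

```python
# Sketch of the baseline-scaled latency transform: ratio > 1 means the
# branch was slower than baseline at that point in time.
import statistics

def baseline_scaled(branch, baseline):
    overall_median = statistics.median(branch)
    return [overall_median * (b / base) for b, base in zip(branch, baseline)]

print(baseline_scaled([10, 12, 8], [10, 8, 10]))
```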
> Most of these graphs introduce the concept of a "baseline" branch. Ideally, 
> this baseline would be selected via a dropdown, so the JavaScript can 
> transform the output dynamically. This would permit more interesting 
> comparisons to be made on the fly.
> There are also some complexities, such as deciding which datapoints to 
> compare against baseline when times get out-of-whack (due to GC, etc, 
> causing a lack of output for a period). The version I uploaded does a merge 
> of the times, permitting a small degree of variance, and ignoring those 
> datapoints we cannot pair. One option here might be to change stress's 
> behaviour to always print on a strict schedule, rather than trying to 
> apportion timings with absolute accuracy. If that makes things much simpler, 
> it can be done.
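> The datapoint-pairing just described can be sketched as a tolerance-based 
> merge of two timestamped series (the tolerance value and function shape are 
> my own illustration):

```python
# Sketch: pair samples from two series whose timestamps fall within a small
# tolerance, dropping any sample we cannot pair with the other series.
def pair_by_time(a, b, tolerance=0.5):
    """a, b: time-sorted lists of (time, value). Returns (va, vb) pairs."""
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ta, va = a[i]
        tb, vb = b[j]
        if abs(ta - tb) <= tolerance:
            pairs.append((va, vb))
            i += 1
            j += 1
        elif ta < tb:
            i += 1  # no partner within tolerance; ignore this sample
        else:
            j += 1
    return pairs

print(pair_by_time([(0, 1), (1.0, 2), (2.6, 3)],
                   [(0.1, 10), (1.4, 20), (2.0, 30)]))
```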
> As previously stated (though it may be lost in this wall of text), these 
> graphs should be taken as a starting point / signpost rather than a golden 
> rule for the end goal. Ideally, though, they will be the lower bound of what 
> we deliver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
