[ https://issues.apache.org/jira/browse/CASSANDRA-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan McGuire reassigned CASSANDRA-9870:
---------------------------------------

    Assignee: Shawn Kumar

> Improve cassandra-stress graphing
> ---------------------------------
>
>                 Key: CASSANDRA-9870
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9870
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Benedict
>            Assignee: Shawn Kumar
>         Attachments: reads.svg
>
>
> CASSANDRA-7918 introduces graph output from a stress run, but these graphs are a little limited. Attached to the ticket is an example of some improved graphs which can serve as the *basis* for some improvements, which I will briefly describe. They should not be taken as the exact end goal, but we should aim for at least their functionality, preferably with some Javascript advantages thrown in, such as the hiding of datasets/graphs for clarity. Any ideas for improvements are *definitely* encouraged.
> Some overarching design principles:
> * Display _on *one* screen_ all of the information necessary to get a good idea of how two or more branches compare to each other. Ideally we will reintroduce this, painting multiple graphs onto one screen, stretched to fit.
> * Axes must be truncated to only the interesting dimensions, to ensure there is no wasted space.
> * Each graph displaying multiple kinds of data should use colour _and shape_ to help easily distinguish the different datasets.
> * Each graph should be tailored to the data it is representing, and we should have multiple views of each dataset.
> The data can roughly be partitioned into three kinds:
> * throughput
> * latency
> * gc
> These can each be viewed in different ways:
> * as a continuous plot of:
> ** raw data
> ** data scaled/compared to a "base" branch, or other metric
> ** cumulative totals
> * as box plots
> ** ideally, these will plot the median, outer quartiles, outer deciles and absolute limits of the distribution, so the shape of the data can be best understood
> Each view compresses the information differently, losing different information, so that collectively they help us understand the data.
> Some basic rules for presentation that work well:
> * Latency should be plotted on a logarithmic scale, to avoid high latencies drowning out low ones
> * GC information should be plotted cumulatively, to avoid differing throughputs giving the impression of worse GC. It should also have a line that is rescaled by the amount of work (number of operations) completed
> * Throughput should be plotted as the actual numbers
> To walk the graphs top-left to bottom-right, we have:
> * Spot throughput comparison of branches to the baseline branch, as an improvement ratio (which can of course be negative, but is not in this example)
> * Raw throughput of all branches (no baseline)
> * Raw throughput as a box plot
> * Latency percentiles, compared to baseline. The percentage improvement at any point in time vs baseline is calculated, then multiplied by the overall median for the entire run. This simply permits the non-baseline branches to scatter their wins/losses around a relatively clustered line for each percentile. It is probably the most "dishonest" graph, but comparing something like latency, where each data point can have very high variance, is difficult, and this gives you an idea of how the improvements/losses cluster.
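As a rough illustration of the box-plot summary described above (median, outer quartiles, outer deciles, absolute limits of the distribution), here is a minimal sketch in Python; the function name and the nearest-rank percentile rule are my own assumptions, not part of the ticket:

```python
import statistics

def box_plot_stats(samples):
    """Summarise one interval's samples into the layers the box plot would
    draw: absolute limits, outer deciles, outer quartiles, and the median.
    Uses a simple nearest-rank percentile; a real implementation might
    interpolate instead."""
    xs = sorted(samples)
    n = len(xs)
    pct = lambda p: xs[min(n - 1, int(p * n))]
    return {
        "min": xs[0],
        "p10": pct(0.10),
        "p25": pct(0.25),
        "median": statistics.median(xs),
        "p75": pct(0.75),
        "p90": pct(0.90),
        "max": xs[-1],
    }

# e.g. throughput samples (ops/s) from one stress run (made-up numbers)
stats = box_plot_stats([4120, 4380, 4460, 4500, 4510, 4550, 4600, 4920])
print(stats["median"], stats["min"], stats["max"])  # -> 4505.0 4120 4920
```

The same summary works for latency and GC intervals; only the axis scaling (logarithmic for latency, cumulative for GC) differs per the rules above.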
> * Latency percentiles, raw, each with a different shape; the lowest percentiles are plotted as a solid line, as they vary least, with higher percentiles each getting their own subtly different shape to scatter.
> * Latency box plots
> * GC time, plotted cumulatively and also scaled by work done
> * GC Mb, plotted cumulatively and also scaled by work done
> * GC time, raw
> * GC time as a box plot
> These do mostly introduce the concept of a "baseline" branch. Ideally, this baseline would be selected by a dropdown, so the javascript can transform the output dynamically. This would permit more interesting comparisons to be made on the fly.
> There are also some complexities, such as deciding which datapoints to compare against baseline when times get out-of-whack (due to GC, etc., causing a lack of output for a period). The version I uploaded does a merge of the times, permitting a small degree of variance and ignoring those datapoints we cannot pair. One option here might be to change stress' behaviour to always print on a strict schedule, instead of trying to get absolutely accurate apportionment of timings. If this makes things much simpler, it can be done.
> As previously stated, but possibly lost in the wall of text: these should be taken as a starting point / signpost rather than a golden rule for the end goal. But ideally they will be the lower bound of what we can deliver.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
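The "merge of the times, permitting a small degree of variance" could look something like the following sketch; the function name, tolerance value, and sample data are hypothetical, not taken from the uploaded version:

```python
def pair_datapoints(base, other, tolerance=0.5):
    """Walk two (timestamp, value) series in time order, pairing points
    whose timestamps fall within `tolerance` seconds of each other and
    dropping points that cannot be paired (e.g. gaps where a GC pause
    suppressed output for a period)."""
    pairs = []
    i = j = 0
    while i < len(base) and j < len(other):
        tb, vb = base[i]
        to, vo = other[j]
        if abs(tb - to) <= tolerance:
            pairs.append((tb, vb, vo))
            i += 1
            j += 1
        elif tb < to:
            i += 1  # baseline point with no partner: ignore it
        else:
            j += 1  # comparison point with no partner: ignore it
    return pairs

# made-up (seconds, ops/s) samples; the second series has a gap around t=1
base  = [(0.0, 4500), (1.0, 4520), (2.1, 4480), (4.0, 4510)]
other = [(0.1, 4600), (1.9, 4650), (4.2, 4700)]
print(pair_datapoints(base, other))
```

Printing on a strict schedule, as the ticket suggests, would make this merge trivial: the timestamps would align exactly and no tolerance or dropping would be needed.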