[ https://issues.apache.org/jira/browse/CASSANDRA-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024539#comment-15024539 ]

Jonathan Shook commented on CASSANDRA-10742:
--------------------------------------------

[~krummas],

Some notes on test setup, and some observations from data models we've seen. We 
can try to get some additional details from willing users if this doesn't get 
us close enough.

The baseline test I use is high-ingest, read-most-recent, with some cold reads 
mixed in. The idea is to simulate the typical access patterns of time-series 
telemetry with roll-up processing, plus the occasional historic query or 
reprocessing of old data. I use a 90/10/1 ratio of write/recent-read/cold-read 
as a starting point. I usually back the ingest rate off from a saturating load 
in order to find a stable steady-state reference point. This is still a much 
higher per-node load than you would often have in a production scenario, but 
it provides good contrast for trade-offs such as compaction load. In 
production you will often be accumulating data over a longer period of time, 
so ingest rates that approach the saturating load are closer to stress tests 
than to real-world workloads. They are still good tests, though: if you can 
run a node at 10x to 1000x the data rates you would expect in production, 
then 1) you can complete the test in a reasonable amount of time and 2) 
you're not too worried about the margin of error.
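For concreteness, the mix above could be sketched as a cassandra-stress user 
profile. Everything here (keyspace, table, column, and query names, and the 
run parameters) is illustrative, not an exact profile from a real run:

```yaml
# Hypothetical cassandra-stress profile sketch for the 90/10/1 mix;
# all names here are assumptions for illustration.
keyspace: telemetry
table: readings
table_definition: |
  CREATE TABLE telemetry.readings (
      datasource text, timebucket text, parametername text,
      ts timestamp, value blob,
      PRIMARY KEY ((datasource, timebucket), parametername, ts)
  ) WITH compaction = {'class': 'DateTieredCompactionStrategy'}
queries:
  read-recent:   # read the most recent rows from a current partition
    cql: SELECT * FROM telemetry.readings
         WHERE datasource = ? AND timebucket = ? LIMIT 100
    fields: samerow
  read-cold:     # occasional historic query against an old time bucket
    cql: SELECT * FROM telemetry.readings
         WHERE datasource = ? AND timebucket = ?
    fields: samerow

# Then run user mode with the 90/10/1 write/recent-read/cold-read ratio,
# backing the rate off from saturation, along the lines of:
#   cassandra-stress user profile=telemetry.yaml \
#     "ops(insert=90,read-recent=10,read-cold=1)" duration=48h
```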

The data model I use is essentially ((datasource, timebucket), parametername, 
timestamp) -> value, although future testing will likely drop the timebucket 
component, relying instead on the time-based layout of sstables as a 
simplification. (This still needs supporting data from tests.) parametername 
is just a variable name associated with a type of measurement. It is selected 
from a fixed set, as is often the case in the wild. The value can vary in 
type and size according to the type of data logging; I use a range from 1k to 
5k, depending on the type of test. In the simplest cases a value is an int or 
float, but it can also be a log line from a stack trace.
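In CQL terms, the model above would look roughly like the following (table 
and column names are illustrative):

```cql
-- Sketch of the ((datasource, timebucket), parametername, timestamp) -> value
-- model described above; names and types are assumptions.
CREATE TABLE telemetry.readings (
    datasource    text,
    timebucket    text,       -- may be dropped in future tests, relying on
                              -- time-based sstable layout instead
    parametername text,       -- selected from a fixed set of measurement names
    ts            timestamp,
    value         blob,       -- 1k-5k payload: int/float, or e.g. a log line
    PRIMARY KEY ((datasource, timebucket), parametername, ts)
) WITH compaction = {'class': 'DateTieredCompactionStrategy'};
```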

The write/read-most-recent/read-cold model can cover lots of ground in terms 
of time-series workloads. The ratios can be varied, and the number of 
partitions per node should be varied in conjunction with the number of 
parameters per partition. In some cases in the wild, time-series partitions 
hold a single series. In other cases, they can hold hundreds of related 
series, keyed by parameter name as a clustering column. In some cases, the 
parameters associated with a data source are distributed across partitions to 
support loading the cluster asynchronously, for responsive reads over 
significant amounts of data. To cover this, simply move the partition-key 
parenthesis right by one term in the model above.
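Moving the partition-key parenthesis right by one term would look like this 
in CQL (again with illustrative names), so that one data source's parameters 
spread across the cluster and can be loaded in parallel:

```cql
-- Variant with parametername moved into the partition key, i.e.
-- ((datasource, timebucket, parametername), timestamp) -> value.
-- Names and types are assumptions for illustration.
CREATE TABLE telemetry.readings_by_param (
    datasource    text,
    timebucket    text,
    parametername text,
    ts            timestamp,
    value         blob,
    PRIMARY KEY ((datasource, timebucket, parametername), ts)
) WITH compaction = {'class': 'DateTieredCompactionStrategy'};
```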

If you cover some of the permutations above for op ratios, clustering 
structure, partition grain, and payload size, you'll be covering much of the 
space we see in practice.


> Real world DateTieredCompaction tests
> -------------------------------------
>
>                 Key: CASSANDRA-10742
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10742
>             Project: Cassandra
>          Issue Type: Test
>            Reporter: Marcus Eriksson
>
> So, to be able to actually evaluate DTCS (or TWCS) we need stress profiles 
> that are similar to something that could be found in real production systems.
> We should then run these profiles for _weeks_, and do regular operational 
> tasks on the cluster - like bootstrap, decom, repair etc.
> [~jjirsa] [~jshook] (or anyone): could you describe any write/read patterns 
> you have seen people use with DTCS in production?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
