[
https://issues.apache.org/jira/browse/CASSANDRA-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024539#comment-15024539
]
Jonathan Shook commented on CASSANDRA-10742:
[~krummas],
Some notes on test setup, and some observations from data models we've seen. We
can try to get some additional details from willing users if this doesn't get
us close enough.
The baseline test I use is high-ingest, read-most-recent, with some read-cold
mixed in. The idea is to simulate the typical access patterns of time-series
telemetry with roll-up processing, plus the occasional historic query or
reprocessing of old data. I use a 90/10/1 ratio for write/recent-read/cold-read
as a starting point. I usually back off the ingest rate from a saturating load
in order to find a stable steady-state reference point. This is still a much
higher per-node load than you would often see in production, but it provides
good contrast for trade-offs like compaction load. In production you are often
accumulating data over a longer period of time, so ingest rates that approach a
reasonable saturating load are closer to stress tests than to real-world
conditions. As such, they are still good tests: if you can run a node at 10x to
1000x the data rates you would expect in production, then 1) you can complete
the test in a reasonable amount of time and 2) you're not too worried about the
margin of error.
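As a purely illustrative sketch (not how cassandra-stress is actually configured), the 90/10/1 op mix above can be driven by a weighted sampler like this; the op names and the `choose_op` helper are hypothetical:

```python
import random

# Op mix from the comment: 90 writes / 10 recent reads / 1 cold read.
OP_WEIGHTS = {"write": 90, "read_recent": 10, "read_cold": 1}

def choose_op(rng: random.Random) -> str:
    """Pick the next operation according to the 90/10/1 ratio."""
    ops = list(OP_WEIGHTS)
    weights = [OP_WEIGHTS[op] for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]

# Sanity check: over many draws the observed mix approaches 90/10/1.
rng = random.Random(42)
counts = {op: 0 for op in OP_WEIGHTS}
for _ in range(101_000):
    counts[choose_op(rng)] += 1
```

Varying the ratios for other workload shapes is then just a matter of changing the weights.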
The data model I use is essentially ((datasource, timebucket), parametername,
timestamp) -> value, although future testing will likely drop the timebucket
component, relying instead on the time-based layout of sstables as a
simplification. (This still needs supporting data from tests.) parametername is
just a variable name associated with a type of measurement. It is selected from
a fixed set, as is often the case in the wild. The value can vary in type and
size according to the type of data being logged; I use a range from 1k to 5k,
depending on the type of test. In the simplest cases a value is an int or
float, but it can also be a log line from a stack trace.
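A minimal sketch of the timebucket component of that model, assuming an epoch-aligned one-hour bucket width (the comment does not fix one) and hypothetical helper names:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # assumed bucket width; pick to match expected partition size

def time_bucket(ts: datetime, width: int = BUCKET_SECONDS) -> int:
    """Map a timestamp onto its bucket, the second partition-key component."""
    return int(ts.timestamp()) // width

def partition_key(datasource: str, ts: datetime) -> tuple:
    # In ((datasource, timebucket), parametername, timestamp) -> value,
    # the partition key is the (datasource, timebucket) pair, so all writes
    # for a source within one bucket land in the same partition.
    return (datasource, time_bucket(ts))
```

Two timestamps inside the same hour map to the same partition; crossing a bucket boundary starts a new one, which is what bounds partition growth over time.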
The writes/read-most-recent/read-cold model can cover a lot of ground in terms
of time-series workloads. The ratios can be varied, and the number of
partitions per node should be varied in conjunction with the number of
parameters. In some cases in the wild, time-series partitions hold a single
series. In other cases, they can hold hundreds of related series, keyed by name
(as a clustering column). In some cases, the parameters associated with a data
source are distributed across partitions to support loading the cluster
asynchronously for responsive reads over significant amounts of data. To cover
this, simply move the inner parenthesis right by one term in the model above.
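The "move the parenthesis" variant can be sketched as follows; the function names are hypothetical, and the tuples only illustrate which components form the partition key versus the clustering columns:

```python
# Two layouts of the same logical row, per the comment.

def key_multiseries(datasource, timebucket, parametername, ts):
    # ((datasource, timebucket), parametername, timestamp):
    # all parameters for a source share one partition; parametername
    # is a clustering column, so related series sit together.
    return ((datasource, timebucket), (parametername, ts))

def key_distributed(datasource, timebucket, parametername, ts):
    # Parenthesis moved right by one term:
    # ((datasource, timebucket, parametername), timestamp),
    # so each parameter gets its own partition and reads of many
    # parameters fan out across the cluster.
    return ((datasource, timebucket, parametername), (ts,))
```

The first element of each returned pair is the partition key; reads against the distributed layout can be issued asynchronously, one per parameter.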
If you cover some of the permutations above for op ratios, clustering
structure, partition grain, and payload size, you'll be covering much of the
space we see in practice.
> Real world DateTieredCompaction tests
> -------------------------------------
>
> Key: CASSANDRA-10742
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10742
> Project: Cassandra
> Issue Type: Test
> Reporter: Marcus Eriksson
>
> So, to be able to actually evaluate DTCS (or TWCS) we need stress profiles
> that are similar to something that could be found in real production systems.
> We should then run these profiles for _weeks_, and do regular operational
> tasks on the cluster - like bootstrap, decom, repair etc.
> [~jjirsa] [~jshook] (or anyone): could you describe any write/read patterns
> you have seen people use with DTCS in production?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)