[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108226#comment-14108226
 ] 

Benedict commented on CASSANDRA-7519:
-------------------------------------

bq. What is the point of batchcount? The point of a batch is to group the 
inserts into a single statement for the server, so why would you send multiple 
of these sequentially? Even though it's possible, I can't think of a realistic 
workload that would use it.

The idea was to support benchmarking many inserts into a very wide row. However, 
now that we support the revisit mechanism, this does seem superfluous. There is 
one slight potential reason to keep it: batches currently only support 5000 
statements. Stress automatically splits into batches of this size, but perhaps 
we should error out at the start if it's possible to generate a batch larger 
than this, and keep the option so users can split it up themselves. 
Alternatively, we could drop the option and forbid batches larger than 5000 
statements only when using LOGGED batches. Any of the above seems reasonable to 
me.
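
For illustration, a minimal sketch of those two behaviours (split, or fail fast 
for LOGGED batches); the class, method and constant names here are assumptions 
for the example, not the actual stress code:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split one partition's insert statements into chunks the
// server will accept, or refuse up front when a LOGGED batch would exceed the
// limit (splitting a LOGGED batch would silently break its atomicity).
final class BatchSplitSketch
{
    static final int MAX_BATCH_STATEMENTS = 5000; // assumed limit for the example

    enum BatchType { LOGGED, UNLOGGED }

    static List<List<String>> split(List<String> statements, BatchType type)
    {
        if (type == BatchType.LOGGED && statements.size() > MAX_BATCH_STATEMENTS)
            throw new IllegalArgumentException(
                "LOGGED batch of " + statements.size() + " statements exceeds the "
                + MAX_BATCH_STATEMENTS + " statement limit");

        // UNLOGGED (or small enough) batches are simply chopped into sub-batches
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < statements.size(); i += MAX_BATCH_STATEMENTS)
            batches.add(statements.subList(i, Math.min(i + MAX_BATCH_STATEMENTS, statements.size())));
        return batches;
    }
}
{code}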

bq. I think it would be helpful to output some information on the partition 
sizes and batch sizes for inserts to give people a sense of what their selected 
values will do.

That does sound sensible, yes. I'll add that.

It might also be worthwhile to include an estimate of the size of the data 
we've sent in the main stress output, as with a lot of randomly (esp. 
exponentially) generated data it could vary dramatically, so the current output 
might not be as useful as it first appears.

Separately, I think we should make a minor tweak and base the stderr 
calculation on partition count, not operation count.
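
A minimal sketch of that tweak, assuming the per-interval rates are simply 
collected into an array of partitions-per-second samples (the names are 
illustrative, not the actual stress code):

{code:java}
// Standard error of the mean over per-interval partition rates, rather than
// per-interval operation rates. Purely illustrative.
final class StdErrSketch
{
    static double standardError(double[] partitionRates)
    {
        int n = partitionRates.length;
        if (n < 2)
            return 0; // not enough samples for a meaningful estimate

        double mean = 0;
        for (double r : partitionRates)
            mean += r / n;

        double variance = 0;
        for (double r : partitionRates)
            variance += (r - mean) * (r - mean) / (n - 1);

        return Math.sqrt(variance / n);
    }
}
{code}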

> Further stress improvements to generate more realistic workloads
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-7519
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7519
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: tools
>             Fix For: 2.1.1
>
>
> We generally believe that the most common workload is for reads to 
> exponentially prefer the most recently written data. However, as stress 
> currently behaves, we have two id generation modes: sequential and random 
> (although random can be distributed). I propose introducing a new mode which 
> is somewhat like sequential, except we essentially 'look back' from the 
> current id by some amount defined by a distribution. I may also make the 
> position only increment as it is first written to, so that this mode can be 
> run from a clean slate with a mixed workload. This should allow us to 
> generate workloads that are more representative.
> At the same time, I will introduce a timestamp value generator for primary 
> key columns that is strictly ascending, i.e. has some random component but is 
> based on the actual system time (or some shared monotonically increasing 
> state) so that we can again generate a more realistic workload. This may be 
> challenging to tie in with the new procedurally generated partitions, but I'm 
> sure it can be done without too much difficulty.
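
For reference, a rough sketch of the 'look back' id selection described in the 
quoted description above, assuming an exponentially distributed look-back 
distance and a shared counter that only advances as positions are first 
written; all names are illustrative, not the actual patch:

{code:java}
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: reads pick an id by looking back from the most recently
// written id, so recently written partitions are exponentially preferred.
final class LookBackIdSketch
{
    private final AtomicLong maxWrittenId = new AtomicLong(0);
    private final Random random = new Random();
    private final double meanLookBack;

    LookBackIdSketch(double meanLookBack)
    {
        this.meanLookBack = meanLookBack;
    }

    // Writes advance the maximum position, so the mode works from a clean slate.
    long nextWriteId()
    {
        return maxWrittenId.incrementAndGet();
    }

    // Reads look back from the current maximum by an exponentially distributed
    // distance, clamped so we never go below the first written id.
    long nextReadId()
    {
        long max = Math.max(1, maxWrittenId.get());
        long lookBack = (long) (-meanLookBack * Math.log(1 - random.nextDouble()));
        return max - Math.min(lookBack, max - 1);
    }
}
{code}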



--
This message was sent by Atlassian JIRA
(v6.2#6252)
