[ 
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564148#comment-15564148
 ] 

Ben Slater commented on CASSANDRA-12490:
----------------------------------------

Yes, you're right: resetting the counter to zero on setSeed() does result in 
the same row being generated over and over again (which makes me wonder how 
stress respects the distribution for the PK value, but I didn't investigate 
that at this point). However, that is easily fixed by having setSeed() set 
the counter to the supplied seed value. I think once we do this, SEQ behaves 
very similarly to the other distributions.
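
To make the proposed fix concrete, here is a minimal sketch. It is not the 
code from the attached patch: the class name SeqSketch, its constructor and 
the wrap-around behaviour are assumptions made purely for illustration. The 
only point it shows is that setSeed() assigns the supplied seed to the 
counter instead of resetting it to zero, so different seeds give different 
values.

```java
// Hypothetical illustration only; not the class from 12490-trunk.patch.
final class SeqSketch {
    private final long min;
    private final long size;   // number of values in [min, max]
    private long counter;

    SeqSketch(long min, long max) {
        if (max < min) {
            throw new IllegalArgumentException("max must be >= min");
        }
        this.min = min;
        this.size = max - min + 1;
    }

    // Proposed behaviour: start counting from the seed rather than from zero,
    // so a setSeed() call before each row no longer repeats the same value.
    void setSeed(long seed) {
        counter = seed;
    }

    // Returns the next value in the sequence, wrapping within [min, max].
    long next() {
        long value = min + Math.floorMod(counter, size);
        counter++;
        return value;
    }

    public static void main(String[] args) {
        SeqSketch seq = new SeqSketch(1, 5);
        for (long seed = 0; seed < 3; seed++) {
            seq.setSeed(seed);
            System.out.println("seed " + seed + " -> " + seq.next());
        }
    }
}
```

With the counter reset to zero instead, every setSeed() call followed by 
next() would return the minimum value, which is the repeated-row behaviour 
described above.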

I don't think it's correct that stress generates every value if the number of 
unique values it can generate is <= the number of values it is being asked to 
generate for a partition. That would only respect the distribution in the 
uniform case, and even then I don't think n samples from a 1..n distribution 
are guaranteed to be completely uniform (and thus to cover all values); you 
probably need many times n samples to get very close to uniform. It certainly 
doesn't seem to behave this way in testing. For, say, a normal distribution 
you'd need several times n samples to cover all the possible values and end 
up close to a normal distribution.
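
As a rough check on the coverage argument, the sketch below assumes n 
independent uniform draws over n possible values. That is an assumption 
about idealised sampling, not a description of how stress generates values 
internally. Under that assumption the expected number of distinct values is 
n * (1 - (1 - 1/n)^n), which is about 63% of n for large n. The class and 
method names are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical illustration of how many distinct values n uniform draws cover.
final class CoverageSketch {

    // Expected number of distinct values after n independent uniform draws
    // over n possible values.
    static double expectedDistinct(int n) {
        return n * (1.0 - Math.pow(1.0 - 1.0 / n, n));
    }

    // Draws n values uniformly from 0..n-1 (equivalent to 1..n for counting
    // purposes) and returns how many distinct values were seen.
    static int simulateDistinct(int n, long seed) {
        Random rng = new Random(seed);
        BitSet seen = new BitSet(n);
        for (int i = 0; i < n; i++) {
            seen.set(rng.nextInt(n));
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.printf("expected distinct:  %.1f of %d%n", expectedDistinct(n), n);
        System.out.printf("simulated distinct: %d of %d%n", simulateDistinct(n, 42L), n);
    }
}
```

A sequence over the same range would produce n distinct values from n 
draws, which is the difference in overwriting being discussed here.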

I'm afraid I don't really understand why you think this is abusing the notion 
of distributions when (a) there was already a sequence distribution type in 
the "legacy" distribution sets (presumably for just this purpose) and (b) to 
me, one way of describing this is as a uniform distribution with minimal 
chance of collisions (i.e. it's just another way of selecting values from a 
range).

Finally, it's not quite correct to say I'm trying to populate all possible 
values for a column; rather, I'm trying to generate as many unique values as 
possible (within the specified ranges) for a given sample size (to minimise 
overwriting).

> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. 
> This ensures generated values don't overlap (unless the sequence wraps), 
> providing a more predictable number of inserted records (and generating a 
> base set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. 
> I think it would be useful to have this for doing an initial load of data 
> for testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
