[ https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564148#comment-15564148 ]
Ben Slater commented on CASSANDRA-12490:
----------------------------------------

Yes, you're right: resetting the counter to zero on setSeed() does result in the same row being generated over and over again (which makes me wonder how stress is respecting the distribution for the PK value, but I didn't investigate that at this point). However, that is easily fixed by having setSeed() set the counter to the supplied seed value. I think once we do this, SEQ behaves very similarly to the other distributions.

I don't think it's correct that stress generates every value if the number of unique values it can generate is <= the number of values it is being asked to generate for a partition. That would only respect the distribution in the uniform case, and even then I don't think n samples from a 1..n distribution are guaranteed to be completely uniform (and thus to cover all values). You probably need many * n samples to get very close to uniform, and it certainly doesn't seem to behave this way in testing. For, say, a normal distribution you'd need several * n samples to cover all the possible values and get close to a normal distribution.

I'm afraid I don't really understand why you think this is abusing the notion of distributions when (a) there was already a sequence distribution type in the "legacy" distribution sets (presumably for just this purpose) and (b) to me, one way of describing this is a uniform distribution with minimal chance of collisions (i.e. it's just another way of selecting values from a range).

Finally, it's not quite correct to say I'm trying to populate all possible values for a column. Rather, I'm trying to generate as many unique values as possible (within the specified ranges) for a given sample size (to minimise overwriting).
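To make the coverage argument concrete, here is a rough Python sketch. It is not the cassandra-stress code: `SeqCounter`, `unique_seq` and `unique_uniform` are hypothetical names invented for illustration, and the model assumes each seed produces exactly one value. It compares how many distinct values come out of n uniform draws from 1..n with a sequence whose set_seed() either resets the counter to zero or sets it to the supplied seed.

```python
import random


class SeqCounter:
    """Hypothetical sequence generator over the inclusive range [lo, hi]."""

    def __init__(self, lo, hi, reset_to_zero=False):
        self.lo = lo
        self.hi = hi
        self.reset_to_zero = reset_to_zero
        self.counter = 0

    def set_seed(self, seed):
        # reset_to_zero=True models the problematic behaviour (counter reset
        # on every seed); otherwise the counter starts at the supplied seed.
        self.counter = 0 if self.reset_to_zero else seed

    def next(self):
        span = self.hi - self.lo + 1
        value = self.lo + self.counter % span  # wraps once the range is used up
        self.counter += 1
        return value


def unique_seq(n, reset_to_zero=False):
    """Distinct values when each of n consecutive seeds yields one value."""
    seq = SeqCounter(1, n, reset_to_zero)
    seen = set()
    for seed in range(n):
        seq.set_seed(seed)
        seen.add(seq.next())
    return len(seen)


def unique_uniform(n, rng):
    """Distinct values in n independent draws from a uniform 1..n range."""
    return len({rng.randint(1, n) for _ in range(n)})


if __name__ == "__main__":
    n = 1000
    print("uniform      :", unique_uniform(n, random.Random(42)))
    print("seq (seeded) :", unique_seq(n))
    print("seq (reset)  :", unique_seq(n, reset_to_zero=True))
```

Under these assumptions the seeded sequence yields n distinct values and the reset variant yields a single value. For the uniform draws, the expected number of distinct values is n * (1 - (1 - 1/n)^n), which is roughly 63% of n for large n, so n samples should not be expected to cover the whole range.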
> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds.
> This ensures generated values don't overlap (unless the sequence wraps),
> providing a more predictable number of inserted records (and generating a base
> set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available.
> I think it would be useful to have this for doing the initial load of data for
> testing.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)