Hi, 

I have a question about the behavior of the HexStrings value generator in the 
cassandra-stress tool, particularly concerning its population/identity 
distribution.  


Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML 
profile, the population field in a columnspec “represents the total unique 
population distribution of that column across rows.”


I interpreted this to mean that if I specify some distribution 'F' for a 
column, then the probability of occurrence for each potential value of that 
column is given by 'F'. 

So, for example, if I provided the following columnspec for a text column: 
  name: fake_column
  size: fixed(32)
  population: gaussian(1..100)
and then generated a large amount of data according to this specification, 
I would expect there to be 100 distinct values for ‘fake_column’, and that a 
histogram of the frequency of occurrence of each value would be roughly 
bell-shaped. 



However, the current implementation of the HexStrings generator deviates from 
this expectation. It draws each CHARACTER of the string from F, rather than 
the string as a whole. Therefore, if you plot a histogram of how often each 
character occurs, you get a bell-shaped curve, but the distribution of whole 
strings (the actual column values) does not follow F. 
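
To make the difference concrete, here is a minimal Python sketch of the two 
interpretations. It is not the cassandra-stress code: the helper names, the 
clamped normal used for gaussian(1..100), and the "draw modulo 16" mapping 
from a draw to a hex character are all assumptions made for illustration. 

```python
import random
from collections import Counter

HEX = "0123456789abcdef"

def gaussian_seed(rng, lo=1, hi=100):
    """Draw an integer in [lo, hi] from a normal centred on the range.
    Assumption: stands in for gaussian(1..100); out-of-range draws are retried."""
    mean = (lo + hi) / 2
    stdev = (hi - lo) / 6
    while True:
        v = round(rng.gauss(mean, stdev))
        if lo <= v <= hi:
            return v

def string_from_seed(seed, length=32):
    """Deterministically expand one seed into one hex string."""
    r = random.Random(seed)
    return "".join(r.choice(HEX) for _ in range(length))

def per_string(rng, length=32):
    """Expected behavior: a single draw from F selects the whole string."""
    return string_from_seed(gaussian_seed(rng), length)

def per_character(rng, length=32):
    """Behavior as described above: every character is its own draw from F.
    Assumption: a draw is mapped to a hex digit by taking it modulo 16."""
    return "".join(HEX[gaussian_seed(rng) % 16] for _ in range(length))

def value_counts(gen, n=5000, seed=42):
    """Generate n column values and count how often each one occurs."""
    rng = random.Random(seed)
    return Counter(gen(rng) for _ in range(n))

whole = value_counts(per_string)
chars = value_counts(per_character)

print("distinct values, one draw per string:   ", len(whole))
print("distinct values, one draw per character:", len(chars))
```

With one draw per string there can never be more than 100 distinct values, 
and their frequencies follow the bell curve. With one draw per character, 
almost every generated string is unique, so the population setting no longer 
bounds the number of distinct column values. 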


My question is: is this the intended behavior for string columns? Was my 
expectation/interpretation incorrect? If so, can anyone give some insight into 
why strings are designed to behave this way and what the use case is for this 
behavior? 

Thanks, 
-Saleil 

