Main issue is resolved. The test I was using to determine normality was too sensitive to discretization, so it was yielding a negative result even though the data looked pretty normal on visual inspection. The tool only ever uses the Strings generator; HexStrings is unused.
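To illustrate the discretization effect, here is a minimal sketch. The thread does not say which normality test was used, so a one-sample Kolmogorov-Smirnov statistic stands in for it, and the parameters (mean 50.5, standard deviation 16.5, chosen to resemble gaussian(1..100)) are assumptions. The same Gaussian draws are tested once as continuous values and once rounded to integers.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(samples, dist):
    """One-sample Kolmogorov-Smirnov statistic of samples against dist."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # Compare the reference CDF with the empirical CDF just after
        # and just before x; this also handles tied values correctly.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


rng = random.Random(42)
n = 100_000
# Assumed reference distribution, roughly what gaussian(1..100) suggests.
dist = NormalDist(mu=50.5, sigma=16.5)

continuous = [rng.gauss(dist.mean, dist.stdev) for _ in range(n)]
discretized = [round(x) for x in continuous]  # same draws, integer-valued

# Asymptotic 5% critical value for a fully specified reference distribution.
critical = 1.36 / math.sqrt(n)
d_cont = ks_statistic(continuous, dist)
d_disc = ks_statistic(discretized, dist)

print(f"critical value : {critical:.4f}")
print(f"continuous  D  : {d_cont:.4f}")
print(f"discretized D  : {d_disc:.4f}")
```

Rounding turns the empirical CDF into a staircase, so with a large sample the statistic for the rounded data exceeds the critical value even though a histogram of the same data still looks bell-shaped.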
The only (minor) concern is that the Strings generator generates some control characters as part of the generated string. I presume that this behavior is undesired and that the characters should be restricted to ASCII printing characters.

Thanks,
-Saleil

From: bened...@apache.org At: 12/13/18 17:10:17
To: Saleil Bhat (BLOOMBERG/ 731 LEX ), dev@cassandra.apache.org
Subject: Re: cassandra-stress HexStrings generator

I’m honestly not sure. The code has changed since I last worked on it, which was years ago. I suspect the profile mode has entirely supplanted the prior modes, and that these older modes supported the HexStrings generator. Perhaps somebody else can help answer this question.

> On 13 Dec 2018, at 17:37, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbha...@bloomberg.net> wrote:
>
> Ah ok thanks. This brings up another question: how did the HexStrings generator code path even get called?
>
> When I saw these results, I was using the following test table:
> CREATE TABLE testtable (
>     partition_key text,
>     clustering_column text,
>     value text,
>     PRIMARY KEY (partition_key, clustering_column)
> )
>
> From StressProfile.java, any column of type TEXT should use the Strings generator.
> However, my data looks suspiciously like the HexStrings generator was being used instead.
>
> First, the generated strings included control characters like SUB (\x1A), BEL (\x07), etc. However, the Strings generator code looks like it forces the characters to be in the printing characters range.
> Second, the result I documented previously (that the characters are normally distributed, but the strings are not), matches the implementation of HexStrings.
>
> Do you know why this might be the case?
>
> Thanks,
> -Saleil
>
> From: bened...@apache.org At: 12/12/18 18:09:14
> To: Saleil Bhat (BLOOMBERG/ 731 LEX ), dev@cassandra.apache.org
> Subject: Re: cassandra-stress HexStrings generator
>
> Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s been a long time so I cannot remember much for certain).
>
> It should be implemented like the Strings generator. It looks like both HexStrings and HexBytes are incorrect, and have been for a long time.
>
>> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbha...@bloomberg.net> wrote:
>>
>> Hi,
>>
>> I have a question about the behavior of the HexStrings value generator in the cassandra-stress tool, particularly concerning its population/identity distribution.
>>
>> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML profile, the population field in a columnspec “represents the total unique population distribution of that column across rows.”
>>
>> I interpreted this to mean that if I specify some distribution 'F' for a column, then the probability of occurrence for each potential value of that column is given by 'F'.
>>
>> So, for example, if I provided the following columnspec for a text column:
>>     name: fake_column
>>     size: fixed(32)
>>     population: gaussian(1..100)
>> and then generated a large amount of data according to this specification, I would expect there to be 100 distinct values for ‘fake_column’, and that a histogram of the frequency of occurrence of each value would be roughly bell-shaped.
>>
>> However, the current implementation of the HexStrings generator deviates from this expectation. In the current implementation, each CHARACTER in the string is drawn from F, rather than the string as a whole.
>> Therefore, if you plot the histogram of frequency of occurrence for each character, you get a bell-shaped curve, but the distribution of the occurrences of whole strings (the actual columns) is something else.
>>
>> My question is, is this the desired behavior for string columns? Was my expectation/interpretation incorrect? If so, can anyone give some insight as to why strings are designed to behave this way and what the use case is for this behavior?
>>
>> Thanks,
>> -Saleil
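
The difference between drawing whole strings from F and drawing each character from F can be sketched with a toy model. This is not the cassandra-stress code: the helper names (gaussian_int, whole_string, per_character), the redraw-until-in-range Gaussian, and the hex alphabet are all assumptions made for illustration.

```python
import random
from collections import Counter

HEX = "0123456789abcdef"


def gaussian_int(rng, lo, hi):
    """Rounded Gaussian draw, redrawn until it lands in lo..hi (toy stand-in)."""
    mean, stdev = (lo + hi) / 2, (hi - lo) / 6
    while True:
        v = round(rng.gauss(mean, stdev))
        if lo <= v <= hi:
            return v


def whole_string(rng, size=32):
    """Expected behavior: draw one seed from F, derive the whole string from it."""
    seed = gaussian_int(rng, 1, 100)
    chars = random.Random(seed)  # same seed always yields the same string
    return "".join(chars.choice(HEX) for _ in range(size))


def per_character(rng, size=32):
    """Behavior described in the thread: every character is its own draw from F."""
    return "".join(HEX[gaussian_int(rng, 0, 15)] for _ in range(size))


rng = random.Random(0)
n = 20_000
whole_counts = Counter(whole_string(rng) for _ in range(n))
char_counts = Counter(per_character(rng) for _ in range(n))

print("distinct strings, whole-string draws :", len(whole_counts))
print("distinct strings, per-character draws:", len(char_counts))
print("most frequent whole-string value seen:", whole_counts.most_common(1)[0][1], "times")
```

In this model the whole-string variant can never produce more than 100 distinct values, and their frequencies follow the bell shape of the seed distribution. The per-character variant gives a bell-shaped character histogram, but repeated strings are vanishingly unlikely, so the population of whole values is nothing like gaussian(1..100).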