I’m honestly not sure. The code has changed since I last worked on it, which was years ago. I suspect the profile mode has entirely supplanted the prior modes, and that these older modes supported the HexStrings generator.
Perhaps somebody else can help answer this question.

> On 13 Dec 2018, at 17:37, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbha...@bloomberg.net> wrote:
>
> Ah ok thanks. This brings up another question: how did the HexStrings generator code path even get called?
>
> When I saw these results, I was using the following test table:
> CREATE TABLE testtable (
>     partition_key text,
>     clustering_column text,
>     value text,
>     PRIMARY KEY (partition_key, clustering_column)
> )
>
> From StressProfile.java, any column of type TEXT should use the Strings generator.
> However, my data looks suspiciously like the HexStrings generator was being used instead.
>
> First, the generated strings included control characters like SUB (\x1A), BEL (\x07), etc. However, the Strings generator code looks like it forces the characters to be in the printing characters range.
> Second, the result I documented previously (that the characters are normally distributed, but the strings are not), matches the implementation of HexStrings.
>
> Do you know why this might be the case?
>
> Thanks,
> -Saleil
>
> From: bened...@apache.org At: 12/12/18 18:09:14 To: Saleil Bhat (BLOOMBERG/ 731 LEX ), dev@cassandra.apache.org
> Subject: Re: cassandra-stress HexStrings generator
>
> Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s been a long time so I cannot remember much for certain).
>
> It should be implemented like the Strings generator. It looks like both HexStrings and HexBytes are incorrect, and have been for a long time.
>
>> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbha...@bloomberg.net> wrote:
>>
>> Hi,
>>
>> I have a question about the behavior of the HexStrings value generator in the cassandra-stress tool, particularly concerning its population/identity distribution.
>>
>> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML profile, the population field in a columnspec “represents the total unique population distribution of that column across rows.”
>>
>> I interpreted this to mean that if I specify some distribution 'F' for a column, then the probability of occurrence for each potential value of that column is given by 'F'.
>>
>> So, for example, if I provided the following columnspec for a text column:
>>     name: fake_column
>>     size: fixed(32)
>>     population: gaussian(1..100)
>> and then generated a large amount of data according to this specification, I would expect there to be 100 distinct values for ‘fake_column’, and that a histogram of the frequency of occurrence of each value would be roughly bell-shaped.
>>
>> However, the current implementation of the HexStrings generator deviates from this expectation. In the current implementation, each CHARACTER in the string is drawn from F, rather than the string as a whole. Therefore, if you plot the histogram of frequency of occurrence for each character, you get a bell-shaped curve, but the distribution of the occurrences of whole strings (the actual columns) is something else.
>>
>> My question is, is this the desired behavior for string columns? Was my expectation/interpretation incorrect? If so, can anyone give some insight as to why strings are designed to behave this way and what the use case is for this behavior?
>>
>> Thanks,
>> -Saleil
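
For anyone following the thread, the difference between the two interpretations of "population" can be illustrated with a small simulation. This is only a sketch and is not the cassandra-stress code: the function names (clamped_gaussian, per_character, per_value, distinct_values), the way a draw is mapped to a hex digit, and the choice of standard deviation are all assumptions made for illustration. per_character models the behavior described above, where every character is an independent draw from F; per_value models the expected behavior, where one draw from F picks the identity of the whole string and that identity seeds the content.

```python
import random
from collections import Counter

HEX = "0123456789abcdef"


def clamped_gaussian(rng, lo, hi):
    """Draw an integer in [lo, hi] from a bell curve centred on the range.

    Assumption: stdev is (hi - lo) / 6; out-of-range draws are retried.
    """
    mean = (lo + hi) / 2
    stdev = (hi - lo) / 6
    while True:
        v = round(rng.gauss(mean, stdev))
        if lo <= v <= hi:
            return v


def per_character(rng, size, lo, hi):
    """Each character is its own draw from the population distribution."""
    return "".join(
        HEX[clamped_gaussian(rng, lo, hi) % len(HEX)] for _ in range(size)
    )


def per_value(rng, size, lo, hi):
    """One draw picks the value's identity; the identity seeds the content."""
    identity = clamped_gaussian(rng, lo, hi)
    content = random.Random(identity)  # same identity -> same string
    return "".join(content.choice(HEX) for _ in range(size))


def distinct_values(generator, n=5000, size=32, lo=1, hi=100, seed=0):
    """Generate n strings and count how often each distinct string occurs."""
    rng = random.Random(seed)
    return Counter(generator(rng, size, lo, hi) for _ in range(n))


if __name__ == "__main__":
    for gen in (per_character, per_value):
        counts = distinct_values(gen)
        print(gen.__name__, "distinct strings:", len(counts))
```

With size fixed(32) and population gaussian(1..100), the per_value variant can never produce more than 100 distinct strings, and their frequencies follow the bell curve. The per_character variant produces far more distinct strings, because the distribution constrains individual characters rather than whole values.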