[ 
https://issues.apache.org/jira/browse/CASSANDRA-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273799#comment-14273799
 ] 

Benedict commented on CASSANDRA-8597:
-------------------------------------

# See CASSANDRA-7980.
# This is a known problem with heavily skewed distributions, and is challenging 
to resolve, largely because we don't know how many values we will generate for 
any lower tier when deciding whether to descend from an upper tier (a tier 
being a clustering column prefix); this would be even worse with 7980. I've 
given this a little thought in the past, but since stress hasn't been 
considered a major priority I have left it on the back burner. One possibility 
is that, instead of generating the number of values per tier, we _could_ 
generate a total number of values for all tiers, then generate a distribution 
for the ratio of adoption for each tier, and each part of the tier. This is 
pretty difficult to conceptualise and implement, though. There are some other 
possibilities, but they don't avoid similar problems. For instance, we could 
visit all of the lower tiers with the defined select chance, but since the 
upper tier may be filtered out with a higher chance than it deserves, these 
rows would be visited with much lower likelihood. TL;DR: this is a complex 
ticket of its own, and requires a mini-research project to improve.
# I'm not sure what's meant here. It's a deterministic workload if you use 
-pop seq=1..N, except for thread interleavings and ancillary chances like 
"select". Do you mean a deterministic non-uniform distribution? Deterministic 
select behaviour?
# With 7980, we can simulate a workload very similar to a time-series one, by 
generating giant partitions with a temporal component and visiting their 
contents in ascending order. _Exactly_ simulating one requires some thought as 
to how to best define, model and deliver it though. The TODO in generator.Dates 
helps, but is probably not the best avenue; permitting expressions for ranges 
based on the partition seed might be a better route. I have idly wondered 
whether, generally, we shouldn't permit some arbitrary JavaScript with a couple 
of predefined inputs to generate values, or the value ranges, since this would 
be the most elegant and general way of supporting this. Again, not trivial 
though.
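
To make the "total first, then ratios" idea from the second point a little 
more concrete, here is a minimal Python sketch. It is not cassandra-stress 
code: the names split_total and allocate are hypothetical, the fan-out of each 
tier is assumed to be fixed, and plain uniform random weights stand in for 
whatever "ratio of adoption" distribution would really be wanted. The only 
point it illustrates is that the row total is decided before descending, so it 
is exact no matter how skewed the split turns out to be.

```python
import random


def split_total(total, fanout, rng):
    """Split `total` rows across `fanout` children using random weights.

    Hypothetical helper: uniform weights are an assumption standing in for
    a real "ratio of adoption" distribution.
    """
    # 1.0 - random() lies in (0, 1], so the weight sum is never zero.
    weights = [1.0 - rng.random() for _ in range(fanout)]
    scale = sum(weights)
    exact = [total * w / scale for w in weights]
    counts = [int(x) for x in exact]
    # Hand the rows lost to rounding down to the children with the largest
    # fractional parts, so the counts add up to exactly `total`.
    leftover = total - sum(counts)
    order = sorted(range(fanout), key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def allocate(total, fanouts, rng):
    """Return per-leaf row counts for nested tiers with a fixed total.

    `fanouts` lists the (assumed fixed) number of children at each tier,
    outermost clustering column first.
    """
    if not fanouts:
        return [total]
    leaves = []
    for share in split_total(total, fanouts[0], rng):
        leaves.extend(allocate(share, fanouts[1:], rng))
    return leaves


if __name__ == "__main__":
    # Three clustering tiers, one million rows in the partition.
    leaves = allocate(1_000_000, [10, 10, 10], random.Random(42))
    print(len(leaves), sum(leaves), min(leaves), max(leaves))
```

Seeding the generator per partition keeps the allocation repeatable. What the 
sketch does not address is how a "select" chance should interact with the 
shares, which is the hard part described above.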

> Stress: make simple things simple
> ---------------------------------
>
>                 Key: CASSANDRA-8597
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8597
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: T Jake Luciani
>             Fix For: 2.1.3
>
>
> Some of the trouble people have with stress is a documentation problem, but 
> some is functional.
> Comments from [~iamaleksey]:
> # 3 clustering columns, make a million cells in a single partition, should be 
> simple, but it's not. have to tweak 'clustering' on the three columns just 
> right to make stress work at all. w/ some values it'd just gets stuck forever 
> computing batches
> # for others, it generates huge, megabyte-size batches, utterly disrespecting 
> 'select' clause in 'insert'
> #  I want a sequential generator too, to be able to predict deterministic 
> result sets. uniform() only gets you so far
> # impossible to simulate a time series workload
> /cc [~jshook] [~aweisberg] [~benedict]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
