[ https://issues.apache.org/jira/browse/CASSANDRA-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048994#comment-14048994 ]
Benedict edited comment on CASSANDRA-6146 at 7/1/14 4:03 PM:
-------------------------------------------------------------

bq. It sounds like writing to an entire partition at once is a step backwards from the original patch, since you can't test writing incrementally to a wide row. all clustering columns are written at once (unless I'm misunderstanding). Previously the population distribution of a column was not within a partition so you could make it very large.

The problem with the prior approach was that you could not control the size of the partitions you created, nor whether you were actually querying any data for the non-insert operations. The only control you had was the size of your population for each field, so the only way to perform incremental inserts to a partition was to constrain your partition key domain to a fraction of the domain of the clustering columns. This did not give you much capacity to control or reason about how much data was being inserted into a given partition, how it was distributed, or, importantly, how many distinct partitions were updated by a single batch statement; it also meant that we would likely benchmark queries that returned (and even operated over) no data, with no way of knowing whether this was correct.

The new approach lets us validate the data we get back, be certain we are operating over data that should exist (so we do real work), and even know how much data we're operating over, so we can report accurate statistics. It also lets us control how many CQL rows we insert into a single partition in one batch. Modifying the current approach to write/generate only a portion of a partition at a time is relatively trivial; we can even support an extra "batch" option that splits an insert for a single partition into multiple distinct batch statements, so we can control very specifically how incrementally the data is written.
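For illustration, the kind of control described above might look something like this in a stress profile (a sketch only: option names follow the in-flight patch and Jake's suggested "population" rename, so the exact keywords may differ):

```yaml
# Hypothetical stress profile sketch; keyspace/table names are made up.
keyspace: stresscql
table: typestest
table_definition: |
  CREATE TABLE typestest (
    key text,
    col1 text,
    val blob,
    PRIMARY KEY (key, col1)
  )
columnspec:
  - name: key
    population: uniform(1..1M)   # unique *seed* population of partition keys,
                                 # not the actual value population
  - name: col1
    cluster: fixed(100)          # ~100 clustering (CQL) rows per partition,
                                 # so partition size is now controllable
  - name: val
    size: gaussian(64..512)      # value sizes drawn from a distribution
```

Because the clustering distribution is defined per partition, the tool knows exactly which rows should exist, which is what makes validation and accurate statistics possible.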
I only left it out to put some kind of cap on the number of changes introduced in this ticket, but I don't mind including it this round.

bq. I'm not sure how I feel about putting the batchsize and batchtype into the yaml. Those feel like command line args to me.

The problem with a command line option is that it applies to all operations; whilst we don't currently support batching for anything other than inserts, it's quite likely we'll want to for, e.g., deletes, and potentially also for queries with IN statements. But I'm not dead set against moving this out onto the command line.

bq. I think we should change the term identity to population as it seems clearer to me for the columnspec. and in the code identityDistribution to populationDistribution

Sure. We should comment that this is a unique seed population, and not the actual population, however.

bq. I'm trying to run with one of the yaml files and getting an error:

Whoops. Obviously I broke something in a final tweak somewhere :/
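If the batch settings do end up staying in the yaml rather than on the command line, the insert-specific section might look roughly like this (a sketch; the option names here are illustrative and not settled by this thread):

```yaml
# Hypothetical per-operation batch settings, scoped to inserts only,
# which is the argument for keeping them in the yaml.
insert:
  partitions: fixed(1)      # distinct partitions updated per batch statement
  batchtype: UNLOGGED       # LOGGED, UNLOGGED or COUNTER
  # the extra "batch"/split option discussed above: write each partition
  # incrementally across several distinct batch statements
  select: fixed(1)/10       # e.g. insert 1/10th of a partition's rows per batch
```

Scoping these under an `insert:` block would leave room to add analogous sections later for deletes or IN-clause reads without overloading a global command line flag.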
> CQL-native stress
> -----------------
>
>                 Key: CASSANDRA-6146
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6146
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: T Jake Luciani
>             Fix For: 2.1.1
>
>         Attachments: 6146-v2.txt, 6146.txt, 6164-v3.txt
>
> The existing CQL "support" in stress is not worth discussing. We need to
> start over, and we might as well kill two birds with one stone and move to
> the native protocol while we're at it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)