[ https://issues.apache.org/jira/browse/CASSANDRA-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048994#comment-14048994 ]
Benedict edited comment on CASSANDRA-6146 at 7/1/14 4:03 PM:
-------------------------------------------------------------

bq. It sounds like writing to an entire partition at once is a step backwards from the original patch, since you can't test writing incrementally to a wide row. all clustering columns are written at once (unless I'm misunderstanding). Previously the population distribution of a column was not within a partition so you could make it very large.

The problem with the prior approach was that you could not control the size of the partitions you created, nor whether you were actually querying any data for the non-insert operations. The only control you had was the size of your population for each field, so the only way to perform incremental inserts to a partition was to constrain your partition key domain to a fraction of the domain of the clustering columns. This did not give you much capacity to control or reason about how much data was being inserted into a given partition, how it was distributed, or, importantly, how many distinct partitions were updated by a single batch statement; it also meant that we would likely benchmark queries that returned (and even operated over) no data, with no way of knowing whether this was correct.

The new approach lets us validate the data we get back, be certain we are operating over data that should exist (so we do real work), and even know how much data we're operating over, so we can report accurate statistics. It also lets us control how many CQL rows we insert into a single partition in one batch. Modifying the current approach to write/generate only a portion of a partition at a time is relatively trivial; we can even support an extra "batch" option that splits an insert for a single partition into multiple distinct batch statements, so we can control very specifically how incrementally the data is written.
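For illustration, the kind of control described above might look something like this in a stress profile (a sketch only: option names follow the in-flight patch and Jake's suggested "population" rename, so the exact keywords may differ):

```yaml
# Hypothetical stress profile sketch; keyspace/table names are made up.
keyspace: stresscql
table: typestest
table_definition: |
  CREATE TABLE typestest (
    key text,
    col1 text,
    val blob,
    PRIMARY KEY (key, col1)
  )
columnspec:
  - name: key
    population: uniform(1..1M)   # unique *seed* population of partition keys,
                                 # not the actual value population
  - name: col1
    cluster: fixed(100)          # ~100 clustering (CQL) rows per partition,
                                 # so partition size is now controllable
  - name: val
    size: gaussian(64..512)      # value sizes drawn from a distribution
```

Because the clustering distribution is defined per partition, the tool knows exactly which rows should exist, which is what makes validation and accurate statistics possible.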
I only left it out to put some kind of cap on the number of changes introduced in this ticket, but I don't mind including it this round.

bq. I'm not sure how I feel about putting the batchsize and batchtype into the yaml. Those feel like command line args to me.

The problem with a command line option is that it applies to all operations; whilst we don't currently support batching for anything other than inserts, it's quite likely we'll want to for, e.g., deletes, and potentially also for queries with IN statements. But I'm not dead set against moving this out onto the command line.

bq. I think we should change the term identity to population as it seems clearer to me for the columnspec. and in the code identityDistribution to populationDistribution

Sure. We should comment that this is a unique seed population, and not the actual population, however.

bq. I'm trying to run with one of the yaml files and getting an error:

Whoops. Obviously I broke something in a final tweak somewhere :/
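If the batch settings do end up staying in the yaml rather than on the command line, the insert-specific section might look roughly like this (a sketch; the option names here are illustrative and not settled by this thread):

```yaml
# Hypothetical per-operation batch settings, scoped to inserts only,
# which is the argument for keeping them in the yaml.
insert:
  partitions: fixed(1)      # distinct partitions updated per batch statement
  batchtype: UNLOGGED       # LOGGED, UNLOGGED or COUNTER
  # the extra "batch"/split option discussed above: write each partition
  # incrementally across several distinct batch statements
  select: fixed(1)/10       # e.g. insert 1/10th of a partition's rows per batch
```

Scoping these under an `insert:` block would leave room to add analogous sections later for deletes or IN-clause reads without overloading a global command line flag.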
> CQL-native stress
> -----------------
>
>                 Key: CASSANDRA-6146
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6146
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: T Jake Luciani
>             Fix For: 2.1.1
>
>         Attachments: 6146-v2.txt, 6146.txt, 6164-v3.txt
>
> The existing CQL "support" in stress is not worth discussing. We need to
> start over, and we might as well kill two birds with one stone and move to
> the native protocol while we're at it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)