Currently there is a limitation that each row must fit in memory (with some not-insignificant overhead), so having lots of columns per row can trigger out-of-memory errors. This limitation should be removed in a future release.
Please see:

- http://wiki.apache.org/cassandra/CassandraLimitations
- https://issues.apache.org/jira/browse/CASSANDRA-16 (notice this is marked as resolved now)

Mason

On Tue, Jul 13, 2010 at 9:38 AM, Kochheiser,Todd W - TOK-DITT-1 <
twkochhei...@bpa.gov> wrote:

> I recently ran across a blog posting with a comment from a Cassandra
> committer that indicated a performance penalty when having a large number
> of columns per row/key. Unfortunately I didn't bookmark the blog posting
> and now I can't find it. Regardless, since our current plan and design is
> to have several thousand columns per row/key, it made me question our
> design and whether it might cause unintended performance consequences. As
> a somewhat concrete example for discussion purposes, which of the
> following scenarios would potentially perform better or worse?
>
> Assume:
>
> - Single ColumnFamily
> - Three-node cluster
> - 10 to 1 read/write ratio (10 reads to every write)
>
> Scenario A:
>
> - 10k rows
> - 5k columns/row
> - Each column ~64kB
> - Hot spot for writes and reads would be a single column in each row
>   (the hot column would change every hour). We would be accessing every
>   row constantly, but in general accessing just a few columns in each.
> - A low volume of reads accessing ~100 columns per row (range queries
>   would work)
> - Access is generally direct (row key / column key)
> - Data growth would be horizontal (adding columns) as opposed to
>   vertical (adding rows)
> - This is our current design
>
> Scenario B:
>
> - 50M rows/keys
> - 1 column/key
> - Each column ~64kB
> - Hot spot for writes and reads would be the single column in 10k rows,
>   but the 10k rows accessed would change every hour.
> - Access would generally be direct (row key / column key)
> - Data growth would be vertical (adding rows 10k at a time) as opposed
>   to horizontal (adding columns)
>
> Scenario C:
>
> - 5k rows/keys
> - 10k columns/row
> - Each column ~64kB
> - Hot spot for writes and reads would be every column in a single row.
>   The row being accessed would change every hour.
> - Access is generally direct (row key / column key)
> - Low volume of queries accessing a single column in many rows
> - Data growth would be by adding rows, each with 10k columns.
>
> In all three scenarios the amount of data is the same, but the access
> pattern is different. From an application coding perspective any of the
> approaches is feasible, although the data is easier to think about in
> Scenario A (i.e., fewer mental gymnastics and fewer composite keys). In
> all of the scenarios there are 10k columns that are constantly accessed
> (read and write).
>
> Some thoughts: Scenario A has the advantage of evenly distributing
> reads/writes across all cluster nodes (I think). Scenario B has the
> potential advantage of having one column per row (I think) but **not**
> necessarily evenly distributing reads/writes across all cluster nodes.
> I'm not serious about Scenario C, but it is an option. Scenario C would
> probably cause one node in the cluster to take the brunt of all
> reads/writes, so I think this design would be a bad idea. And, if having
> lots of columns is a bad idea, then this is even worse than Scenario A.
>
> Regards,
> Todd
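Since the three scenarios reshuffle the same data across different row/column shapes, a quick back-of-the-envelope check (my own sketch, not from the thread) makes the totals and per-row sizes concrete. The per-row figure is the one that matters for the row-must-fit-in-memory limitation: Scenario A's rows are already hundreds of MiB each.

```python
# Sanity check on the three scenarios described above: same total data
# volume, very different per-row sizes. Numbers are taken directly from
# the thread (64 kB per column, interpreted here as 64 KiB).

COL_SIZE = 64 * 1024  # bytes per column

scenarios = {
    "A": {"rows": 10_000, "cols_per_row": 5_000},
    "B": {"rows": 50_000_000, "cols_per_row": 1},
    "C": {"rows": 5_000, "cols_per_row": 10_000},
}

for name, s in scenarios.items():
    row_bytes = s["cols_per_row"] * COL_SIZE
    total_bytes = s["rows"] * row_bytes
    print(f"Scenario {name}: {row_bytes / 1024**2:8.1f} MiB/row, "
          f"{total_bytes / 1024**4:.2f} TiB total")
```

All three land at roughly 3 TiB total, but rows in Scenario A are ~312 MiB and in Scenario C ~625 MiB, which is exactly the regime where the large-row limitation (and CASSANDRA-16) bites; Scenario B keeps each row at a single 64 KiB column.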