I'd stick with the RandomPartitioner until you have a really good reason to change :)
I'd also go with your alternative design, with some possible tweaks.

Consider partitioning the rows by year or some other sensible value. If you will generally be reading the most recent data, this can reduce the need for Cassandra to read SSTables that contain the row key but do not contain any of the required columns.

Depending on how the data is collected, consider storing all the data collected for a certain date in a single column, using something like JSON. This would allow you to have a single column for each observation, which makes it easier to use a SliceRange to get, say, all the observations from 01/05/2011.

If you often want to read certain keys for a single day (or a few days), consider pivoting the data so the key is the date and the columns are the current row keys.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 May 2011, at 19:56, Charles Blaxland wrote:

> Hi All,
>
> New to Cassandra, so apologies if I don't fully grok stuff just yet.
>
> I have data keyed by a key as well as a date. I want to run a query to get
> multiple keys across multiple contiguous date ranges simultaneously. I'm
> currently storing the date along with the row key like this:
>
> key1|2011-05-15 { c1 : , c2 :, c3 : ... }
> key1|2011-05-16 { c1 : , c2 :, c3 : ... }
> key2|2011-05-15 { c1 : , c2 :, c3 : ... }
> key2|2011-05-16 { c1 : , c2 :, c3 : ... }
> ...
>
> I generate all the key/date combinations that I'm interested in and use
> multiget_slice to retrieve them, pulling in all the columns for each key (I
> need all the data, but the number of columns is small: less than 100). The
> total number of row keys retrieved will only be 100 or so.
>
> Now it strikes me I could also store this using composite columns, like this:
>
> key1 { 2011-05-15|c1 : , 2011-5-16|c1 : , 2011-05-15|c2 :, 2011-05-16|c2 : ,
> 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> key2 { 2011-05-15|c1 : , 2011-5-16|c1 : , 2011-05-15|c2 :, 2011-05-16|c2 : ,
> 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> ...
>
> Then use multislice_get again (but with less keys), and use a slice range to
> only retrieve the dates I'm interested in.
>
> Another alternative I guess would be to use OPP with the first storage
> approach and get_range_slices, but as I understand this would not be great
> for performance due to keys being clustered together on a single node?
>
> So my question is, which approach is best? One downside to the latter I guess
> is that the number of columns grows without bound (although with 2 billion to
> play with this isn't gonna be a problem any time soon). Also multiget_slice
> supports only one slice predicate, so I'd guess I'd have to use multiple
> queries to get multiple date ranges.
>
> Anyway, any thoughts/tips appreciated.
>
> Thanks,
> Charles
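
To make the layout suggested in the reply concrete, here is a minimal sketch in plain Python. It is not Cassandra client code: an in-memory dict stands in for the column family, and the names `row_key`, `Row`, `write` and `read` are made up for illustration. It assumes row keys of the form `key|year`, one column per date (named with a zero-padded ISO date so that names sort chronologically), and a JSON blob as the column value. The `slice` method only mimics what a SliceRange with a start and finish column name would return.

```python
import json
from bisect import bisect_left, bisect_right


def row_key(key, date):
    """Bucket rows by year, e.g. ('key1', '2011-05-15') -> 'key1|2011'."""
    return f"{key}|{date[:4]}"


class Row:
    """Hypothetical stand-in for one row: columns kept sorted by name."""

    def __init__(self):
        self.names = []    # sorted column names (ISO dates)
        self.values = {}   # column name -> JSON string

    def insert(self, name, value):
        if name not in self.values:
            self.names.insert(bisect_left(self.names, name), name)
        self.values[name] = value

    def slice(self, start, finish):
        """Return columns with start <= name <= finish, in name order."""
        lo = bisect_left(self.names, start)
        hi = bisect_right(self.names, finish)
        return [(n, self.values[n]) for n in self.names[lo:hi]]


store = {}  # row key -> Row; stands in for the column family


def write(key, date, observation):
    """Store all data for one date as a single JSON-valued column."""
    row = store.setdefault(row_key(key, date), Row())
    row.insert(date, json.dumps(observation, sort_keys=True))


def read(key, start, finish):
    """Read observations for key between two ISO dates, inclusive.

    Only the year buckets that overlap the range are touched.
    """
    results = []
    for year in range(int(start[:4]), int(finish[:4]) + 1):
        row = store.get(f"{key}|{year}")
        if row is not None:
            results.extend(
                (name, json.loads(value))
                for name, value in row.slice(start, finish)
            )
    return results
```

With this shape, a query for a few recent days touches a single year row per key, and a range that crosses a year boundary reads one row per year. Dates written without zero padding (such as 2011-5-16) would not sort correctly as column names, so the sketch assumes they are normalised before writing.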