I'd stick with the RandomPartitioner until you have a really good reason to 
change :)

I'd also go with your alternative design with some possible tweaks. 

Consider partitioning the rows by year or some other sensible value. If you 
will generally be getting the most recent data this can reduce the need for 
Cassandra to read SSTables that contain the row key, but do not contain any 
required columns. 
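
As a rough sketch of the year bucketing (plain Python, no Cassandra client 
involved; the "key|year" row key layout and the helper names are just my 
assumptions for illustration), you would work out which row keys cover the date 
range you want and only ask for those:

```python
from datetime import date


def bucketed_row_keys(key, start, end):
    """Return one row key per calendar year covering start..end inclusive.

    The "key|year" layout is a hypothetical convention, not a Cassandra feature.
    """
    return [f"{key}|{year}" for year in range(start.year, end.year + 1)]


def column_name(day, field):
    """Build a "date|field" column name; ISO dates sort lexically in date order."""
    return f"{day.isoformat()}|{field}"


# A range that crosses a year boundary needs two rows per key.
keys = bucketed_row_keys("key1", date(2010, 12, 30), date(2011, 1, 2))
name = column_name(date(2011, 5, 15), "c1")
```

A query for recent data then only touches the row for the current year, so 
older SSTables holding previous years' rows should not need to be read for it.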

Depending on how the data is collected, consider storing all the data collected 
for a certain date in a single column using something like JSON. This would 
allow you to have a single column for each observation. This makes it easier to 
use a SliceRange to get, say, all the observations from 01/05/2011.
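
Something like the following shows the idea. It is only a simulation in plain 
Python: the row is a dict, slice_columns is a made-up helper that mimics an 
inclusive slice over sorted column names (empty start or finish meaning 
unbounded), and I'm assuming ISO formatted dates as the column names so they 
sort in date order.

```python
import json
from bisect import bisect_left, bisect_right


def pack_observation(fields):
    """Serialise all values collected for one date into a single column value."""
    return json.dumps(fields, sort_keys=True)


def slice_columns(row, start="", finish=""):
    """Mimic an inclusive column slice; an empty bound means unbounded."""
    names = sorted(row)
    lo = bisect_left(names, start) if start else 0
    hi = bisect_right(names, finish) if finish else len(names)
    return [(name, json.loads(row[name])) for name in names[lo:hi]]


# One column per date, each holding the whole observation as JSON.
row = {
    "2011-04-30": pack_observation({"c1": 1, "c2": 2}),
    "2011-05-01": pack_observation({"c1": 3, "c2": 4}),
    "2011-05-02": pack_observation({"c1": 5, "c2": 6}),
}

# Everything from 1 May 2011 onwards.
recent = slice_columns(row, start="2011-05-01")
```

The trade-off is that you read and write the whole observation at once, which 
sounds fine here since you need all the columns anyway.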

If you often want to read certain keys for a single day (or a few days) 
consider pivoting the data so the key is the date and the columns are the 
current row keys. 
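
To make the pivot concrete, here is a small sketch in plain Python. It assumes 
the "key|date" row keys from your first layout; pivot_by_date is a hypothetical 
helper, not part of any client library.

```python
from collections import defaultdict


def pivot_by_date(rows):
    """Turn {"key|date": value} into {date: {key: value}}."""
    pivoted = defaultdict(dict)
    for row_key, value in rows.items():
        key, day = row_key.split("|", 1)
        pivoted[day][key] = value
    return dict(pivoted)


rows = {
    "key1|2011-05-15": {"c1": 1},
    "key1|2011-05-16": {"c1": 2},
    "key2|2011-05-15": {"c1": 3},
}
by_date = pivot_by_date(rows)
```

With that layout, reading several keys for one day is a single row read with 
the keys you want as the column names.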

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 May 2011, at 19:56, Charles Blaxland wrote:

> Hi All,
> 
> New to Cassandra, so apologies if I don't fully grok stuff just yet.
> 
> I have data keyed by a key as well as a date. I want to run a query to get 
> multiple keys across multiple contiguous date ranges simultaneously. I'm 
> currently storing the date along with the row key like this:
> 
> key1|2011-05-15 {  c1 : , c2 :,  c3 : ... }
> key1|2011-05-16 {  c1 : , c2 :,  c3 : ... }
> key2|2011-05-15 {  c1 : , c2 :,  c3 : ... }
> key2|2011-05-16 {  c1 : , c2 :,  c3 : ... }
> ...
> 
> I generate all the key/date combinations that I'm interested in and use 
> multiget_slice to retrieve them, pulling in all the columns for each key (I 
> need all the data, but the number of columns is small: less than 100). The 
> total number of row keys retrieved will only be 100 or so.
> 
> Now it strikes me I could also store this using composite columns, like this:
> 
> key1 {  2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 :, 2011-05-16|c2 : , 
> 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> key2 {  2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 :, 2011-05-16|c2 : , 
> 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> ...
> 
> Then use multiget_slice again (but with fewer keys), and use a slice range to 
> only retrieve the dates I'm interested in.
> 
> Another alternative I guess would be to use OPP with the first storage 
> approach and get_range_slices, but as I understand this would not be great 
> for performance due to keys being clustered together on a single node?
> 
> So my question is, which approach is best? One downside to the latter I guess 
> is that the number of columns grows without bound (although with 2 billion to 
> play with this isn't gonna be  a problem any time soon). Also multiget_slice 
> supports only one slice predicate, so I'd guess I'd have to use multiple 
> queries to get multiple date ranges.
> 
> Anyway, any thoughts/tips appreciated.
> 
> Thanks,
> Charles
> 
