I know Cassandra is very flexible.

a. Because a super column should not contain a large number of columns, you should not use Design 1. There, every trading day adds another date column to each attribute's super column.
b. You may need a separate ColumnFamily for each query.
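To make the two queries concrete, here is a rough sketch that models Design 2 with plain Python dicts. It is only an in-memory illustration, not Cassandra client code: the names daily_stock_data, range_query and latest_in_range are made up for this example, and the sample values are the ones from the question.

```python
from bisect import bisect_left, bisect_right

# Design 2 as nested dicts: ticker -> date -> {attribute: value}.
# Values are copied from the question; volume is kept as the string given there.
daily_stock_data = {
    "AAPL": {
        "2010-04-13": {"closingPrice": 242, "volume": "10.9m"},
        "2010-04-14": {"closingPrice": 245, "volume": "14.4m"},
    },
}


def range_query(data, tickers, date1, date2):
    """Query 1: all dates between date1 and date2 (inclusive) for each ticker."""
    result = {}
    for ticker in tickers:
        by_date = data.get(ticker, {})
        # ISO 'YYYY-MM-DD' strings sort in chronological order.
        dates = sorted(by_date)
        lo = bisect_left(dates, date1)
        hi = bisect_right(dates, date2)
        result[ticker] = {d: by_date[d] for d in dates[lo:hi]}
    return result


def latest_in_range(data, tickers, date1, date2):
    """Query 2: the most recent available date in the range for each ticker."""
    result = {}
    for ticker, rows in range_query(data, tickers, date1, date2).items():
        if rows:  # tickers with no data in the range are left out
            last = max(rows)
            result[ticker] = (last, rows[last])
    return result
```

With this layout, Query 2 is Query 1 followed by taking the last date per ticker, so both queries rely on the dates being kept in sorted order.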

On Wed, Apr 21, 2010 at 1:17 PM, Steve Lihn <stevel...@gmail.com> wrote:
> Hi,
> I am new to Cassandra. I would like to use Cassandra to store financial
> data (time series). I have a question on the data model design.
>
> The example here is the daily stock data. This would be a column family
> called dailyStockData. The row key is the stock ticker.
> Every day there are attributes like closingPrice, volume, sharesOutstanding,
> etc. that need to be stored. There seem to be two ways to model it:
>
> Design 1: Each attribute is a super column. Therefore each date is a
> column. So we have:
>
> AAPL -> closingPrice -> { '2010-04-13' : 242, '2010-04-14': 245 }
> AAPL -> volume -> { '2010-04-13' : 10.9m, '2010-04-14': 14.4m }
> etc.
>
> Design 2: Each date is a super column. Therefore each attribute is a
> column. So we have:
>
> AAPL -> '2010-04-13' -> { closingPrice -> 242, volume -> 10.9m }
> AAPL -> '2010-04-14' -> { closingPrice -> 245, volume -> 14.4m }
> etc.
>
> The date column / superColumn will need the Order Preserving Partitioner since
> we are going to do a lot of range queries. Examples are:
> Query 1: Give me the data between date1 and date2 for a set of tickers
> (say, the 100 tickers in QQQ).
> Query 2: More often than not, the query is: Give me the data for the max
> available dates (for each ticker) between date1 and date2 in a set of
> tickers.
> (Since not every day is traded, and we only want the most recent data,
> given a range of dates.)
>
> My questions are:
> a. Is there any technical reason to prefer (or must choose) one rather than
> the other between Design 1 and Design 2 ?
> b. Are both queries possible (and comparable in speed) for the chosen
> design ?
>
> Thanks,
> Steve
>

--
Best regards,
JKnight