> We need to query data by each column and do pagination, as below,
> 
> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
> select * from book where price  < 992   order by price  descending limit 30;
> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
These queries are not easily supported on standard Cassandra. If you need this 
level of query complexity, consider DataStax Enterprise, Solr, or an RDBMS. 
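
If you stay on plain Cassandra, each of those queries effectively needs its own 
denormalised CF ordered by the query column. A rough, untested sketch of one of 
them in CQL 3 style syntax (the table, bucket, and column names here are just 
illustrative, and a single "all" partition holding 300M rows would have to be 
bucketed in practice):

create table book_by_price (
    bucket  text,    -- constant or hashed bucket to spread the wide row
    price   int,
    book_id text,
    isbn    text,
    title   text,
    primary key (bucket, price, book_id)
) with clustering order by (price desc, book_id asc);

select * from book_by_price where bucket = 'all' and price < 992 limit 30;

The slice comes back in descending price order, which is the pagination you 
described.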

> If we choose the Materialized Views approach, we have to update all
> 20 Materialized View column families for each base row update.
> Will the Cassandra write performance be acceptable?
Yes, depending on the size of the cluster and the machine spec. 

It's often a good idea to design CFs to match the workloads. If some of your data 
changes much faster than the rest, consider splitting it into a separate CF.
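
For example, if price is the fast moving column (an assumption on my part), a 
rough CQL 3 style sketch of that split could be:

-- mostly static attributes, written rarely
create table book_static (
    book_id text primary key,
    isbn    text,
    title   text,
    author  text
);

-- the fast moving attribute in its own CF, so the heavy update
-- traffic does not keep rewriting the static columns
create table book_price (
    book_id text primary key,
    price   int
);

Updates to price then only touch the small CF, and the static data is not 
constantly churned by compaction driven by price changes.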

> Should we just normalize the data, create a base book table with book_id
> as the primary key, and then
> build 20 index column families, using the wide-row column-slicing approach,
> with the indexed column value as the column name and book_id as the value?
You do not need a different CF for each custom secondary index. Try putting the 
name of the index in the row key. 
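
One way to read that, sketched in CQL 3 style syntax (all the names here are 
illustrative, and numeric values would need an order-preserving encoding such as 
zero padding if they share a text column):

create table book_index (
    index_name  text,   -- e.g. 'isbn', 'price', 'col_n1'
    index_value text,   -- the indexed value, encoded so it sorts correctly
    book_id     text,
    primary key (index_name, index_value, book_id)
) with clustering order by (index_value desc, book_id asc);

-- one slice per query, then fetch the 30 books by id from the base CF
select book_id from book_index
where index_name = 'isbn' and index_value < 'XYZ'
limit 30;

Each index is one wide row keyed by its name; with 300 million books per index 
you would want to shard that row key (e.g. index_name plus a bucket number), but 
the idea is the same.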

> What would you recommend?
Take another look at the queries you *need* to support. Then build a small 
proof of concept to see if Cassandra will work for you. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/04/2012, at 6:46 AM, Data Craftsman wrote:

> Howdy,
> 
> Can I ask a data model question here?
> 
> We have a book table with 20 columns and 300 million rows; the average
> row size is 1500 bytes.
> 
> create table book(
> book_id,
> isbn,
> price,
> author,
> title,
> ...
> col_n1,
> col_n2,
> ...
> col_nm
> );
> 
> Data usage:
> 
> We need to query data by each column and do pagination, as below,
> 
> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
> select * from book where price  < 992   order by price  descending limit 30;
> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
> 
> Write: 100 million updates a day.
> Read : 16  million queries a day. 200 queries per second, one query
> returns 30 rows.
> 
> ***
> Materialized Views approach
> 
> {"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
> ...
> We will end up with 20 timelines.
> 
> 
> ***
> Index approach - create 2nd Column Family as Index
> 
> 'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
> 'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
> ...
> 'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'
> 
> This way, we will create 20 index Column Families.
> 
> ---
> 
> If we choose the Materialized Views approach, we have to update all
> 20 Materialized View column families for each base row update.
> Will the Cassandra write performance be acceptable?
> 
> Redis recommends building an index for the query on each column, that
> is your 1st strategy - create a 2nd index CF:
> http://redis.io/topics/data-types-intro
> (see the section [ Pushing IDs instead of the actual data in Redis lists ])
> 
> Should we just normalize the data, create a base book table with book_id
> as the primary key, and then
> build 20 index column families, using the wide-row column-slicing approach,
> with the indexed column value as the column name and book_id as the value?
> This way, we only need to update the affected index column families whose
> column values changed, rather than all 20 Materialized View CFs.
> 
> Another option would be to use Redis to store the master book data, and
> Cassandra Column Families to maintain the secondary indexes.
> 
> What would you recommend?
> 
> Charlie (@mujiang) 一个 木匠
> =======
> Data Architect Developer
> http://mujiang.blogspot.com
> 
> 
> p.s.
> 
> Gist from the DataStax dev blog (
> http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
> "
> If the same event is tracked in multiple timelines,
> it’s okay to denormalize and store all of the event data in each of
> those timelines.
> One of the main principles that Cassandra was built on is that disk
> space is a very cheap resource;
> minimizing disk seeks at the cost of higher space consumption is a
> good tradeoff.
> Unless the data for each event is *very large*, I always prefer this
> strategy over the index strategy.
> "
> 
> Would a 1500 byte row size be considered large or small for Cassandra,
> in your understanding?
