I would suggest you watch this video: http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
The Jive guys solved a lot of the problems you're talking about and discuss them in that case study.

On Wed, Oct 3, 2012 at 6:27 AM, Karthikeyan Muthukumarasamy <[email protected]> wrote:

> Hi,
> Our use case is as follows:
> We have time series data continuously flowing into the system, and it has
> to be stored in HBase.
> Subscriber Mobile Number (a.k.a. MSISDN) is the primary identifier based on
> which data is stored and later retrieved.
> There are two sets of parameters that get stored in every record in HBase;
> let's call them group1 and group2. The number of records that would have
> group1 parameters would be approx. 6 per day, and the same for group2
> parameters is approx. 1 per 3 days (their cardinality is different).
>
> Typically, the retention policy for group1 parameters is 3 months and for
> group2 parameters is 1 year. The read pattern is as follows: an online
> query would ask for records matching an MSISDN for a given date range, and
> the system needs to respond with all available data (both from group1 and
> group2) satisfying the MSISDN and date range filters.
>
> Question 1:
> Alternative 1: Create a single table with G1 and G2 as two column families.
> Alternative 2: Create two tables, one for each group.
> Which is the better alternative, and what are the pros and cons?
>
> Question 2:
> To achieve max. distribution during write and reasonable complexity during
> read, we decided on the following row key design:
> <last 3 digits of MSISDN>,<MMDD>,<full MSISDN>
> We will manually pre-split regions for the table based on the <last 3
> digits of MSISDN>,<MMDD> part of the row key.
> So there are 1000 (from 3 digits of MSISDN) * 365 (from MMDD) buckets that
> would translate to as many regions.
> In this case, when retention is configured as < 1 year, the design looks
> optimal.
> When retention is configured > 1 year, one region might store data for more
> than 1 day (Feb 1 of 2012 and also Feb 1 of 2013), which means more data is
> to be handled by HBase during compactions and reads.
> An alternative key design, which does not have the above disadvantage, is:
> <last 3 digits of MSISDN>,<YYYYMMDD>,<full MSISDN>
> This way, in one region, there will be only 1 day's data at any point,
> regardless of retention.
> What are other pros & cons of the two key designs?
>
> Question 3:
> In our use case, delete happens only based on retention policy, where one
> day's full data has to be deleted when the retention period is crossed
> (e.g., if retention is 30 days, on Apr 1 all the data for Mar 1 is deleted).
> What is the most optimal way to implement this retention policy?
> Alternative 1: TTL for the column family is configured and we leave it to
> HBase to delete data during major compaction, but we are not sure of the
> cost of this major compaction happening in all regions at the same time.
> Alternative 2: Through the key design logic mentioned before, if we ensure
> data for one day goes into one set of regions, can we use HBase APIs like
> HFileArchiver to programmatically archive and drop regions?
>
> Thanks & Regards
> MK
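
To make the two row key layouts from Question 2 concrete, here is a minimal Python sketch. It only builds the key strings; it does not talk to HBase. The function names, the sample MSISDN, and the use of a literal comma as the separator are my assumptions for illustration, not something from the original design.

```python
from datetime import date


def row_key_mmdd(msisdn: str, day: date) -> str:
    """Key design 1: <last 3 digits of MSISDN>,<MMDD>,<full MSISDN>."""
    return f"{msisdn[-3:]},{day:%m%d},{msisdn}"


def row_key_yyyymmdd(msisdn: str, day: date) -> str:
    """Key design 2: <last 3 digits of MSISDN>,<YYYYMMDD>,<full MSISDN>."""
    return f"{msisdn[-3:]},{day:%Y%m%d},{msisdn}"


def bucket_prefix(row_key: str) -> str:
    """The part of the key the pre-split is based on (everything before the full MSISDN)."""
    return row_key.rsplit(",", 1)[0]


# Hypothetical subscriber number, used only for this example.
msisdn = "919876543210"
feb_2012 = date(2012, 2, 1)
feb_2013 = date(2013, 2, 1)

print(row_key_mmdd(msisdn, feb_2012))      # 210,0201,919876543210
print(row_key_yyyymmdd(msisdn, feb_2012))  # 210,20120201,919876543210

# With MMDD, the same calendar day of two different years shares one bucket.
same_bucket_mmdd = (
    bucket_prefix(row_key_mmdd(msisdn, feb_2012))
    == bucket_prefix(row_key_mmdd(msisdn, feb_2013))
)

# With YYYYMMDD, each day gets its own bucket regardless of retention.
same_bucket_yyyymmdd = (
    bucket_prefix(row_key_yyyymmdd(msisdn, feb_2012))
    == bucket_prefix(row_key_yyyymmdd(msisdn, feb_2013))
)

# Bucket count quoted in the question for the MMDD design.
mmdd_buckets = 1000 * 365
```

Here `same_bucket_mmdd` is true and `same_bucket_yyyymmdd` is false, which is the Feb 1 2012 / Feb 1 2013 overlap described in the question. `mmdd_buckets` works out to 365,000, so the MMDD design implies that many pre-split regions if every bucket becomes its own region.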
