Hi there

There are some really good ideas in this presentation from HBaseCon: 
http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/

Regards,
Cristofer

-----Original Message-----
From: Alex Baranau [mailto:alex.barano...@gmail.com]
Sent: Thursday, July 26, 2012 11:28
To: user@hbase.apache.org
Subject: Re: Hbase Data Model to purge old data.

> reason for this is bulk delete of one day's data within a big table is
> more expensive than dropping a one-day table

Sorry for the obvious question, but have you tried using TTLs instead of
deleting rows explicitly? This should put less load on the cluster, though
you'll still have to run a major compaction, which can be a resource-intensive
process.
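
To make the TTL suggestion concrete, here is a minimal sketch in Python that
converts a retention period in days into a TTL in seconds and builds the
matching HBase shell `alter` statement. The table name `cdr` and the column
family `d` are hypothetical placeholders; the thread does not name them.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(retention_days: int) -> int:
    """Convert a retention period in days to a TTL in seconds."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return retention_days * SECONDS_PER_DAY


def alter_statement(table: str, family: str, retention_days: int) -> str:
    """Build an HBase shell statement that sets a column family TTL.

    'table' and 'family' are placeholders (assumptions for illustration).
    """
    return "alter '%s', {NAME => '%s', TTL => %d}" % (
        table, family, ttl_seconds(retention_days))


if __name__ == "__main__":
    # 90-day retention, as in the example from the original question.
    print(alter_statement("cdr", "d", 90))
```

Note that a TTL only expires the data. The archive-to-tape step from the
original question would still need its own export before the data ages out.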

> In this per-day-separate-table model, the load balancer will never get
> triggered as the current day's table is always in memory, and daughter
> regions will continuously get assigned to the same region server. This
> leads to region server hotspots.

Again, maybe an obvious question: have you tried (or is it possible in your
case) to pre-split the table so that regions are distributed over the cluster
from the start?
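
As a rough illustration of pre-splitting, the sketch below computes evenly
spaced split points. It assumes row keys start with a fixed-width, zero-padded
decimal subscriber id and that ids are spread roughly uniformly; both are
assumptions, not something stated in this thread. The resulting list could be
passed as SPLITS when creating the table in the HBase shell.

```python
def split_points(num_regions: int, key_width: int) -> list:
    """Return num_regions - 1 split keys that cut the decimal key space
    [0, 10**key_width) into equally sized ranges.

    Assumes zero-padded decimal row-key prefixes (hypothetical key layout).
    """
    if num_regions < 2:
        raise ValueError("need at least 2 regions to have a split point")
    key_space = 10 ** key_width
    if num_regions > key_space:
        raise ValueError("more regions than distinct key prefixes")
    return [
        str(i * key_space // num_regions).zfill(key_width)
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    # Four regions over two-digit prefixes gives three boundaries.
    print(split_points(4, 2))
```

If subscriber ids are not uniformly distributed, the split points would need
to come from the actual key distribution instead of an even division.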

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Thu, Jul 26, 2012 at 2:34 AM, Padmanaban <padmanaban.math...@gmail.com> wrote:

> We have the following use case:
>
> Store telecom CDR data on a per-subscriber basis. The data is time-series
> based, every record belongs to a subscriber, and records come in round
> the clock. The expected volume of data would be around 300 million
> records/day.
> This data is to be queried 24/7 by an online system where the filters
> are subscriber id and date range.
>
> Since the volume of data is huge, we have data retention policies to 
> archive old data on a daily basis.
> For example, if retention is set to 90 days, every day an offline
> process would delete data from HBase which is older than 90 days and
> archive it on tape.
>
> The current HBase data model design is as follows:
> Separate table for every day's data with row key as subscriber id. The
> reason for this is that bulk delete of one day's data within a big table
> is more expensive than dropping a one-day table. In this
> per-day-separate-table model, the load balancer will never get
> triggered as the current day's table is always in memory, and daughter
> regions will continuously get assigned to the same region server. This
> leads to region server hotspots.
>
> Please give feedback on whether the per-day-separate-table model is the
> best practice for this use case, considering the data life cycle
> management requirement.
> If yes, how do we solve the side effect of region server hotspots? If no,
> please advise an alternate model.
>
> Thanks in advance,
> Padmanaban M
>
>
>


