Re: Routing and region deletes

Per Steffensen Thu, 08 Dec 2011 06:42:57 -0800

Thanks for your reply!

Michel Segel skrev:

Per Steffensen,


I would urge you to step away from the keyboard and rethink your design.

Will do :-) But I would actually still like to receive answers to my questions - just pretend that my ideas are not so stupid and let me know if it can be done.

It sounds like you want to replicate a date partition model similar to what you 
would do if you were attempting this with HBase.

HBase is not a relational database and you have a different way of doing things.

I know

You could put the date/time stamp in the key such that your data is sorted by 
date.

But I guess that would not guarantee that records with timestamps from a specific day or month all exist in the same set of regions and that records with timestamps from other days or months all exist outside those regions, so that I can delete records from that day or month just by deleting the regions.
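
A minimal sketch of the idea being discussed, in plain Python rather than against the HBase API: if row keys start with a sortable date and the region boundaries sit exactly on day boundaries, every record from one day falls into one key range. The key layout, the `row_key` and `region_for` helpers and the split points are all assumptions made for illustration. The sketch does not answer whether HBase keeps such boundaries intact once regions split on their own.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def row_key(ts: datetime, record_id: str) -> bytes:
    """Hypothetical key layout: sortable UTC timestamp, then a record id."""
    return f"{ts:%Y%m%d%H%M%S}-{record_id}".encode("ascii")


def region_for(key: bytes, split_keys: list) -> int:
    """Index of the key range that holds `key`, given sorted split points.

    Range 0 covers everything before the first split key; range i covers
    [split_keys[i - 1], split_keys[i]).
    """
    return bisect_right(split_keys, key)


# Assumed split points: one per day boundary.
split_keys = [b"20111208", b"20111209", b"20111210"]

keys = [
    row_key(datetime(2011, 12, 8, 6, 42, 57, tzinfo=timezone.utc), "a"),
    row_key(datetime(2011, 12, 8, 23, 59, 59, tzinfo=timezone.utc), "b"),
    row_key(datetime(2011, 12, 9, 0, 0, 0, tzinfo=timezone.utc), "c"),
]

# Both records from 8 December land in range 1, the 9 December record in range 2.
regions = [region_for(k, split_keys) for k in keys]
```

Because the keys sort chronologically, all writes for the current day go to the last populated range, which is the hot-spot concern raised in this thread.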

However, this would cause hot spots.  Think about how you access the data. It 
sounds like you access the more recent data more frequently than historical 
data.

Not necessarily wrt reading, but certainly I (almost) only write new records with timestamps from the current day/month.

  This is a bad idea in HBase.
(note: it may still make sense to do this ... You have to think more about the 
data and consider alternatives.)

I personally would hash the key for even distribution, again depending on the 
data access pattern.  (hashed data means you can't do range queries but again, 
it depends on what you are doing...)
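
A rough sketch of what hashing the key could look like, again in plain Python and not tied to any HBase call. The bucket count, the `salted_key` helper and the `|` separator are assumptions chosen for illustration. It shows the trade-off mentioned above: keys that are consecutive in time get spread over all prefixes, so they no longer sit next to each other for a range scan.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 16  # assumed number of key prefixes, chosen for illustration


def salted_key(natural_key: str) -> bytes:
    """Prefix the key with a bucket derived from a hash of the key itself."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{natural_key}".encode("utf-8")


# 10,000 natural keys that all belong to the same day and arrive in order.
natural = [f"20111208-{i:06d}" for i in range(10_000)]
salted = [salted_key(k) for k in natural]

# Count how many keys ended up under each two-character bucket prefix.
per_bucket = Counter(k[:2] for k in salted)
```

With this layout a single day's records are scattered across every bucket, so deleting a day by dropping a key range would not work either; reads by exact key stay cheap because the prefix can be recomputed from the natural key.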

You also have to think about how you purge the data. You don't just drop a 
region.

I know that this is not the "default" way of deleting data, but is it possible? I believe a region is basically just a folder with a set of files, and deleting those would be a matter of a few ms. So if I can route all records with timestamps from a certain day or month to a designated set of regions, deleting all those records will be a matter of deleting #regions-in-that-set folders on disk - very quick. The alternative is to do 50 million+ single delete operations every day (or 1.5 billion operations every month), and that will not even free up space immediately, since the records will actually just be marked deleted (in a new file) - space will not be freed before the next compaction of the involved regions (see e.g. http://outerthought.org/blog/465-ot.html).

 Doing a full table scan once a month to delete may not be a bad thing.

But I don't believe one full table scan will be enough. For that to be possible, I would at least have to be able to provide HBase with all 1.5 billion records to delete in one "delete" call - that's probably not possible :-)

 Again it depends on what you are doing...

Just my opinion. Others will have their own... Now I'm stepping away from the 
keyboard to get my morning coffee...

Enjoy. Then I will consider leaving work (it's late afternoon in Europe)

:-)


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 8, 2011, at 7:13 AM, Per Steffensen <st...@designware.dk> wrote:

Hi

The system we are going to work on will receive 50 million+ new data records every 
day. We need to keep a history of 2 years of data (that's 35+ billion 
data records in storage all in all), and that basically means that we also 
need to delete 50 million+ data records every day, or e.g. 1.5 billion every month. 
We plan to store the data records in HBase.
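
For reference, the volumes quoted above can be checked with a few lines of arithmetic. The 365-day year, the 30-day month and the even spread of writes over the day are simplifying assumptions, not figures from the message.

```python
RECORDS_PER_DAY = 50_000_000   # "50 million+" new records per day
RETENTION_DAYS = 2 * 365       # two years of history, ignoring leap days
DAYS_PER_MONTH = 30            # assumption: a 30-day month
SECONDS_PER_DAY = 24 * 60 * 60

# Records held at steady state: 36.5 billion, in line with "35+ billion".
total_records = RECORDS_PER_DAY * RETENTION_DAYS

# Records that expire each month: 1.5 billion.
deletes_per_month = RECORDS_PER_DAY * DAYS_PER_MONTH

# Average write rate if the load were spread evenly over the day.
writes_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY
```

At steady state the same average rate applies to deletes, so a record-by-record purge roughly doubles the number of operations the cluster has to absorb.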

Is it somehow possible to tell HBase to put (route) all data records belonging 
to a specific date or month to a designated set of regions (and route nothing 
else there), so that deleting all data belonging to that day/month is basically 
deleting those regions entirely? And is explicit deletion of entire regions 
possible at all?

The reason I want to do this is that I expect it to be much faster than doing 
explicit record-by-record deletion of 50 million+ records every day.

Regards, Per Steffensen
