I believe HBase supports a TTL (time-to-live) setting on column families: cells older than the TTL expire, and HBase cleans them up on its own.
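
As a rough sketch of what that could look like in the HBase shell: the table name 'events', the column family 'e', and the two-day TTL below are made up for illustration and are not taken from your setup. TTL is given in seconds and applies per column family.

```
# Hypothetical table and family names; TTL is in seconds (172800 = 2 days).
create 'events', {NAME => 'e', TTL => 172800}

# For an existing table, change the family's TTL instead.
# (Older HBase versions require disabling the table before an alter.)
alter 'events', {NAME => 'e', TTL => 172800}

# Check that the TTL shows up in the column family settings.
describe 'events'
```

One thing to keep in mind: expiry is driven by the cell timestamp, not by whether your job has processed the event, so the TTL would need to be comfortably longer than the interval between job runs. Expired cells stop being returned by scans, and they are physically removed during compactions.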

On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay <chinm...@qualcomm.com> wrote:
> Hello,
> I have a setup where a bunch of clients store 'events' in an HBase table.
> Also, periodically (once a day), I run a MapReduce job that goes over the
> table and computes some reports.
>
> Now my issue is that the next time, I don't want the MapReduce job to process the
> 'events' that it has already processed previously. I know that I can mark
> processed events in the HBase table and the mapper can filter them out
> during the next run. But what I would really like is that previously
> processed events don't even hit the mapper.
>
> One solution I can think of is to back up the HBase table after running the
> job and then clear the table. But this has a lot of problems:
> 1) Clients may have inserted events while the job was running.
> 2) I could disable and drop the table and then create it again, but then the
> clients would complain about this short window of unavailability.
>
>
> What do people using HBase (live) + MapReduce typically do?
>
> Thanks!
> Chinmay
>
>

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/