But then he can't trigger an event if it's a net new row. Methinks he needs to better define the problem he is trying to solve. Also, pin down the number of events: is it a billion an hour or 300K events a second? (OK, it's 277.78K events a second.)
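If triggering on net new rows is actually the requirement, one option is checkAndPut with a null expected value: the Put is applied only when the cell doesn't already exist, and the boolean result tells you whether the row was net new. A minimal sketch against the 0.94-era client API (the table name, family, and qualifier here are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");        // "events" is a made-up table name
    byte[] row = Bytes.toBytes("some-event-uuid");    // made-up row key (the event UUID)
    byte[] cf  = Bytes.toBytes("d");                  // made-up column family
    byte[] q   = Bytes.toBytes("seen");               // made-up qualifier

    Put put = new Put(row);
    put.add(cf, q, Bytes.toBytes(1L));

    // Expected value null = "apply the Put only if this cell does not exist yet".
    // The boolean result therefore tells us whether the row was net new.
    boolean netNew = table.checkAndPut(row, cf, q, null, put);
    if (netNew) {
      // first sighting of this UUID -> safe to emit the event downstream
    }
    table.close();
  }
}

Bear in mind each checkAndPut is still a round trip to the region server holding that row, so at ~278K events a second you are paying per-event RPC cost either way; that is exactly why the volume question matters.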
On Feb 14, 2013, at 10:19 PM, Anoop Sam John <anoo...@huawei.com> wrote:

> When max versions is set to 1 and a duplicate key is added, the last added
> will win, removing the old. Is this what you want, Rahul? From his
> explanation, I think he needs it the reverse way.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mes...@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as the
> unique row key. You can define max versions of 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
>
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
>> Hi,
>> We have events which are delivered into our HDFS cluster and may be
>> duplicated. Each event has a UUID, and we were hoping to leverage HBase
>> to dedupe them. We run a MapReduce job which performs a lookup for each
>> UUID in HBase, emits the event only if the UUID was absent, and also
>> inserts the UUID into the HBase table. (This is simplistic; I am leaving
>> out details that make it more resilient to failures.) My concern is that
>> doing a read+write for every event in MR would be slow (we expect around
>> 1 billion events every hour). Does anyone use HBase for a similar use
>> case, or is there a different approach to achieving the same end result?
>> Any information or comments would be great.
>>
>> Thanks,
>> ~Rahul.
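For reference, Asaf's max-versions-of-1 suggestion amounts to a table definition like the sketch below (old Java admin API; the table and family names are placeholders, and the same thing is a one-liner in the HBase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateDedupeTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("events");  // made-up table name
    HColumnDescriptor cf = new HColumnDescriptor("d");        // made-up family name
    cf.setMaxVersions(1);   // keep only the newest cell per key; older duplicates
                            // are dropped, permanently so at major compaction
    table.addFamily(cf);
    admin.createTable(table);
    admin.close();
  }
}

As noted above, though, this only collapses duplicates inside the table; it does not by itself tell you, at write time, whether a given UUID was new.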