When max versions is set to 1 and a duplicate key is added, the last write wins and the older cell is removed (hidden at read time, and physically dropped at major compaction). Is this what you want, Rahul? From his explanation I think he needs the reverse: keep the first event and drop the later duplicates.
-Anoop-
________________________________________
From: Asaf Mesika [asaf.mes...@gmail.com]
Sent: Friday, February 15, 2013 3:56 AM
To: user@hbase.apache.org; Rahul Ravindran
Subject: Re: Using HBase for Deduping

You can load the events into an HBase table which has the event ID as the unique row key. You can set max versions to 1 on the column family, thus letting HBase get rid of the duplicates for you during major compaction.

On Thursday, February 14, 2013, Rahul Ravindran wrote:
> Hi,
> We have events delivered into our HDFS cluster which may be duplicated. Each event has a UUID, and we were hoping to leverage HBase to dedupe them. We run a MapReduce job which performs a lookup for each UUID in HBase and emits the event only if the UUID was absent, also inserting it into the HBase table (this is simplistic; I am leaving out details that make it more resilient to failures). My concern is that doing a read + write for every event in MR would be slow (we expect around 1 billion events every hour). Does anyone use HBase for a similar use case, or is there a different approach to achieving the same end result? Any information or comments would be great.
>
> Thanks,
> ~Rahul.
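[Editor's note] Asaf's setup can be created from the HBase shell roughly as below (a config sketch only; the table name 'events' and family 'd' are placeholders, not from the thread):

```
create 'events', {NAME => 'd', VERSIONS => 1}
# Duplicate puts to the same row key just write a newer version;
# with VERSIONS => 1 the older cell is discarded at major compaction.
put 'events', 'some-uuid', 'd:payload', 'event-bytes'
put 'events', 'some-uuid', 'd:payload', 'event-bytes-again'
```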
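[Editor's note] The distinction Anoop raises can be shown with a minimal sketch in plain Python (not HBase code; the event tuples and function names are made up for illustration): max versions = 1 gives last-write-wins semantics, while Rahul's "emit only if the UUID was absent" scheme is first-write-wins.

```python
def dedupe_last_wins(events):
    """Like a column family with max versions = 1: a duplicate UUID
    overwrites the earlier cell, so the last payload survives."""
    table = {}
    for uuid, payload in events:
        table[uuid] = payload  # newer version replaces the older one
    return table

def dedupe_first_wins(events):
    """What Rahul describes: emit an event only if its UUID is absent,
    so the first payload survives and later duplicates are dropped."""
    table, emitted = {}, []
    for uuid, payload in events:
        if uuid not in table:  # the read-before-write check
            table[uuid] = payload
            emitted.append((uuid, payload))
    return table, emitted

events = [("u1", "a"), ("u2", "b"), ("u1", "c")]
print(dedupe_last_wins(events))      # u1 keeps "c" (last wins)
print(dedupe_first_wins(events)[0])  # u1 keeps "a" (first wins)
```

Both end up with one row per UUID, but they keep different copies of the duplicated event; whether that matters depends on whether duplicate payloads can differ.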