But then he can't trigger an event if it's a net new row. Methinks he needs to better define the problem he is trying to solve. Also, pin down the number of events: is it a billion an hour or 300K events a second? (OK, it's 277.78K events a second.)
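If triggering on net new rows is actually the requirement, one option is checkAndPut with a null expected value: the Put is applied only when the cell doesn't already exist, and the boolean result tells you whether the row was net new. A minimal sketch against the 0.94-era client API (the table name, family, and qualifier here are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");        // "events" is a made-up table name
    byte[] row = Bytes.toBytes("some-event-uuid");    // made-up row key (the event UUID)
    byte[] cf  = Bytes.toBytes("d");                  // made-up column family
    byte[] q   = Bytes.toBytes("seen");               // made-up qualifier

    Put put = new Put(row);
    put.add(cf, q, Bytes.toBytes(1L));

    // Expected value null = "apply the Put only if this cell does not exist yet".
    // The boolean result therefore tells us whether the row was net new.
    boolean netNew = table.checkAndPut(row, cf, q, null, put);
    if (netNew) {
      // first sighting of this UUID -> safe to emit the event downstream
    }
    table.close();
  }
}

Bear in mind each checkAndPut is still a round trip to the region server holding that row, so at ~278K events a second you are paying per-event RPC cost either way; that is exactly why the volume question matters.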
On Feb 14, 2013, at 10:19 PM, Anoop Sam John <anoo...@huawei.com> wrote:

> When max versions is set to 1 and a duplicate key is added, the last added
> will win, removing the old. Is this what you want, Rahul? From his
> explanation, I think he needs it the reverse way.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mes...@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as the
> unique row key. You can define max versions of 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
>
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
>> Hi,
>> We have events which are delivered into our HDFS cluster and may be
>> duplicated. Each event has a UUID, and we were hoping to leverage HBase
>> to dedupe them. We run a MapReduce job which performs a lookup for each
>> UUID in HBase, emits the event only if the UUID was absent, and also
>> inserts the UUID into the HBase table. (This is simplistic; I am leaving
>> out details that make it more resilient to failures.) My concern is that
>> doing a read+write for every event in MR would be slow (we expect around
>> 1 billion events every hour). Does anyone use HBase for a similar use
>> case, or is there a different approach to achieving the same end result?
>> Any information or comments would be great.
>>
>> Thanks,
>> ~Rahul.
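For reference, Asaf's max-versions-of-1 suggestion amounts to a table definition like the sketch below (old Java admin API; the table and family names are placeholders, and the same thing is a one-liner in the HBase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateDedupeTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("events");  // made-up table name
    HColumnDescriptor cf = new HColumnDescriptor("d");        // made-up family name
    cf.setMaxVersions(1);   // keep only the newest cell per key; older duplicates
                            // are dropped, permanently so at major compaction
    table.addFamily(cf);
    admin.createTable(table);
    admin.close();
  }
}

As noted above, though, this only collapses duplicates inside the table; it does not by itself tell you, at write time, whether a given UUID was new.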