I had tried checkAndPut() yesterday with null passed as the expected value, and 
it threw an exception when the row did not exist. Perhaps I was doing something 
wrong. I will try that again since, yes, I would prefer a checkAndPut().
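
For reference, the call I was attempting looked roughly like the sketch below
(written against the 0.94-era HTable client API; the class, family, and
qualifier names are made-up placeholders). My understanding is that passing
null as the expected value should mean "apply the Put only if that cell does
not already exist":

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutIfAbsent {
    // Minimal sketch; the family/qualifier names ("d", "seen") are placeholders.
    public static boolean putIfAbsent(HTable table, byte[] uuidRowKey)
            throws IOException {
        byte[] cf  = Bytes.toBytes("d");
        byte[] col = Bytes.toBytes("seen");
        Put put = new Put(uuidRowKey);
        put.add(cf, col, Bytes.toBytes(1L));
        // null expected value: the Put should be applied only when the cell
        // does not exist yet; returns true if it was applied.
        return table.checkAndPut(uuidRowKey, cf, col, null, put);
    }
}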


________________________________
 From: Michael Segel <michael_se...@hotmail.com>
To: user@hbase.apache.org 
Cc: Rahul Ravindran <rahu...@yahoo.com> 
Sent: Friday, February 15, 2013 4:36 AM
Subject: Re: Using HBase for Deduping
 

On Feb 15, 2013, at 3:07 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:

> Michael, this means read for every write?
> 
Yes and no. 

At the macro level, a read for every write would mean that your client reads a 
record from HBase and then, based on some logic, either writes a record or not. 

So you have a lot of overhead: an initial get() and then a put(). 

At that same macro level, with a checkAndPut() you have less overhead because 
it is a single message (one RPC) to HBase.
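
As a rough sketch of the difference (0.94-style client API; the table handle,
row key, and column names below are placeholders, not anything from your
schema) -- note the read-then-write version is also not atomic:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DedupApproaches {

    // Read-then-write: two round trips, and not atomic -- another client can
    // insert the row between the exists() and the put().
    static boolean getThenPut(HTable table, byte[] row, byte[] cf, byte[] col,
                              byte[] value) throws IOException {
        if (table.exists(new Get(row))) {
            return false;                 // already seen, skip the write
        }
        Put put = new Put(row);
        put.add(cf, col, value);
        table.put(put);
        return true;
    }

    // checkAndPut(): a single RPC; the check and the write are applied
    // atomically on the region server holding the row.
    static boolean checkThenPut(HTable table, byte[] row, byte[] cf, byte[] col,
                                byte[] value) throws IOException {
        Put put = new Put(row);
        put.add(cf, col, value);
        return table.checkAndPut(row, cf, col, null, put);
    }
}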

Internal to HBase, you would still have to check the value in the row, if it 
exists, and then perform the insert or not. 

With respect to your billion events an hour... 

Dividing by 3600 to get the number of events per second, you would have roughly 
278,000, i.e. fewer than 300,000 events a second. 

What exactly are you doing and how large are those events? 

Since you are processing these events in a batch job, timing doesn't appear to 
be that important, and of course there is also asynchbase, which may improve 
some of the performance. 

YMMV, but this is a good use case for checkAndPut().
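
Something along these lines, as a rough sketch against the 0.94 client API --
the table name, column names, and the UUID below are made-up placeholders, and
the "push" is just a stand-in for whatever your pipeline does downstream:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupWithCheckAndPut {

    private static final byte[] CF  = Bytes.toBytes("d");   // placeholder family
    private static final byte[] COL = Bytes.toBytes("ts");  // placeholder qualifier

    // Returns true if this UUID had not been seen before (and is now recorded),
    // false if it is a duplicate.
    static boolean firstTimeSeen(HTable dedupTable, String uuid) throws IOException {
        byte[] row = Bytes.toBytes(uuid);                    // UUID as the row key
        Put put = new Put(row);
        put.add(CF, COL, Bytes.toBytes(System.currentTimeMillis()));
        // null expected value: apply the Put only if the cell is absent.
        return dedupTable.checkAndPut(row, CF, COL, null, put);
    }

    public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "dedup");
        String uuid = "0f8fad5b-d9cb-469f-a165-70867728950e"; // example UUID
        if (firstTimeSeen(table, uuid)) {
            System.out.println("push event " + uuid);        // not a duplicate
        } else {
            System.out.println("duplicate, drop " + uuid);
        }
        table.close();
    }
}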



> On Friday, February 15, 2013, Michael Segel wrote:
> 
>> What constitutes a duplicate?
>> 
>> An oversimplification is to do an HTable.checkAndPut() where you do the
>> put only if the column doesn't exist.
>> Then, if the row is inserted (a TRUE return value), you push the event.
>> 
>> That will do what you want.
>> 
>> At least at first blush.
>> 
>> 
>> 
>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <viral.baja...@gmail.com>
>> wrote:
>> 
>>> Given the size of the data (> 1B rows) and the frequency of the job run
>>> (once per hour), I don't think your most optimal solution is to look up
>>> HBase for every single event. You will benefit more by loading the HBase
>>> table directly in your MR job.
>>> 
>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>> 
>>> Also, once you have done the dedupe, are you going to use the data again in
>>> some other way, i.e. online serving of traffic or some other analysis? Or
>>> is this just to compute some unique #'s?
>>> 
>>> It will be more helpful if you describe your final use case for the
>>> computed data too. Given the amount of back and forth, we can take it
>>> off-list too and summarize the conversation for the list.
>>> 
>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <rahu...@yahoo.com> wrote:
>>> 
>>>> We can't rely on the assumption that event dupes will not occur outside an
>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>> MR job is going to be bad?
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Viral Bajaria <viral.baja...@gmail.com>
>>>> To: Rahul Ravindran <rahu...@yahoo.com>
>>>> Cc: "user@hbase.apache.org" <user@hbase.apache.org>
>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>> Subject: Re: Using HBase for Deduping
>>>> 
>>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>>> lookups. I don't think this is the best solution either, given the # of
>>>> events you will get.
>>>> 
>>>> FWIW, the solution below again relies on the assumption that if an event
>>>> is duped in the same hour, it won't have a dupe outside of that hour
>>>> boundary. If it can, then you are better off running an MR job with the
>>>> current hour + another 3 hours of data, or an MR job with the current
>>>> hour + the HBase table as input to the job too (i.e. no HBase lookups,
>>>> just read the HFiles directly)?
>>>> 
>>>> - Run an MR job which de-dupes events for the current hour, i.e. it only
>>>> runs on 1 hour's worth of data.
>>>> - Mark records which you were not able to de-dupe in the current run.
>>>> - For the records that you were not able to de-dupe, check against HBase
>>>> whether you saw that event in the past. If you did, you can drop the
>>>> current event or update the event to the new value (based on your
>>>> business logic).
>>>> - Save all the de-duped events (via HBase bulk upload).
>>>> 
>>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>>> very tough to come up with a probable solution. So correct my assumptions
>>>> and we could drill down more.
>>>> 
>>>> Thanks,
>>>> Viral
>>>> 
>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <rahu...@yahoo.com>
>>>> wrote:
>>>> 
>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>> 
>>>>> Sent from my phone. Excuse the terseness.
>>>>> 
>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <viral.baja...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>>> they happen over multiple hours?
>>>>>> 
>>>>>> Viral
>>>>>> From: Rahul Ravindran
>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Using HBase for Deduping
>>>>>> Hi,
>>>>>>  We have events which are delivered into our HDFS cluster which may
>>>>>> be duplicated. Each event has a UUID and we were hoping to leverage
>>>> Michael Segel  | (m) 312.755.9623
>> 
>> Segel and Associates
>> 
>> 
>> 
