checkAndPut() does not work when the row does not exist, or am I missing something?
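For reference, here is roughly the kind of call I had in mind (a minimal sketch against the 0.94-style HTable API; the table, column family, and qualifier names below are just placeholders, not anything we actually use). My understanding is that passing null as the expected value makes the check succeed only when the cell is absent, which is what the dedupe would rely on:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeSketch {
    // Returns true if this UUID has not been seen before (the put went through),
    // false if the cell already existed (i.e. the event is a duplicate).
    static boolean putIfAbsent(HTable table, byte[] uuid) throws Exception {
        Put put = new Put(uuid);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
        // null expected value: the check is supposed to pass only when the
        // d:seen cell does not yet exist for this row key.
        return table.checkAndPut(uuid, Bytes.toBytes("d"), Bytes.toBytes("seen"), null, put);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "event_dedupe"); // placeholder table name
        boolean firstTime = putIfAbsent(table, Bytes.toBytes("some-event-uuid"));
        System.out.println(firstTime ? "new event, emit it" : "duplicate, drop it");
        table.close();
    }
}

If that is not how checkAndPut() behaves when the entire row is missing (as opposed to just the cell), that would explain my confusion.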
Sent from my phone. Excuse the terseness.

On Feb 14, 2013, at 5:33 PM, Michael Segel <michael_se...@hotmail.com> wrote:

> What constitutes a duplicate?
>
> An oversimplification is to do an HTable.checkAndPut() where you do the put
> if the column doesn't exist. Then, if the put returns TRUE (the row was
> inserted), you push the event.
>
> That will do what you want.
>
> At least at first blush.
>
> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <viral.baja...@gmail.com> wrote:
>
>> Given the size of the data (> 1B rows) and the frequency of the job run
>> (once per hour), I don't think the most optimal solution is to look up
>> HBase for every single event. You will benefit more from loading the HBase
>> table directly in your MR job.
>>
>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>
>> Also, once you have computed the uniques, are you going to use the data
>> again in some other way, i.e. online serving of traffic or some other
>> analysis? Or is this just to compute some unique #'s?
>>
>> It would also help if you described the final use case of the computed
>> data. Given the amount of back and forth, we could take this off list and
>> then summarize the conversation for the list.
>>
>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <rahu...@yahoo.com> wrote:
>>
>>> We can't rely on the assumption that event dupes will not recur outside an
>>> hour boundary. So, your take is that doing a lookup per event within the
>>> MR job is going to be bad?
>>>
>>> ________________________________
>>> From: Viral Bajaria <viral.baja...@gmail.com>
>>> To: Rahul Ravindran <rahu...@yahoo.com>
>>> Cc: "user@hbase.apache.org" <user@hbase.apache.org>
>>> Sent: Thursday, February 14, 2013 12:48 PM
>>> Subject: Re: Using HBase for Deduping
>>>
>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>> lookups. I don't think this is the best solution either, given the # of
>>> events you will get.
>>>
>>> FWIW, the solution below again relies on the assumption that if an event
>>> is duped within the same hour, it won't have a dupe outside of that hour
>>> boundary. If it can, then you are better off running an MR job over the
>>> current hour plus another 3 hours of data, or an MR job with the current
>>> hour plus the HBase table as input to the job (i.e. no HBase lookups, just
>>> read the HFiles directly).
>>>
>>> - Run an MR job which de-dupes events for the current hour, i.e. it only
>>>   runs on 1 hour's worth of data.
>>> - Mark records which you were not able to de-dupe in the current run.
>>> - For the records that you were not able to de-dupe, check against HBase
>>>   whether you saw that event in the past. If you did, you can drop the
>>>   current event or update the event to the new value (based on your
>>>   business logic).
>>> - Save all the de-duped events (via HBase bulk upload).
>>>
>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>> very tough to come up with a probable solution. So correct my assumptions
>>> and we can drill down more.
>>>
>>> Thanks,
>>> Viral
>>>
>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <rahu...@yahoo.com>
>>> wrote:
>>>
>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>
>>>> Sent from my phone. Excuse the terseness.
>>>>
>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <viral.baja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>> they happen over multiple hours?
>>>>>
>>>>> Viral
>>>>>
>>>>> From: Rahul Ravindran
>>>>> Sent: 2/14/2013 11:41 AM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Using HBase for Deduping
>>>>>
>>>>> Hi,
>>>>> We have events which are delivered into our HDFS cluster and which may
>>>>> be duplicated. Each event has a UUID, and we were hoping to leverage
>>>>> HBase to dedupe them. We run a MapReduce job which performs a lookup
>>>>> for each UUID in HBase and then emits the event only if the UUID was
>>>>> absent, also inserting it into the HBase table (this is simplistic; I
>>>>> am leaving out details that would make it more resilient to failures).
>>>>> My concern is that doing a read+write for every event in MR would be
>>>>> slow (we expect around 1 billion events every hour). Does anyone use
>>>>> HBase for a similar use case, or is there a different approach to
>>>>> achieving the same end result? Any information or comments would be
>>>>> great.
>>>>>
>>>>> Thanks,
>>>>> ~Rahul.
>
> Michael Segel | (m) 312.755.9623
>
> Segel and Associates