Well, maybe it's a lack of sleep, but this is what I found...

checkAndPut

public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException

Atomically checks if a row/family/qualifier value matches the expected value. If it does, it adds the put. If the passed value is null, the check is for the lack of column (ie: non-existence).

Specified by:
    checkAndPut in interface HTableInterface
Parameters:
    row - to check
    family - column family to check
    qualifier - column qualifier to check
    value - the expected value
    put - data to put if check succeeds
Returns:
    true if the new put was executed, false otherwise
Throws:
    IOException - e

Maybe I'm reading it wrong? But hey! What do I know? It's Valentine's Day and I'm spending my evening answering questions sitting in my man cave instead of spending it with my wife. It's no wonder I live in the perpetual dog house! :-P
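To make that concrete, here's a minimal sketch of using it for dedupe (0.94-era client API; the "event_dedupe" table and the column names are placeholders I just made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutDedupe {

    private static final byte[] CF  = Bytes.toBytes("d");
    private static final byte[] COL = Bytes.toBytes("seen");

    // Returns true the first time a UUID shows up, false on a dupe.
    static boolean firstTimeSeen(HTable table, byte[] uuid) throws IOException {
        Put put = new Put(uuid);
        put.add(CF, COL, new byte[0]);
        // A null expected value means "only apply the put if the column
        // does not exist yet"; the check and the put happen atomically
        // on the region server.
        return table.checkAndPut(uuid, CF, COL, null, put);
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "event_dedupe");
        try {
            if (firstTimeSeen(table, Bytes.toBytes("some-event-uuid"))) {
                System.out.println("new event, push it downstream");
            } else {
                System.out.println("duplicate, drop it");
            }
        } finally {
            table.close();
        }
    }
}

And note that the row not existing is exactly the case a null expected value handles: the check succeeds because the column isn't there, the put goes in, and you get TRUE back.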
On Feb 14, 2013, at 7:35 PM, Rahul Ravindran <rahu...@yahoo.com> wrote:

> checkAndPut() does not work when the row does not exist, or am I missing something?
>
> Sent from my phone. Excuse the terseness.
>
> On Feb 14, 2013, at 5:33 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> What constitutes a duplicate?
>>
>> An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist.
>> Then if the row is inserted (TRUE return value), you push the event.
>>
>> That will do what you want.
>>
>> At least at first blush.
>>
>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <viral.baja...@gmail.com> wrote:
>>
>>> Given the size of the data (> 1B rows) and the frequency of the job run (once per hour), I don't think your most optimal solution is to look up HBase for every single event. You will benefit more by loading the HBase table directly in your MR job.
>>>
>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>
>>> Also, once you have done the dedupe, are you going to use the data again in some other way, i.e. online serving of traffic or some other analysis? Or is this just to compute some unique #'s?
>>>
>>> It will be more helpful if you describe your final use case for the computed data too. Given the amount of back and forth, we can take it off list and summarize the conversation for the list.
>>>
>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <rahu...@yahoo.com> wrote:
>>>
>>>> We can't rely on the assumption that event dupes will not occur outside an hour boundary. So, your take is that doing a lookup per event within the MR job is going to be bad?
>>>>
>>>> ________________________________
>>>> From: Viral Bajaria <viral.baja...@gmail.com>
>>>> To: Rahul Ravindran <rahu...@yahoo.com>
>>>> Cc: "user@hbase.apache.org" <user@hbase.apache.org>
>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>> Subject: Re: Using HBase for Deduping
>>>>
>>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase lookups. I don't think this is the best solution either, given the # of events you will get.
>>>>
>>>> FWIW, the solution below again relies on the assumption that if an event is duped in the same hour it won't have a dupe outside of that hour boundary. If it can, then you are better off running an MR job with the current hour + another 3 hours of data, or an MR job with the current hour + the HBase table as input to the job too (i.e. no HBase lookups, just read the HFile directly).
>>>>
>>>> - Run an MR job which de-dupes events for the current hour, i.e. only runs on 1 hour worth of data.
>>>> - Mark records which you were not able to de-dupe in the current run.
>>>> - For the records that you were not able to de-dupe, check against HBase whether you saw that event in the past (sketched below). If you did, you can drop the current event or update it to the new value (based on your business logic).
>>>> - Save all the de-duped events (via HBase bulk upload).
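>>>> Roughly, that lookup step could look like this (untested sketch,
>>>> 0.94-era client API, class and method names made up; the point is to
>>>> batch the Gets rather than pay one RPC per event):
>>>>
>>>> import java.io.IOException;
>>>> import java.util.ArrayList;
>>>> import java.util.List;
>>>> import org.apache.hadoop.hbase.client.Get;
>>>> import org.apache.hadoop.hbase.client.HTable;
>>>> import org.apache.hadoop.hbase.client.Result;
>>>>
>>>> public class BatchLookup {
>>>>     // uuids = events the hourly MR job could not de-dupe on its own
>>>>     static void resolveAgainstHBase(HTable table, List<byte[]> uuids)
>>>>             throws IOException {
>>>>         List<Get> gets = new ArrayList<Get>(uuids.size());
>>>>         for (byte[] uuid : uuids) {
>>>>             gets.add(new Get(uuid));
>>>>         }
>>>>         // One batched round trip instead of uuids.size() RPCs.
>>>>         Result[] results = table.get(gets);
>>>>         for (int i = 0; i < results.length; i++) {
>>>>             if (results[i].isEmpty()) {
>>>>                 // Never seen before: keep it, queue it for bulk upload.
>>>>             } else {
>>>>                 // Seen in a past hour: drop or update per business logic.
>>>>             }
>>>>         }
>>>>     }
>>>> }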
>>>> Sorry if I just rambled along, but without knowing the whole problem it's very tough to come up with a probable solution. So correct my assumptions and we can drill down more.
>>>>
>>>> Thanks,
>>>> Viral
>>>>
>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <rahu...@yahoo.com> wrote:
>>>>
>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>>
>>>>> Sent from my phone. Excuse the terseness.
>>>>>
>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <viral.baja...@gmail.com> wrote:
>>>>>
>>>>>> Are all these dupe events expected to be within the same hour, or can they happen over multiple hours?
>>>>>>
>>>>>> Viral
>>>>>>
>>>>>> From: Rahul Ravindran
>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Using HBase for Deduping
>>>>>>
>>>>>> Hi,
>>>>>> We have events which are delivered into our HDFS cluster which may be duplicated. Each event has a UUID, and we were hoping to leverage HBase to dedupe them. We run a MapReduce job which performs a lookup for each UUID in HBase and then emits the event only if the UUID was absent, and also inserts it into the HBase table (this is simplistic; I am leaving out details that make this more resilient to failures). My concern is that doing a Read+Write for every event in MR would be slow (we expect around 1 billion events every hour). Does anyone use HBase for a similar use case, or is there a different approach to achieving the same end result? Any information or comments would be great.
>>>>>>
>>>>>> Thanks,
>>>>>> ~Rahul.
>>
>> Michael Segel | (m) 312.755.9623
>>
>> Segel and Associates
>>
>

Michael Segel | (m) 312.755.9623

Segel and Associates