Most will be in the same hour. Some will be across 3-6 hours.

Sent from my phone. Excuse the terseness.
On Feb 14, 2013, at 12:19 PM, Viral Bajaria <viral.baja...@gmail.com> wrote:

> Are all these dupe events expected to be within the same hour, or can
> they happen over multiple hours?
>
> Viral
>
> From: Rahul Ravindran
> Sent: 2/14/2013 11:41 AM
> To: user@hbase.apache.org
> Subject: Using HBase for Deduping
>
> Hi,
> We have events which are delivered into our HDFS cluster which may be
> duplicated. Each event has a UUID, and we were hoping to leverage HBase
> to dedupe them. We run a MapReduce job which performs a lookup for each
> UUID in HBase and emits the event only if the UUID was absent, also
> inserting the UUID into the HBase table (this is simplistic; I am
> leaving out details that make this more resilient to failures). My
> concern is that doing a read+write for every event in MR would be slow
> (we expect around 1 billion events every hour). Does anyone use HBase
> for a similar use case, or is there a different approach to achieving
> the same end result? Any information or comments would be great.
>
> Thanks,
> ~Rahul.
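For what it's worth, the "lookup, then insert only if absent" step described above can be collapsed into a single atomic operation: HBase's `checkAndPut` performs the existence check and the write server-side in one round trip, so the mapper never does a separate read. Below is a minimal, self-contained sketch of that dedupe semantic; the class and method names are hypothetical, and a `ConcurrentHashMap.putIfAbsent` stands in for the `checkAndPut` call so the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the dedupe logic: emit an event only if its UUID has not been
// seen before. In a real MR job the map below would be an HBase table and
// firstSighting() would be a checkAndPut(row, fam, qual, null /* expect
// absent */, put), which is atomic on the region server.
public class Dedupe {
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true if this UUID is new (and records it), false if a duplicate. */
    public boolean firstSighting(String uuid) {
        // putIfAbsent returns null only for the first writer of this key,
        // mirroring checkAndPut's "insert only if the cell is absent".
        return seen.putIfAbsent(uuid, Boolean.TRUE) == null;
    }

    /** Emits each UUID at most once, in first-seen order. */
    public List<String> dedupe(List<String> eventUuids) {
        List<String> emitted = new ArrayList<>();
        for (String uuid : eventUuids) {
            if (firstSighting(uuid)) {
                emitted.add(uuid);
            }
        }
        return emitted;
    }
}
```

Because the check and the insert happen as one operation, two mappers racing on the same UUID cannot both see "absent" and both emit the event, which a separate Get followed by a Put would allow.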