Hi,
   We have events delivered into our HDFS cluster, some of which may be
duplicates. Each event has a UUID, and we were hoping to leverage HBase to
dedupe them. We would run a MapReduce job that looks up each UUID in HBase,
emits the event only if the UUID is absent, and then inserts the UUID into
the HBase table. (This is simplified; I am leaving out the details that would
make it more resilient to failures.) My concern is that doing a read plus a
write for every event in MR would be slow: we expect around 1 billion events
every hour. Does anyone use HBase for a similar use case, or is there a
different approach to achieving the same end result? Any information or
comments would be great.
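
To make the flow concrete, here is a minimal sketch in Java of the
check-then-insert logic described above. It is only an illustration under
stated assumptions: SeenUuidStore and DedupSketch are hypothetical names,
and the store is an in-memory stand-in for the HBase table, not HBase API.
Against a real table, the lookup and insert would presumably be fused into a
single atomic check-and-put style call so that two tasks cannot both emit
the same UUID. The sketch also works out the sustained rate implied by 1
billion events per hour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the HBase table of seen UUIDs. */
final class SeenUuidStore {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /**
     * Atomically records the UUID. Returns true only the first time a given
     * UUID is offered, which models "lookup, then insert if absent".
     */
    boolean markIfAbsent(String uuid) {
        return seen.add(uuid);
    }
}

final class DedupSketch {
    /** Returns the UUIDs that were not already in the store, in input order. */
    static List<String> dedupe(List<String> uuids, SeenUuidStore store) {
        List<String> emitted = new ArrayList<>();
        for (String uuid : uuids) {
            // Emit the event only when this call was the one that inserted it.
            if (store.markIfAbsent(uuid)) {
                emitted.add(uuid);
            }
        }
        return emitted;
    }

    /** Average events per second for a given hourly volume. */
    static double eventsPerSecond(long eventsPerHour) {
        return eventsPerHour / 3600.0;
    }

    public static void main(String[] args) {
        SeenUuidStore store = new SeenUuidStore();
        List<String> batch = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(dedupe(batch, store)); // prints [a, b, c]
        // Roughly 277,778 check-and-insert operations per second, sustained.
        System.out.println(Math.round(eventsPerSecond(1_000_000_000L)));
    }
}
```

The store persists across calls, so a UUID seen in an earlier batch is
dropped in later ones. That is the property the HBase table would need to
provide across job runs.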

Thanks,
~Rahul.
