Deduplication Effort in Hadoop

jonathan.hwang Thu, 14 Jul 2011 09:03:35 -0700

Hi All,
In databases you can define primary keys to ensure that no duplicate data 
gets loaded into the system.  Let's say I have roughly 1 billion records flowing 
into my system every day and some of these are repeated data (the same records).  I 
can use 2-3 columns in the record to match and look for duplicates.  What is 
the best strategy for de-duplication?  Duplicate records should only appear 
within the last 2 weeks.  I want a fast way to get the data into the system 
without much delay.  Is there any way HBase or Hive can help?
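
To make the requirement concrete, here is a minimal single-machine sketch in Python of the behaviour being asked about. It is only an illustration under assumptions: the column names in KEY_COLUMNS, the function names, and the `seen` dictionary (a stand-in for some keyed store) are all hypothetical, and it reads the 2-week rule as "a duplicate arrives within 14 days of the first copy". It does not address the billion-records-a-day scale.

```python
from datetime import datetime, timedelta

# Assumption: a duplicate only shows up within 14 days of the first copy.
WINDOW = timedelta(days=14)

# Hypothetical names for the 2-3 columns that identify a duplicate.
KEY_COLUMNS = ("user_id", "event_type", "event_time")


def dedup_key(record):
    """Build the composite key from the columns used for matching."""
    return tuple(record[column] for column in KEY_COLUMNS)


def deduplicate(records, seen, now):
    """Return the records whose key was not loaded within WINDOW.

    `seen` maps a composite key to the time it was first loaded. It is
    updated in place and stands in for a persistent keyed store.
    """
    # Drop keys older than the window so the lookup table stays bounded.
    expired = [key for key, loaded_at in seen.items() if now - loaded_at > WINDOW]
    for key in expired:
        del seen[key]

    fresh = []
    for record in records:
        key = dedup_key(record)
        if key in seen:
            continue  # same key already loaded within the window: skip it
        seen[key] = now
        fresh.append(record)
    return fresh
```

The point of the sketch is that the lookup only has to remember two weeks of keys, not the full history, so whatever store holds the keys can expire old entries.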


Thanks!
Jonathan

