Hi all,

In databases you can define primary keys to ensure that no duplicate data gets loaded into the system. Let's say I have about 1 billion records flowing into my system every day, and some of them are repeats (identical records). I can match on 2-3 columns of each record to detect duplicates. What is the best strategy for de-duplication? Duplicates should only appear within the last 2 weeks of data. I want a fast way to get the data into the system without much delay. Can HBase or Hive help in any way?
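To make the requirement concrete, here is a minimal in-memory sketch of the behaviour I am after. It is only an illustration: the function name `dedupe`, the `ts` timestamp field, and the example column names are hypothetical, and it assumes records arrive roughly in timestamp order.

```python
from datetime import datetime, timedelta

# Assumption: a duplicate only ever shows up within two weeks of the original.
WINDOW = timedelta(days=14)


def dedupe(records, key_columns, window=WINDOW):
    """Yield only records whose key was not already kept within `window`.

    Each record is a dict with a 'ts' datetime field (hypothetical name).
    """
    last_kept = {}  # composite key -> timestamp of the last record kept
    for rec in records:
        key = tuple(rec[c] for c in key_columns)
        prev = last_kept.get(key)
        if prev is not None and rec["ts"] - prev <= window:
            continue  # same key seen inside the window: treat as duplicate
        last_kept[key] = rec["ts"]
        yield rec


records = [
    {"a": 1, "b": "x", "ts": datetime(2024, 1, 1)},
    {"a": 1, "b": "x", "ts": datetime(2024, 1, 5)},   # duplicate, dropped
    {"a": 2, "b": "x", "ts": datetime(2024, 1, 6)},
    {"a": 1, "b": "x", "ts": datetime(2024, 1, 20)},  # outside window, kept
]
unique = list(dedupe(records, ["a", "b"]))
```

At a billion records a day this dictionary would not fit in memory, so the key lookup would have to live in some external store; that is the part I am asking about.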
Thanks!
Jonathan