Vivek,

You are correct, distcp will overwrite a file on the destination if it has changed on the source, and will copy it if it is new. As for running this in real time (i.e., as soon as data is deposited on the source cluster), you will have to handle that yourself.

Please be aware that if you are talking about Hive tables, you will also need the Hive Metastore. We copy our critical data from a Production Cluster to another Production Cluster and to a Test Cluster on a daily basis. We also copy the contents of the Hive Metastore database. Be aware that if you restore the Hive Metastore database on the destination cluster, any tables created solely on the destination cluster may disappear.
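For what it is worth, here is a minimal sketch of how the copy step could be scripted so that deletes on Cluster A do not propagate to Cluster B. It only builds the distcp command line. The cluster URIs and the function name are made up for illustration, and the choice of -update is an assumption about how you would want to run it. The -delete flag (which distcp accepts only together with -update or -overwrite) is left off by default, so files removed from the source stay on the destination.

```python
from typing import List


def build_distcp_command(source: str, destination: str,
                         delete_missing: bool = False) -> List[str]:
    """Build a `hadoop distcp` command line as an argument list.

    -update copies files that are missing on the destination or that
    differ from the source copy. -delete additionally removes files on
    the destination that no longer exist on the source, so it stays
    off for an archival cluster.
    """
    cmd = ["hadoop", "distcp", "-update"]
    if delete_missing:
        # Mirror mode: the destination loses whatever the source has lost.
        cmd.append("-delete")
    cmd += [source, destination]
    return cmd


if __name__ == "__main__":
    # Hypothetical NameNode addresses and paths; substitute your own.
    archive_cmd = build_distcp_command(
        "hdfs://cluster-a:8020/data",
        "hdfs://cluster-b:8020/data",
    )
    print(" ".join(archive_cmd))
```

On a host with the Hadoop client installed, the returned list could be passed to subprocess.run(cmd, check=True) from whatever scheduler you use. How often it runs determines how close to real time the copy on Cluster B is.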
David

From: Vivek Singh Raghuwanshi [mailto:[email protected]]
Sent: Wednesday, February 10, 2016 1:28 PM
To: [email protected]
Subject: Re: Hadoop Backup and Archival Cluster

Thanks David,

I want to replicate the data as soon as it reaches the cluster, and delete it from the source cluster after one year. I want Cluster B to work as a hot backup and archive, with Cluster A holding only the latest data. As far as I know, distcp copies all the data and overwrites it. Please correct me if I am wrong.

On Wed, Feb 10, 2016 at 12:21 PM, David Whitmore <[email protected]> wrote:

Yes, you can run distcp to copy data from one cluster to another. distcp also has an option that controls whether it deletes files on the destination that are NOT on the source.

From: Vivek Singh Raghuwanshi [mailto:[email protected]]
Sent: Wednesday, February 10, 2016 1:16 PM
To: [email protected]
Subject: Hadoop Backup and Archival Cluster

Hi Friends,

I am planning to set up a Hadoop cluster (A) with cluster replication (B), so that once data reaches Cluster A it is replicated to Cluster B. I have one question: if I delete data from Cluster A on the basis of time (for example, data older than one month), is it also removed from Cluster B? If yes, how can I avoid this?

What I want to achieve:
1. Once data reaches Cluster A, it is automatically replicated to Cluster B.
2. Data older than one year is removed automatically from Cluster A, but not from Cluster B.
3. Anyone who wants to query the latest data can use Cluster A, while older data is available on Cluster B.

Regards
--
ViVek Raghuwanshi
Mobile -+91-09595950504
Skype - vivek_raghuwanshi
IRC vivekraghuwanshi
http://vivekraghuwanshi.wordpress.com/
http://in.linkedin.com/in/vivekraghuwanshi
