Re: Backing up HDFS?
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer wrote:

> The key here is to prioritize your data. Impossible-to-replicate data gets
> backed up using whatever means necessary; hard-to-regenerate data is the
> next priority. Data that is easy to regenerate, and OK to nuke, doesn't
> get backed up.

I think that's good advice to start with when creating a backup strategy. For example, what we do at the moment is analyze huge volumes of access logs: we import those logs into HDFS, create aggregates for several metrics, and finally store the results in sequence files using block-level compression. It's a kind of intermediate format that can be used for further analysis. Those files end up being pretty small, and they are exported daily to storage and backed up. In case HDFS goes to hell, we can restore some raw log data from the servers and only lose historical logs, which should not be a big deal.

I must also add that I really enjoy the great number of optimization opportunities Hadoop gives you by letting you implement the serialization strategies directly. You really get control over every bit and byte that gets recorded. Same with compression. So you can make the best trade-offs possible and store only the data you really need.
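To make the "control over every bit and byte" point concrete, here is a minimal, stdlib-only sketch of the Writable-style serialization pattern Hadoop uses. The PageMetric record and its fields are hypothetical, not from the thread; in a real job the class would implement org.apache.hadoop.io.Writable and be written to a SequenceFile opened with CompressionType.BLOCK.

```java
import java.io.*;

// Hypothetical record type: you decide exactly which bytes are written,
// so records stay compact on disk.
public class PageMetric {
    String url;
    long hits;
    long bytesServed;

    PageMetric(String url, long hits, long bytesServed) {
        this.url = url;
        this.hits = hits;
        this.bytesServed = bytesServed;
    }

    // Mirrors Writable#write(DataOutput): serialize field by field.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(hits);
        out.writeLong(bytesServed);
    }

    // Mirrors Writable#readFields(DataInput): read back in the same order.
    static PageMetric readFields(DataInput in) throws IOException {
        return new PageMetric(in.readUTF(), in.readLong(), in.readLong());
    }

    public static void main(String[] args) throws IOException {
        // Round-trip one record through an in-memory byte buffer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new PageMetric("/index.html", 42, 1024).write(new DataOutputStream(buf));
        PageMetric m = PageMetric.readFields(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(m.url + " " + m.hits + " " + m.bytesServed);
    }
}
```

The same hand-rolled layout is what keeps the intermediate files small before block compression is even applied.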
Re: Backing up HDFS?
Allen Wittenauer wrote:
> On 2/9/09 4:41 PM, "Amandeep Khurana" wrote:
>> Why would you want to have another backup beyond HDFS? HDFS itself
>> replicates your data, so the reliability of the system shouldn't be a
>> concern (if at all it is)...
>
> I'm reminded of a previous job where a site administrator refused to make
> tape backups (despite our continual harassment and our pointing out that
> he was in violation of the contract) because he said RAID was "good enough".
>
> Then the RAID controller failed. When we couldn't recover data "from the
> other mirror", he was fired. Not sure how they ever recovered, especially
> considering what the data was that they lost. Hopefully they had a paper trail.

Hope that wasn't at SUNW; probably not, given they do their own controllers.

1. Controller failure is lethal, especially if you don't notice it for a while.
2. Some products -- say, databases -- didn't like live updates, so a trick evolved of taking part of the RAID array offline and putting that to tape. Of course, then there's the problem of what happens there.
3. Tape is still very power-efficient; good for a bulk off-site store (or a local fire safe).
4. Over at Last.fm, they had an accidental rm on their primary dataset. Fortunately, they apparently did have another copy somewhere else.

And now that HDFS has user IDs, you can prevent anyone but the admin team from accidentally deleting everyone's data.
Re: Backing up HDFS?
We copy selected files over from HDFS to KFS and use an instance of KFS as the backup file system. We use distcp to take the backup.

Lohit

----- Original Message -----
From: Allen Wittenauer
To: core-user@hadoop.apache.org
Sent: Monday, February 9, 2009 5:22:38 PM
Subject: Re: Backing up HDFS?

[Allen's message quoted in full elsewhere in the thread; trimmed here.]
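Lohit's approach can be sketched as driving `hadoop distcp` from a small wrapper. This is a minimal illustration only: the namenode/metaserver URIs and the paths are hypothetical placeholders, and an actual run requires a Hadoop installation on the PATH.

```java
import java.util.Arrays;
import java.util.List;

public class DistcpBackup {
    // Build the distcp command line for copying a source tree to a
    // backup file system. Both URIs here are made-up examples.
    static List<String> distcpCommand(String srcUri, String dstUri) {
        return Arrays.asList("hadoop", "distcp", srcUri, dstUri);
    }

    public static void main(String[] args) {
        List<String> cmd = distcpCommand(
            "hdfs://namenode:8020/data/reports",
            "kfs://backup-metaserver:20000/backups/reports");
        // Print the command that would be executed.
        System.out.println(String.join(" ", cmd));
        // To actually run it (needs Hadoop installed):
        //   new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Since distcp runs as a MapReduce job, the copy itself is parallelized across the cluster, which is what makes it practical for multi-terabyte backups.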
Re: Backing up HDFS?
Hey,

There's also a ticket open to enable global snapshots for a single HDFS instance: https://issues.apache.org/jira/browse/HADOOP-3637. While this doesn't solve the multi-site backup issue, it does provide stronger protection against programmatic deletion of data in a single cluster.

Regards,
Jeff

On Mon, Feb 9, 2009 at 5:22 PM, Allen Wittenauer wrote:
> [quoted text trimmed]
Re: Backing up HDFS?
On 2/9/09 4:41 PM, "Amandeep Khurana" wrote:
> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...

I'm reminded of a previous job where a site administrator refused to make tape backups (despite our continual harassment and our pointing out that he was in violation of the contract) because he said RAID was "good enough".

Then the RAID controller failed. When we couldn't recover data "from the other mirror", he was fired. Not sure how they ever recovered, especially considering what the data was that they lost. Hopefully they had a paper trail.

To answer Nathan's question:

On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz wrote:
> How do people back up their data that they keep on HDFS? We have many TB
> of data which we need to get backed up, but are unclear on how to do this
> efficiently/reliably.

The content of our HDFS instances is loaded from elsewhere and is not considered 'the source of authority'. It is the responsibility of the original sources to maintain backups, and we then follow their policies for data retention. For user-generated content, we provide *limited* (read: quota'd) NFS space that is backed up regularly.

Another strategy we take is multiple grids in multiple locations that get the data loaded simultaneously.

The key here is to prioritize your data. Impossible-to-replicate data gets backed up using whatever means necessary; hard-to-regenerate data is the next priority. Data that is easy to regenerate, and OK to nuke, doesn't get backed up.
Re: Backing up HDFS?
On Feb 9, 2009, at 6:41 PM, Amandeep Khurana wrote:
> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...

It should be. HDFS is not an archival system. Multiple replicas do not equate to a backup, just as having RAID1 or RAID5 shouldn't make you feel safe. HDFS is actively developed, with lots of new features, and bugs creep in. Things can become inconsistent and mis-replicated. Even though the chance of loss due to hardware failure is small, losses due to new bugs are still possible!

Brian

> Amandeep
>
> [remainder of quoted message trimmed]
Re: Backing up HDFS?
Replication only protects against single-node failure. If there's a fire and we lose the whole cluster, replication doesn't help. Or if there's human error and someone accidentally deletes data, then it's deleted from all the replicas. We want our backups to protect against all of these scenarios.

On Feb 9, 2009, at 4:41 PM, Amandeep Khurana wrote:
> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...
>
> [remainder of quoted message trimmed]
Re: Backing up HDFS?
Why would you want to have another backup beyond HDFS? HDFS itself replicates your data, so the reliability of the system shouldn't be a concern (if at all it is)...

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz wrote:
> How do people back up their data that they keep on HDFS? We have many TB
> of data which we need to get backed up, but are unclear on how to do this
> efficiently/reliably.
Backing up HDFS?
How do people back up the data they keep on HDFS? We have many TB of data which we need to back up, but are unclear on how to do this efficiently/reliably.