You can also distcp to AWS S3 (http://wiki.apache.org/hadoop/AmazonS3), which you can do as frequently as you like; even after the map/reduce job is done, just ship it over.
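A rough sketch of what that looks like (the bucket name and paths here are made up; in practice you'd put your AWS credentials in fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey in core-site.xml rather than on the command line):

    # ship a finished output directory from HDFS into a dated S3 prefix
    hadoop distcp hdfs://namenode:8020/user/reports \
      s3n://my-backup-bucket/reports/2012-01-03

Run that from cron with a dated prefix like this and every run doubles as a point-in-time copy you can pull back later.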
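For the park / copy-local / snap flow Mac lays out below, the nightly job could be as small as this sketch (the cluster names, NFS mount point, and flag file are all hypothetical; it also assumes both clusters run the same Hadoop version, otherwise use an hftp:// source URI for the distcp):

    #!/bin/sh
    # 1) park the data on the secondary cluster
    hadoop distcp hdfs://prod-nn:8020/user/data hdfs://backup-nn:8020/user/data

    # 2) from the idle secondary cluster, copy down to the NFS mount
    #    sitting on the SAN/NAS (clear last night's copy first, since
    #    -copyToLocal won't overwrite an existing target)
    rm -rf /mnt/nfs-backup/data /mnt/nfs-backup/.COPY_COMPLETE
    hadoop fs -copyToLocal hdfs://backup-nn:8020/user/data /mnt/nfs-backup

    # 3) drop a flag file so the SAN/NAS side knows it's safe to snap
    touch /mnt/nfs-backup/.COPY_COMPLETE

The flag file is just one way to do the coordination Mac mentions in step 3; the snapshot job on the storage side would check for it before snapping.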
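And the restore side is the easy part: once someone has picked the files they need off the NFS mount (or out of a snapshot), getting them back into HDFS is just a put (paths again hypothetical):

    hadoop fs -put /mnt/nfs-backup/data/part-00000 \
      hdfs://prod-nn:8020/user/data/part-00000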
On Tue, Jan 3, 2012 at 4:31 PM, Mac Noland <mcdonaldnol...@yahoo.com> wrote:
> Thanks for the reply Alex. To make sure I understand:
>
> 1) "Park" the data by sending it over to a different cluster on a
> schedule (e.g. nightly is what we offer today on most things).
> 2) Then, from this secondary cluster, which is sitting idle after the
> distcp, do a copy local to an NFS mount pointed at SAN or NAS.
> 3) Then, with some type of coordination (so you're not copying local
> when the backup happens), have the SAN or NAS device snap the data for
> backup.
>
> A simple restore process would then be to allow users read access to
> the NFS-mounted storage so they can pick and choose what they want to
> recover via the SAN or NAS's snapshot feature - or after a "restore" to
> the local file system is completed by the support folks, if they are
> using one of our older systems.
>
> Is that about right?
>
> Mac
>
> ________________________________
> From: alo alt <wget.n...@googlemail.com>
> To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>; Mac Noland <mcdonaldnol...@yahoo.com>
> Sent: Tuesday, January 3, 2012 3:10 PM
> Subject: Re: Hadoop HDFS Backup/Restore Solutions
>
> Hi Mac,
>
> HDFS has at the moment no solution for a complete backup and restore
> process like ITIL or ISO 9000. One strategy could be to "park" the data
> from HDFS that you want to back up to tape: copy it with "distcp" to
> another backup cluster and snapshot from there with SAN mechanisms. Here
> the DataNode (DN) store has to be located on the SAN box.
>
> - Alex
>
> On Tuesday, January 3, 2012, Mac Noland <mcdonaldnol...@yahoo.com> wrote:
> > Good day,
> >
> > I’m guessing this question has been asked a myriad of times, but we’re
> > about to get serious with some of our Hadoop implementations, so I
> > wanted to re-ask to see if I’m missing anything, or if others happen
> > to know whether this might be on a future road map.
> >
> > For our current storage offerings (e.g. NAS or SAN), we give
> > businesses the opportunity to choose 7-, 14-, or 45-day “backups” for
> > their storage. The purpose of the backup isn’t so much that they are
> > worried about losing their current data (we’re RAID’ed and have some
> > stuff mirrored to remote datacenters), but more so that if they were
> > to delete some data today, they can recover from yesterday’s backup.
> > Or the day before’s backup, or the day before that, etc. And to be
> > honest, business units buy a good portion of their backups to make
> > people feel better and fulfill custom contracts.
> >
> > So far with HDFS we haven’t found too many formalized offerings for
> > this specific feature. While I haven’t done a ton of research, the
> > best solution I’ve found is an idea where we’d schedule a job to pull
> > the data locally to a mount that is backed up via our traditional
> > methods. See Michael Segel’s first post on this site:
> > http://lucene.472066.n3.nabble.com/Backing-up-HDFS-td1019184.html
> >
> > Though we’d have to work through the details of what this would look
> > like for our support folks, it looks like something that could
> > potentially fit into our current model. We’d basically need to
> > allocate the same amount of SAN or NAS disk as we have for HDFS, then
> > coordinate a snap on the SAN or NAS via our traditional methods.
> > Not sure what a restore would look like, other than we could give the
> > end users read access to the NAS or SAN mounts so they can pick
> > through what they need to recover, and let them figure out how to get
> > it back into HDFS.
> >
> > For use cases like ours, where we’d need multi-day backups to fulfill
> > business needs, is this kind of what people are thinking or doing?
> > Moreover, is there anything on the Hadoop HDFS road map for providing,
> > for lack of a better word, an “enterprise” backup/restore solution?
> >
> > Thanks in advance,
> >
> > Mac Noland – Thomson Reuters
> >
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com

--

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop <http://twitter.com/#!/allthingshadoop>
*/