Re: Pragmatic cluster backup strategies?
Sounds like Trash is useful for those times when you delete a bunch of files by mistake and can get them back quickly. As you say, it is not a backup strategy, but it is at least a first line of defence.

We had a discussion in the office and came up with the following possible solution, which stems from the technique we currently use for fast MySQL backups. Each of the nodes will have 4 x 3 TB drives on board. We propose to use 2 of the drives on each node for the main data and the other 2 drives for backup. Using LVM, we will be able to take a snapshot of all the nodes at the same time. An LVM snapshot effectively checkpoints the main disk and then takes copies of any changed disk blocks, resulting in a partition that you can mount and that is a view of the node at the snapshot time. We would simply run this from cron on all the machines at the same time (machines are synced with NTP), and this would give us a snapshot of the cluster at a point in time.

The main question I have is this: if the cluster is busy doing something at the point in time we take the snapshot, and we then do a subsequent full restore (after shutting down the cluster etc.), what potential problems might we see with the data nodes? I guess there will be blocks in various random states, but the cluster is essentially restored. Also, I guess we need to apply the same technique to the main namenode and jobtracker machines?

Has anybody ever tried anything like this before? Is it even feasible?
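To make the LVM proposal in this message concrete, here is a minimal sketch of the per-node step that cron would run. It only builds and prints an `lvcreate` snapshot command unless told otherwise. The volume group `vg_data`, the logical volume `hdfs_data`, and the 500G snapshot size are placeholders, not values from this thread. It does not answer the open question above: each node's snapshot is only as consistent as that node's disk was at that instant.

```python
import datetime
import subprocess


def snapshot_command(volume_group, logical_volume, size, when):
    """Build the lvcreate argv for a timestamped snapshot of one logical volume."""
    name = f"{logical_volume}-snap-{when:%Y%m%d-%H%M}"
    origin = f"/dev/{volume_group}/{logical_volume}"
    return ["lvcreate", "--snapshot", "--size", size, "--name", name, origin]


def take_snapshot(volume_group, logical_volume, size, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    argv = snapshot_command(
        volume_group, logical_volume, size, datetime.datetime.now()
    )
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # vg_data, hdfs_data and 500G are assumed names and sizes for illustration.
    take_snapshot("vg_data", "hdfs_data", "500G")
```

Running the same script from cron at the same minute on every node gives the "all nodes at once" behaviour described above, within the accuracy of NTP.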
Re: Pragmatic cluster backup strategies?
Will hadoop fs -rm -rf move everything to the /trash directory or will it delete that as well?

I was thinking along the lines of what you suggest: keep the original source of the data somewhere and then reprocess it all in the event of a problem.

What do other people do? Do you run another cluster? Do you back up specific parts of the cluster? Some form of offsite SAN?
Re: Pragmatic cluster backup strategies?
Hi,

you could set fs.trash.interval to the number of minutes after which rm'd data should be lost forever. Deleted data is moved into .Trash and removed after the configured time.

A second way could be to mount HDFS with mount.fuse and back up your data over that mount into a storage tier. That is not the best solution, but it is a usable one.

cheers,
Alex

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF
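As an illustration of the fs.trash.interval suggestion, the sketch below converts a retention period in days into minutes and renders the corresponding Hadoop `<property>` element, which would normally go into core-site.xml. The seven-day retention is an assumed value, not one recommended in this thread.

```python
import xml.etree.ElementTree as ET


def trash_interval_property(retention_days):
    """Render a <property> element that sets fs.trash.interval in minutes."""
    minutes = int(retention_days * 24 * 60)
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "fs.trash.interval"
    ET.SubElement(prop, "value").text = str(minutes)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Seven days is an assumption chosen only for the example.
    print(trash_interval_property(7))
```

A longer interval gives more time to notice a mistaken delete, at the cost of the deleted data continuing to occupy space until it expires.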
Re: Pragmatic cluster backup strategies?
I am not an expert on the trash, so you probably want to verify everything I am about to say. I believe that trash acts oddly when you try to use it to delete a trash directory. Quotas can potentially end up off when doing this, but I think it still deletes the directory. Trash is a nice feature, but I wouldn't trust it as a true backup. I just don't think it is mature enough for something like that. There are enough issues with quotas that, sadly, most of our users almost always add -skipTrash.

Where I work we do a combination of several different things depending on the project and its requirements. In some cases where there are government regulations involved, we do regular tape backups. In other cases we keep the original data around for some time and can re-import it to HDFS if necessary. In other cases we will copy the data to multiple Hadoop clusters. This is usually for the case where we want to do Hot/Warm failover between clusters. Now, we may be different from most other users because we run lots of different projects on lots of different clusters.

--Bobby Evans
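The message does not say which tool is used to copy data between clusters. One common choice is `hadoop distcp`, and a minimal sketch of building such a command follows. The namenode addresses and the path are hypothetical, and the command is only constructed here, not executed.

```python
def distcp_command(source_namenode, target_namenode, path, update=True):
    """Build a `hadoop distcp` argv that copies one path between two clusters."""
    argv = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ on the target.
        argv.append("-update")
    argv.append(f"hdfs://{source_namenode}{path}")
    argv.append(f"hdfs://{target_namenode}{path}")
    return argv


if __name__ == "__main__":
    # Both namenode addresses and the path are made-up examples.
    print(" ".join(distcp_command(
        "nn-hot.example.com:8020", "nn-warm.example.com:8020", "/data/events"
    )))
```

Passing the resulting list to `subprocess.run(argv, check=True)` on a machine with a configured Hadoop client would start the copy as a MapReduce job.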
Re: Pragmatic cluster backup strategies?
Hi,

That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

Hi,

We are about to build a 10 machine cluster with 40 TB of storage. Obviously, as this gets full, actually trying to create an offsite backup becomes a problem unless we build another 10 machine cluster (too expensive right now). Not sure if it will help, but we have planned the cabinet into an upper and lower half with separate redundant power, and we plan to put half of the cluster in the top and half in the bottom, effectively 2 racks. So in theory we could lose half the cluster and still have copies of all the blocks with a replication factor of 3?

Apart from the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough?

I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other people's experiences/opinions are for backing up cluster data?

Thanks
Darrell.
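One detail behind the "2 racks, replication factor of 3" reasoning: HDFS only spreads the replicas of a block across both halves if it is told which node sits in which half. Without a rack topology, every node lands in the same default rack and all three replicas of a block could end up in one half. The mapping is usually supplied by a small executable named in the topology script property (`topology.script.file.name` in Hadoop 1.x). Below is a minimal sketch of such a script. The host names node01 to node10 and the rack names /upper and /lower are assumptions, and since Hadoop may pass IP addresses rather than names, a real table would need those as well.

```python
#!/usr/bin/env python3
import sys

# Hypothetical layout: node01-node05 in the upper half, node06-node10 below.
RACKS = {f"node{i:02d}": "/upper" for i in range(1, 6)}
RACKS.update({f"node{i:02d}": "/lower" for i in range(6, 11)})
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Map each host name to its rack path, falling back to a default rack."""
    return [RACKS.get(host.split(".")[0], DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts as arguments and reads racks from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

With two racks defined, the default placement policy keeps each block on both racks, so losing one half of the cabinet still leaves at least one replica of every block. As the reply above points out, that protects against hardware loss only, not against a mistaken delete.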
Re: Pragmatic cluster backup strategies?
Yes, you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else, you may be able to recover by reprocessing the data, but if this cluster is your single repository for all your data, you may have a problem.

--Bobby Evans