Re: Pragmatic cluster backup strategies?

2012-05-31 Thread Darrell Taylor
Sounds like Trash is useful for those times when you delete a bunch of
files by mistake and need to get them back quickly.  As you say, it's not a
backup strategy, but it is at least a first line of defence.

We had a discussion in the office and came up with the following possible
solution, which stems from the technique we currently use for fast MySQL
backups.  Each of the nodes will have 4 x 3TB drives on board.  What we
propose is to use 2 of the drives on each node for the main data and the
other 2 drives for backup.  Using LVM we can take a snapshot on every node
at the same time; what an LVM snapshot effectively does is checkpoint the
main volume and then keep copies of any blocks that subsequently change,
resulting in a partition you can mount that is a view of the node at the
snapshot time.  We would simply run this from cron on all the machines at
the same time (the machines are synced with NTP), which would give us a
snapshot of the cluster at a point in time.  The main question I have is:
if the cluster is busy doing something at the moment we take the snapshot,
and we later do a full restore from it (after shutting down the cluster
etc.), what problems might we see on the data nodes?  I guess there will be
blocks in various in-flight states, but the cluster would be essentially
restored.  I also guess we need to apply the same technique to the main
namenode and jobtracker machines?
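
Roughly, the cron job on each node would look something like this (untested
sketch; the volume group name, snapshot size, and paths are just
placeholders):

    #!/bin/sh
    # Nightly LVM snapshot of the DataNode data volume -- untested sketch.
    # vg_data/dfs, the 50G snapshot size and the paths are all placeholders.
    STAMP=$(date +%Y%m%d-%H%M)

    # Copy-on-write snapshot: freezes a view of the volume; up to 50G of
    # subsequent changes can accumulate before the snapshot fills up.
    lvcreate --snapshot --size 50G --name dfs-snap-"$STAMP" /dev/vg_data/dfs

    mkdir -p /snapshots/"$STAMP"
    mount -o ro /dev/vg_data/dfs-snap-"$STAMP" /snapshots/"$STAMP"

    # Copy the frozen view onto the two backup drives, then drop the snapshot.
    rsync -a /snapshots/"$STAMP"/ /backup/dfs-"$STAMP"/
    umount /snapshots/"$STAMP"
    lvremove -f /dev/vg_data/dfs-snap-"$STAMP"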

Anybody ever tried anything like this before?  Is it even feasible?


Re: Pragmatic cluster backup strategies?

2012-05-30 Thread Darrell Taylor
Will hadoop fs -rm -rf move everything to the /trash directory or will it
delete that as well?

I was thinking along the lines of what you suggest: keep the original
source of the data somewhere and then reprocess it all in the event of a
problem.

What do other people do?  Do you run another cluster?  Do you back up
specific parts of the cluster?  Some form of offsite SAN?


Re: Pragmatic cluster backup strategies?

2012-05-30 Thread alo alt
Hi,

You could set fs.trash.interval to the number of minutes after which you
consider the rm'd data lost forever.  The data will be moved into .Trash and
deleted after the configured time.
A second option is to use mount.fuse to mount HDFS and back up your data over
that mount into another storage tier.  That is not the best solution, but it
is a usable way.
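
For example (the interval value, the paths, and the exact fuse command /
package name are just placeholders and vary by distribution):

    # fs.trash.interval is set in core-site.xml; a value of 1440 minutes
    # would keep deleted files around for roughly a day before purging.
    hadoop fs -rm /data/important.log                # moved under the user's .Trash
    hadoop fs -ls /user/$USER/.Trash/Current/data/   # still recoverable until purged

    # FUSE mount plus a copy out to another storage tier (the fuse_dfs /
    # hadoop-fuse-dfs naming differs between distributions):
    hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs
    rsync -a /mnt/hdfs/projects/ /backup/hdfs/projects/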

cheers,
 Alex 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF


Re: Pragmatic cluster backup strategies?

2012-05-30 Thread Robert Evans
I am not an expert on the trash, so you probably want to verify everything I am
about to say.  I believe that trash acts oddly when you try to use it to delete
a trash directory.  Quota accounting can potentially end up wrong when doing
this, but I think it still deletes the directory.  Trash is a nice feature, but
I wouldn't trust it as a true backup.  I just don't think it is mature enough
for something like that.  There are enough issues with quotas that sadly most
of our users add -skipTrash almost all the time.
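
For reference, the difference looks like this (the path is just an example):

    hadoop fs -rm /data/tmp/part-00000             # goes into .Trash first
    hadoop fs -rm -skipTrash /data/tmp/part-00000  # deleted immediately, no .Trash copy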

Where I work we do a combination of several different things depending on the
project and its requirements.  In some cases where government regulations are
involved we do regular tape backups.  In other cases we keep the original data
around for some time and can re-import it into HDFS if necessary.  In other
cases we copy the data to multiple Hadoop clusters; this is usually where we
want hot/warm failover between clusters.  We may be different from most other
users because we run lots of different projects on lots of different clusters.
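
Copies like that are typically done with distcp; a rough sketch, where the
hostnames, ports, and paths are placeholders:

    # Compatible Hadoop versions: straight hdfs-to-hdfs copy.
    hadoop distcp hdfs://nn-primary:8020/projects/foo hdfs://nn-backup:8020/projects/foo

    # Across different versions, run distcp on the destination cluster and
    # read from the source over hftp instead.
    hadoop distcp hftp://nn-primary:50070/projects/foo hdfs://nn-backup:8020/projects/foo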

--Bobby Evans


Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Michael Segel
Hi,
That's not a backup strategy.
You could still have joe luser take out a key file or directory. What do you do 
then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

 Hi,

 We are about to build a 10 machine cluster with 40TB of storage.  Obviously,
 as this gets full, actually trying to create an offsite backup becomes a
 problem unless we build another 10 machine cluster (too expensive right
 now).  Not sure if it will help, but we have planned the cabinet as an upper
 and a lower half with separate redundant power, and we plan to put half of
 the cluster in the top and half in the bottom, effectively 2 racks, so in
 theory we could lose half the cluster and still have copies of all the
 blocks with a replication factor of 3?  Apart from the data centre burning
 down or some other disaster that would render the machines totally
 unrecoverable, is this approach good enough?

 I realise this is a very open question and everyone's circumstances are
 different, but I'm wondering what other people's experiences/opinions are
 for backing up cluster data?

 Thanks
 Darrell.



Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Robert Evans
Yes, you will have redundancy, so no single point of hardware failure can wipe
out your data, short of a major catastrophe.  But you can still have an errant
or malicious hadoop fs -rm -rf shut you down.  If you still have the original
source of your data somewhere else you may be able to recover by reprocessing
the data, but if this cluster is your single repository for all your data you
may have a problem.
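
One thing to double check on the upper/lower-half idea: HDFS only spreads
replicas across the two halves if it actually knows they are separate racks,
i.e. rack awareness is configured (topology.script.file.name in core-site.xml
pointing at a mapping script).  A rough sketch, with invented hostnames and
rack labels:

    #!/bin/sh
    # Rack topology script -- sketch only; hostnames and rack labels are invented.
    # Hadoop calls it with one or more hostnames/IPs and expects one rack
    # path per argument on stdout.
    for host in "$@"; do
      case "$host" in
        node0[1-5]*)         echo /rack-upper ;;
        node0[6-9]*|node10*) echo /rack-lower ;;
        *)                   echo /default-rack ;;
      esac
    done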

--Bobby Evans
