Timothy J. Massey wrote:
> > > As a start, how about a utility that simply clones one host to another
> > > using only the pc/host directory tree, and assumes that none of the
> > > source files are in the pool, just like it would during a brand-new
> > > rsync backup?
> >
> > That would be better than nothing, but if you have multiple full runs
> > that you want to keep you'll have to transfer a lot of duplicates that
> > could probably be avoided.
>
> Correct. But it's a proof of concept that can be refined. I understand
> that some sort of inode or hash caching is required. But the first step
> can be done with the parts we've already got.
Agreed, but it's a lot easier to design in the out-of-band info you'll
need later than to try to figure out where to put it afterwards.
> > But what is the advantage over just letting the remote server make its
> > run directly against the same targets?
>
> I thought a lot of Holger's points were good. But for me, it comes down
> to two points:
>
> Point 1: Distributing Load
> ===========================
> I have hosts that take, across a LAN, 12 hours to back up. The deltas
> are not necessarily very big: there's just *lots* of files. And these
> are reasonably fast hosts: >2GHz Xeon processors, 10k and 15k RPM
> drives, hardware SCSI (and now SAS) RAID controllers, etc.
>
> I want to store the data in multiple places, both on the local LAN and
> in at least 2 remote locations. That would mean 3 backups. It's
> probably not going to take 36 hours to do that, but it's going to take a
> *lot* more than 12...
>
> Other times, it's not the host's fault, but the Internet connection.
> Maybe it's a host that's behind a low-end DSL that only offers 768k up
> (or worse). It's hard enough to get *one* backup done over that, let
> alone two.
>
> So how can I speed this up?
Brute force approach: park a Linux box with a big disk on the local LAN
side. Do scripted stock rsync backups to this box to make full
uncompressed copies, with each host in its own directory. It's not as
elegant as a local BackupPC instance, but you get quick access to a copy
locally, plus you offload any issues you might have in the remote
transfer. I actually use this approach in several remote offices,
taking advantage of an existing box that also provides VPN and some file
shares. One upside is that you can add the -C option to the ssh
command that runs rsync to get compression on the transfer (although if
I were starting over, I'd use openvpn as the VPN and add compression
there).
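Something along these lines is all I mean by "scripted" (the hostnames,
paths, and excludes here are just placeholders, not my actual setup):

  #!/bin/sh
  # Pull a full, uncompressed copy of each host into its own
  # directory on the local staging box; the -C on ssh compresses
  # the data in transit.
  for h in host1 host2 host3; do
      rsync -a --delete --numeric-ids \
          --exclude=/proc --exclude=/sys --exclude=/dev \
          -e "ssh -C" root@"$h":/ /backups/"$h"/
  done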
> And once one remote BackupPC server has the data, the
> rest can get it over the very fast Internet connections that they have
> between them. So I only have to get the data across that slow link
> once, and I can still get it to multiple remote locations.
For this case you might also want to do a stock rsync copy of the
backups on the remote LAN to an uncompressed copy at the central
location, then point 2 or more backuppc instances that have faster
connections at that copy. Paradoxically, stock rsync with the -z option
can move data more efficiently than just about anything else, but it
requires the raw storage at both ends to be uncompressed.
This might be cumbersome if you have a lot of individual hosts to add
but it isn't bad if everyone is already saving the files that need
backup onto one or a few servers at the remote sites.
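As a rough sketch of that copy step (hostnames and paths invented for
the example):

  # Pull the remote site's uncompressed copies across the slow link;
  # -z compresses the stream in transit, and on later runs rsync only
  # moves the parts of files that changed.
  rsync -az --delete --numeric-ids \
      remote-office:/backups/ /srv/staging/remote-office/

Then the backuppc instances with fast connections back up
/srv/staging/remote-office instead of going across the slow link
themselves.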
As I've mentioned before, I raid-mirror to an external drive weekly to
get an offsite copy.
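For anyone who wants to copy that idea, one way to do the rotation with
Linux md RAID1 is roughly the following (device names are examples
only; my own setup may differ in the details):

  # Attach this week's external disk to the mirror, let it sync,
  # then detach it and take it offsite.
  mdadm /dev/md0 --add /dev/sdc1
  mdadm --wait /dev/md0               # block until the resync finishes
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1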
> On top of this, the BackupPC server has a much easier task to replicate
> a pool than the host does in the first place. Pooling has already been
> taken care of. We *know* which files are new, and which ones are not.
I don't think you can count on any particular relationship between local
and remote pools.
> There are only two things the replication needs to worry about: 1)
> transferring the new files and seeing if they already exist in the new
> pool, and 2) integrating these new files into the remote server's own pool.
That happens now if you can arrange for the rsync method to see a raw
uncompressed copy. I agree that a more elegant method could be written,
but so far it hasn't been.
> Point 2: Long-term Management of Data
LVM on top of RAID is probably the right approach for being able to
maintain an archive that needs to grow and have failing drives replaced.
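Growing the archive then becomes routine; roughly (volume group, LV,
and device names are invented for the example):

  # Add a new RAID array to the volume group, then grow the pool LV
  # and the filesystem on it.
  pvcreate /dev/md1
  vgextend backupvg /dev/md1
  lvextend -L +500G /dev/backupvg/pool
  resize2fs /dev/backupvg/pool    # ext3 can usually grow while mounted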
> However, with the ability to migrate hosts from one server to another, I
> can have tiers of BackupPC servers. As hosts are retired, I still need
> to keep their data. 7 years was not chosen for the fun of it: SOX
> compliance requires it. However, I can migrate it *out* of my
> first-line backup servers onto secondary servers.
Again there is a brute force fix: keep the old servers with the old data
but add new ones at whatever interval is necessary to keep current data.
You'll have to rebuild the pool of any still-existing files, but as a
tradeoff you get some redundancy.
> If my backup load
> increases to the point where one server can no longer handle it, I
> can divide that load across multiple servers *without* losing any history.
An additional approach here would be to make the web interface aware of
multiple servers so you don't have to put everything in one box.
> I have a feeling that many people who run BackupPC might be somewhat
> cavalier with historical data. I *know* I have been: between
> permanently archiving content every 3 months and not keeping more than
> a few weeks' worth of data on a server, none of this was a hardship.
Most of our real data has its own concept of history built in. That
is, from a current backup of the source code repository you can
reconstruct anything that has ever been added to it. The accounting
system likewise has its own way to report on any period covered in its
database if you have a current copy. There is not much reason to worry
about anything but the current versions of things like that. SOX
compliance is something else, though.
> Trying to keep historical information in place and accessible for
> *years* makes that much harder. It is *guaranteed* that over the 7-year
> life of that data, it's going to live on at least 2 different servers,
> and likely three. The idea of needing to stay locked into the
> configuration that was put in place 7 years ago is *not* appealing.
> Not being able to move hosts from one server to another, where the
> pool is one monolithic block forever, is worrisome.
There is always the image-copy/grow filesystem method.
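That is, copy the old filesystem image onto a bigger device and then
expand it into the extra space, e.g. (device names are examples; the
target must be at least as large as the source):

  # Copy the filesystem image to the larger device, check it, then
  # grow it to fill the new space.
  dd if=/dev/sdb1 of=/dev/sdc1 bs=1M
  e2fsck -f /dev/sdc1
  resize2fs /dev/sdc1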
> Maybe I'm just abusing BackupPC beyond what it was intended to do.
> That's fine, too. But adding the ability to migrate a host from one
> pool to another does not have to change a *single* thing about BackupPC
> at all. It's kind of like archiving hosts: it's a feature you can use,
> or ignore.
Don't take anything I've said about the workarounds to imply that I
don't agree that there should be a better way to move/replicate hosts
and cascade servers.
--
Les Mikesell
[EMAIL PROTECTED]