On Sun, Jun 26, 2011 at 03:48:20PM -0500, Chris Baker wrote:
> I have been wondering about this for a while. Am I better off having
> backups run parallel or in series?
>
> By running in series, I mean one backup runs at a time. When it finishes,
> another one starts.
>
> By running parallel, I mean that several backups run at once. It seems
> that when backups have to fight over bandwidth, they all end up running
> much more slowly. I have it set up to run four backups at once.
>
> A server that rarely runs more than one backup has achieved throughput as
> high as 24.13 MB/sec. However, the server with four backups has a maximum
> of only 5.71 MB/sec. Bottom line, the four when added up still don't get
> as good a throughput as the single backup.
>
> What does everyone here think?
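Taking the quoted figures at face value, here is a quick check of the arithmetic. This is only a sketch: it assumes the 5.71 MB/sec maximum is a per-backup figure (which is what "the four when added up" suggests) and that all four backups hit that maximum at the same time. The variable names are mine.

```python
# Figures quoted in the question above.
single_mb_s = 24.13      # best throughput seen with one backup running
per_backup_mb_s = 5.71   # best throughput seen per backup with four running
parallel_backups = 4

# Assumption: all four backups reach their maximum at the same moment.
aggregate_mb_s = per_backup_mb_s * parallel_backups
ratio = aggregate_mb_s / single_mb_s

print(f"aggregate of {parallel_backups} backups: {aggregate_mb_s:.2f} MB/sec")
print(f"fraction of the single-backup figure: {ratio:.1%}")
```

Under those assumptions the four backups together come out a little below the single backup, so the gap between the two servers is smaller than the per-backup numbers make it look.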

It depends on a few things. I use rsync almost exclusively, so YMMV with
other backup mechanisms.

What is your i/o subsystem? I have a striped array (RAID 0) over two RAID 6
arrays with 7 drives in each array, so effectively I have 10 spindles. With
this I can handle more i/o load than if I only had one drive.

I would expect the combined throughput of multiple servers backing up in
parallel to be less than the total i/o bandwidth of the disk, since each
write ends up moving the disk heads to a different location, which drops
the effective bandwidth of the disk. However, if you have a RAID controller
with a battery-backed cache, you may find that you can actually get more
bandwidth, up to the point where your backups have saturated the cache.
After that the backups are waiting for the disk heads to move and write
data.

Also, if you have the memory, increasing the readahead helps. When the
first block of a file is requested, the system does a sequential read of
multiple blocks, so those blocks are cached in memory and an rsync read of
the next block doesn't have to wait for another i/o cycle to disk and head
movement to fulfill the request.

In my case I also have a lot of systems backing up across the WAN that are
bandwidth limited to 64 KB/sec. So even if I only had one backup disk, I
could easily handle more than one backup running in parallel.

Also, the backup process itself goes through both read and write cycles. If
you are doing an incremental and have the inode info cached in memory, you
have effectively no i/o bandwidth use on the server while the client is
furiously scanning its disks looking for new data. This allows you to have
another backup writing new data to your backup disk without any contention.
So running multiple backups can let you consume i/o bandwidth that would
otherwise be wasted while processing is primarily happening on the client
(note there is no way to actually control this).
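
To put rough numbers on the WAN case, here is a small sketch. The function names are made up for illustration, the 24.13 MB/sec ceiling is simply borrowed from the single-backup figure in the question (it is not a measurement of my array), and the model ignores seek overhead and the read/write phases described above, so treat the result as an upper bound at best.

```python
import math

KB_PER_MB = 1024  # assumption: "MB" and "KB" are binary units here


def aggregate_mb_per_s(clients: int, client_kb_per_s: float) -> float:
    """Combined input rate of `clients` hosts, each capped at client_kb_per_s."""
    return clients * client_kb_per_s / KB_PER_MB


def max_clients_within(disk_mb_per_s: float, client_kb_per_s: float) -> int:
    """Largest number of capped clients whose combined rate stays at or
    below disk_mb_per_s (ignores seeks and read/write phases)."""
    return math.floor(disk_mb_per_s * KB_PER_MB / client_kb_per_s)


# Ten WAN clients at 64 KB/sec each:
print(aggregate_mb_per_s(10, 64))

# How many 64 KB/sec clients fit under a hypothetical 24.13 MB/sec ceiling:
print(max_clients_within(24.13, 64))
```

The point is only that a handful of 64 KB/sec clients adds up to well under 1 MB/sec, so the network cap, not the disk, is the limit for those hosts.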

If you always do full backups (requiring a full disk read of all files with
rsync), and your backups always start at exactly the same time and stay
synchronized, you will have a very different performance curve from a mixed
set of incrementals and fulls whose i/o patterns are staggered over the
span of the backup window.

When I first set up BackupPC I ran some speed tests varying:

  - amount of input bandwidth (# of clients, and a mix of clients using
    different network bandwidth limits, 64KB -> 5MB/s)
  - type of back end (RAID 6, RAID 1/0, RAID 0 (for testing only))

and I tried to minimize the total amount of backup time needed per test set
of hosts (I had 40 hosts total).

I found that RAID 0 across the disks was the fastest (no surprise there),
and I was able to handle 20 or so of the higher bandwidth hosts running in
parallel. That is most likely because the backups started at the same time
but finished at different times, at which point new backups were started.
So the mix of running backups vs. read/write i/o load would shift as the
backups got done.

Currently I run 10 backups in parallel and have 132 hosts being backed up
nightly, and I only have one or two still running in the morning when we
have a heavy data churn (> 1/2 TB).

My pool reports:

  1258 full backups of total size 61013.99GB.

So I have a pretty small average full backup size (about 48.5 GB).

--
				-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/