Re: [BackupPC-users] Linux backups with rsync vs tar
Hi,

Timothy J Massey wrote on 2011-09-02 10:43:37 -0400 [Re: [BackupPC-users] Linux backups with rsync vs tar]:
> charlesboyo wrote on 08/31/2011 05:53:43 AM:
> [...]
> > Thus I have reason to suspect the rsync overhead as being guilty.

for the record, I've just (finally!) switched from tar to rsync for a data server, and this significantly increased run times of backups. Incremental backups are taking about three times as long as they used to (if memory serves correctly). So, yes, rsync does have a significant overhead. That is not surprising.

I should add that in my case the server is backing up local file systems (to an iSCSI disk). The effect should be less significant if client and server are not the same machine (though it is a quad-core and the disk sets are independent).

> > Note that I have disabled hard links,

??? What is that supposed to mean? You removed the rsync "-H" option? When you're talking about the pool, "disabling hard links" sounds rather troubling ;-).

> > implemented checksum caching, increased the block size to 512 KB
> > and enable --whole-file to no avail.

I don't think File::RsyncP supports changing block size (and probably not --whole-file either).

> > 1. since over 90% of the files change every day and "incremental"
> > backups involve transferring the whole file to the BackupPC server,
> > won't it make better sense to just run a full backup everyday?

There is probably not much difference either way.

> [...]
> You may find that trading CPU performance for network performance may not
> be a good trade in your case.

That is true, but there are other reasons for using rsync rather than tar. Exactness of backups. tar incrementals don't reflect deleted files, for instance. Though, in *this* specific case that may not make much difference (presuming I'm correct in assuming your mbox files are never deleted or renamed, extracted from tar-/zip-files, etc., or at least that missing such a change until the next full backup is unproblematic).
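The point about tar incrementals not reflecting deletions can be sketched with a toy model (all function and variable names here are illustrative, not BackupPC code):

```python
# Toy model of why tar-style incrementals miss deletions.
# Files are modeled as a {path: bytes} dict; this is NOT BackupPC's logic.

def tar_incremental(client_files, last_backup):
    """A tar incremental sends only files that are new or changed
    since the reference backup; it carries no record of deletions."""
    return {path: data for path, data in client_files.items()
            if last_backup.get(path) != data}

def merge_view(last_backup, incremental):
    """Restoring 'full + incremental' layers the incremental on top,
    so a file deleted on the client silently survives in the result."""
    view = dict(last_backup)
    view.update(incremental)
    return view

def rsync_view(client_files):
    """rsync exchanges a complete file list, so a backup made with it
    reflects deletions immediately."""
    return dict(client_files)
```

For example, if mailbox `b` is deleted on the client, `merge_view` still contains it, while `rsync_view` does not — the inexactness only gets corrected at the next full.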
> The number one question I have is: is this really a problem? If you have
> a backup window that allows this, I would not worry about it. If you do
> *not*, then rsync might not be for you.

That's exactly the point. In my case, it is *not* a problem, so I prefer more accurate backups, even if the fulls take all of the night. Thank you for reminding me to shift the full run to the weekend, which I'll do right now :-).

> 2) Les' point about the format of the files (one monolithic file for each
> mailbox vs. one file per e-mail) is dead on. That allows 99% of the files
> to remain untouched once they're backed up *once*. That will *vastly*
> reduce the backup times.

... and pool storage requirements, and rsync will handle small files much better. Sadly enough, there is still enough braindead software around that doesn't support maildir format, even in the Unix world. Open Source probably means that I should start hacking the $#@%volution sources ...

> > 2. from Pavel's questions, he observed that BackupPC is unable to
> > recover from interrupted tar transfer. Such interruptions simply
> > cannot happen in my case. Should I switch to tar?

Your situation is completely different from his - you're on a local network, aren't you? You don't need the bandwidth reductions you gain from rsync - tar should work fine for you. *But* you should consider whether (incremental) tar backups will be sufficiently accurate. Since you are transferring almost everything anyway, you could even run only full tar backups.

For the sake of completeness, I should mention that full backups always rebuild the entire tree (in BackupPC storage). In the general case, this can raise storage requirements (for directory entries and duplicates due to exceeding HardLinkMax), but in your case I wouldn't expect much difference. However, with tar, backup exactness would benefit.
> > And in the unlikely event that the transferred does get interrupted, what
> > mechanisms do I need to implement to resume/recover from the failure?
>
> To repeat another response: restart the backup...

To extend on that: BackupPC does that automatically at the next wakeup, so you don't really need to do anything (apart from having a reasonable WakeupSchedule).

> > 3. What is the recommended process for switching from rsync to tar -

Change $Conf{XferMethod} to 'tar' :-). Add/rename the other settings as needed.

> > since the format/attributes are reportedly incompatible?

They're only slightly different. The only "problem" is that when switching *from tar to rsync* (which you're not doing), rsync will re-transfer everything, because it appears to have changed from [...] to [...].
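For reference, the switch itself is a small per-host override; a minimal sketch (the share path below is hypothetical, not taken from this thread, and other tar-related settings may need adjusting as noted above):

```perl
# Sketch of a per-host BackupPC config override switching the
# transfer method from rsync to tar. Illustrative only.
$Conf{XferMethod}   = 'tar';
$Conf{TarShareName} = ['/var/mail'];   # hypothetical share to back up
```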
Re: [BackupPC-users] Linux backups with rsync vs tar
"Jeffrey J. Kosowsky" wrote on 09/02/2011 02:37:31 PM:
> Timothy J Massey wrote at about 10:43:37 -0400 on Friday, September 2, 2011:
> > Your old backups should be 100% fine. They will remain in the pool just
> > fine, etc. I do not believe that files transferred by rsync will pool
> > with files transferred by tar (due to the attribute issue you mention);
> > however, for you that's a moot point: 90% of your files don't pool,
> > anyway.
>
> Why do you think they won't pool?

You ignored the part where I said to take what I wrote with a grain of salt. I was extrapolating from what the questioner said. The point was that it DOES NOT MATTER in this case. Having said that, thank you for the details, and for correcting the record.

> > This is not a *bad* thing. Every single one of my backup servers is based
> > on BackupPC, and all but maybe 2 shares are backed up using rsync. (The
> > only exceptions I can think of are where I'm backing up data on a NAS, and
> > I can't or won't run rsyncd on the NAS so I have to use SMB). Whether
> > it's an advantage or disadvantage, that's the setup I use. I vastly
> > prefer consistency over performance. But I can live with 8 hour backup
> > windows.
>
> Why not run rsyncd on a NAS? It works fine and is reasonably fast even
> on low end arm-based devices with minimal memory (e.g., 64MB).

Because I do not control the firmware on the NAS devices, rooting the device and adding random software does not appeal to me, there is (for my purposes) zero downside to using SMB, and if I wanted to deal with a random collection of vendor-supplied and custom-added code, I probably would not have selected a NAS in the first place.

Timothy J. Massey
Out of the Box Solutions, Inc.
Creative IT Solutions Made Simple!
http://www.OutOfTheBoxSolutions.com
tmas...@obscorp.com
22108 Harper Ave.
St. Clair Shores, MI 48080
Office: (800)750-4OBS (4627)
Cell: (586)945-8796

-- Special Offer -- Download ArcSight Logger for FREE!
Finally, a world-class log management solution at an even better price-free! And you'll get a free "Love Thy Logs" t-shirt when you download Logger. Secure your free ArcSight Logger TODAY!
http://p.sf.net/sfu/arcsisghtdev2dev
___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
Re: [BackupPC-users] Linux backups with rsync vs tar
Timothy J Massey wrote at about 10:43:37 -0400 on Friday, September 2, 2011:
> Your old backups should be 100% fine. They will remain in the pool just
> fine, etc. I do not believe that files transferred by rsync will pool
> with files transferred by tar (due to the attribute issue you mention);
> however, for you that's a moot point: 90% of your files don't pool,
> anyway.

Why do you think they won't pool? Pooling is based on file *content*. "Attributes" are stored in separate 'attrib' files. Even so, I'm not sure why the basic file attributes would be different between rsync and tar -- but even if they were, it would only mean that the attrib files wouldn't pool with old attrib files, and that's typically a small proportion of the pool by volume.

The only issue I can imagine is with rsync checksums. I'm not sure what happens with such files when you move from rsync to tar. I would hope that it would still pool them properly, either by ignoring or deleting the checksums at the end of the file. Again, the actual file contents (which obviously don't include the checksums) are the same between rsync and tar.

> This is not a *bad* thing. Every single one of my backup servers is based
> on BackupPC, and all but maybe 2 shares are backed up using rsync. (The
> only exceptions I can think of are where I'm backing up data on a NAS, and
> I can't or won't run rsyncd on the NAS so I have to use SMB). Whether
> it's an advantage or disadvantage, that's the setup I use. I vastly
> prefer consistency over performance. But I can live with 8 hour backup
> windows.

Why not run rsyncd on a NAS? It works fine and is reasonably fast even on low end arm-based devices with minimal memory (e.g., 64MB).
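The point that pooling keys on file content rather than transfer method can be illustrated with a toy pool (the hashing scheme is simplified; BackupPC's real pool layout and collision handling differ):

```python
import hashlib

def pool_store(pool, content):
    """Store content keyed by a hash of the bytes alone.
    Two transfers of identical bytes collapse to one pool entry,
    regardless of which XferMethod delivered them."""
    key = hashlib.md5(content).hexdigest()
    pool.setdefault(key, content)   # only the first copy is kept
    return key

# Per-backup metadata (ownership, mode, mtime) would live outside this
# pool, analogous to BackupPC's separate 'attrib' files.
```

So even if rsync and tar recorded attributes differently, only the small attrib files would fail to pool; the file bodies still deduplicate.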
Re: [BackupPC-users] Linux backups with rsync vs tar
charlesboyo wrote on 08/31/2011 05:53:43 AM:
> I'm using BackupPC to take daily backups of a maildir totaling 250
> GB with average file sizes of 500 MB (text mailboxes, these files
> change everyday).
> Currently, my setup take full backups once a week and incremental
> backups every day between the full backups. The servers are directly
> connected with a cross-cable, allowing 100 Mbps.

I have a very similar setup with several servers. They are often connected using 100Mb/s just because the clients haven't upgraded to Gb switches. Also, they back up IBM Lotus Domino servers. In Domino, each mail user has their own mail database which is typically gigabytes big (except with this thing called DAOS, but even then they're still hundreds of MB big). This is pretty comparable to your environment, though my *total* size is not usually 250GB of just mail data... I have file servers that are bigger, but not mail servers.

(I have some servers that back up Microsoft Exchange servers. This is even worse: one monolithic file for the *ENTIRE* mailstore. U G L Y... And incrementals *ARE* fulls! :) )

> However, these backups take about 8 hours to complete, averaging 8
> Mbps and the BackupPC server is CPU-bound through-out the entire
> process.

Fulls or incrementals or both? If truly 90% of your files are changing daily, I'm going to assume both. There will be *very* little difference between a full backup and an incremental.

> Thus I have reason to suspect the rsync overhead as being guilty.
> Note that I have disabled hard links, implemented checksum caching,
> increased the block size to 512 KB and enable --whole-file to no avail.

I have done zero tuning of the rsync command: I use a 100% stock BackupPC command line for it.

> 1. since over 90% of the files change every day and "incremental"
> backups involve transferring the whole file to the BackupPC server,
> won't it make better sense to just run a full backup everyday?
Incremental backups end up with a whole new file, but when using rsync it does not do it by transferring the whole file. The rsync protocol works by sending just the changed parts of the file. HOWEVER, the whole file is read on *BOTH* ends of the connection, so it doesn't save you a *BIT* of disk I/O: it only saves you NETWORK I/O. Seeing as you have only 100Mb/s between them, that will improve performance, but not dramatically, and as you have found, it exacts a CPU hit in order to do this. You may find that trading CPU performance for network performance may not be a good trade in your case.

Having said that, I run BackupPC on about the slowest systems you can actually buy new today: VIA EPIA EN 1500 system boards with 512MB RAM. Terrible performance, but they meet my BackupPC needs just *fine*. Hard numbers on the nearest Domino server to me: 60GB total backed up for a full, 18GB for an incremental (this is a DAOS server). Fulls take about 150 minutes, incrementals take about 40. 1/4 the data, 1/4 the time. And that's on the miserable hardware I described. Scaling that up to your sizes, that would take about 600 minutes, or 10 hours. So, the 8 hours that you're seeing sounds reasonable.

The number one question I have is: is this really a problem? If you have a backup window that allows this, I would not worry about it. If you do *not*, then rsync might not be for you.

To address a couple of things said in other replies:

1) Avoiding building a file list is pointless. It takes my servers just a couple of minutes. It may certainly use RAM, but that is only an issue if you have millions of files. And in that case, simply add more RAM. I'm a glutton for punishment running with 512MB of RAM (and actually, I use 2GB in new servers now: I just like to twist Les' tail! :) ).

2) Les' point about the format of the files (one monolithic file for each mailbox vs. one file per e-mail) is dead on.
That allows 99% of the files to remain untouched once they're backed up *once*. That will *vastly* reduce the backup times. (That DAOS thing does a similar thing for Domino by breaking out attachments into individual files, and hashing and pooling them in a manner very similar to a BackupPC pool, BTW. Before DAOS, my fulls and incrementals were indistinguishable; now they're 4:1 size-wise. Plus a 50% reduction in total disk usage. But I digress.)

However, be aware that you then substitute the "my backups are taking a long time and don't pool" problem with a "now I have to manage several *MILLION* files!" problem. fsck can become a major issue in that case -- with 250GB of e-mail, even ls can be a major issue! Both have advantages and disadvantages. Just be aware that it's not a clear win either way. And you might not have a choice, making the argument moot.

Now, for tar. Take my information with a grain of salt: I have *never* run tar with BackupPC...

> 2. from Pavel's questions, he observed that B
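The "send just the changed parts" behavior described above can be sketched as a much-simplified delta pass. (Real rsync uses a rolling weak checksum plus a strong hash and far larger blocks; this toy version just hashes fixed blocks and slides one byte at a time on a miss.)

```python
import hashlib

BLOCK = 4  # tiny block for illustration; real rsync blocks are far larger

def signatures(old):
    """Receiver side: checksum each block of the copy it already has."""
    return {hashlib.md5(old[i:i + BLOCK]).hexdigest(): i // BLOCK
            for i in range(0, len(old), BLOCK)}

def delta(new, sigs):
    """Sender side: emit ('ref', block_index) for blocks the receiver
    already holds, and ('lit', byte) for data that must cross the wire.
    Note both sides still read their whole file -- only network I/O shrinks."""
    out, i = [], 0
    while i < len(new):
        h = hashlib.md5(new[i:i + BLOCK]).hexdigest()
        if h in sigs:
            out.append(('ref', sigs[h]))
            i += BLOCK
        else:
            out.append(('lit', new[i:i + 1]))
            i += 1
    return out
```

Inserting a single byte into a 16-byte "file" yields a delta of four block references and one literal byte, which is exactly why rsync saves bandwidth but not disk reads.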
Re: [BackupPC-users] Linux backups with rsync vs tar
On Wed, Aug 31, 2011 at 4:53 AM, charlesboyo wrote:
>
> I'm using BackupPC to take daily backups of a maildir totaling 250 GB with
> average file sizes of 500 MB (text mailboxes, these files change everyday).

'Maildir' usually refers to a format where each message is in its own file. However, this sounds like a directory of mailbox-format files, where each file consists of many messages appended together and is modified on every change. Maildir format is much more 'backup friendly' because older messages don't change that often.

> However, these backups take about 8 hours to complete, averaging 8 Mbps and
> the BackupPC server is CPU-bound through-out the entire process. Thus I have
> reason to suspect the rsync overhead as being guilty.
> Note that I have disabled hard links, implemented checksum caching, increased
> the block size to 512 KB and enable --whole-file to no avail.

Rsync isn't that great with large files that change. Normally it will copy parts of the file from the previous version to merge with the changes being sent, resulting in a lot of extra disk traffic (and Linux normally reports disk wait time as CPU time). The --whole-file option should change that behavior, but I'm not sure how it is implemented in BackupPC's version of rsync.

> With this background, I will appreciate answers to the following questions:
>
> 1. since over 90% of the files change every day and "incremental" backups
> involve transferring the whole file to the BackupPC server, won't it make
> better sense to just run a full backup everyday?

Changing to maildir format storage might change that. 'Full' backups with rsync tend to be slow because it is still going to read the whole directory on the target, and it will do a full read of all the files that are still in common with the previous run to do a checksum verification.

> 2. from Pavel's questions, he observed that BackupPC is unable to recover
> from interrupted tar transfer.
> Such interruptions simply cannot happen in my case. Should I switch to tar?
> And in the unlikely event that the transferred does get interrupted, what
> mechanisms do I need to implement to resume/recover from the failure?

Yes, tar will be faster for mailbox format, where the files are all going to be changed. Recovery is only an issue where bandwidth limits make the time a problem. If you have a problem, your recovery is to just do another run.

> 3. What is the recommended process for switching from rsync to tar - since
> the format/attributes are reportedly incompatible? I would like to preserve
> existing compressed backups as much as possible.

Not sure about that. With mailbox format you aren't going to have much pooling anyway.

--
Les Mikesell
lesmikes...@gmail.com
Re: [BackupPC-users] Linux backups with rsync vs tar
tar is faster since it doesn't spend hours building a file list should there be thousands or millions of files involved.
Re: [BackupPC-users] Linux backups with rsync vs tar
On 08/31 02:53, charlesboyo wrote:
> 1. since over 90% of the files change every day and "incremental" backups
> involve transferring the whole file to the BackupPC server, won't it make
> better sense to just run a full backup everyday?
> 2. from Pavel's questions, he observed that BackupPC is unable to recover
> from interrupted tar transfer. Such interruptions simply cannot happen in my
> case. Should I switch to tar? And in the unlikely event that the transferred
> does get interrupted, what mechanisms do I need to implement to
> resume/recover from the failure?

Tar can often give several times greater performance than rsync, if most of the files are being transferred. In your case, I would suggest experimenting with it and seeing how much of a difference it makes.

As for recovering from tar errors: I would think it just involves re-running the backup.

As for how to make the change: just edit your per-host config file (/etc/backuppc/<hostname>.pl). I don't know how extensive the compatibility problems are; some experimentation may be in order.

--
Carl Soderstrom
Systems Administrator
Real-Time Enterprises
www.real-time.com