Re: [BackupPC-users] Linux backups with rsync vs tar

2011-09-03 Thread Holger Parplies
Hi,

Timothy J Massey wrote on 2011-09-02 10:43:37 -0400 [Re: [BackupPC-users] Linux backups with rsync vs tar]:
> charlesboyo wrote on 08/31/2011 05:53:43 AM:
> [...]
> > Thus I have reason to suspect the rsync overhead as being guilty.

for the record, I've just (finally!) switched from tar to rsync for a data
server, and this significantly increased run times of backups. Incremental
backups are taking about three times as long as they used to (if memory
serves correctly). So, yes, rsync does have a significant overhead. That is
not surprising.

I should add that in my case the server is backing up local file systems (to
an iSCSI disk). The effect should be less significant if client and server are
not the same machine (though it is a quad-core and the disk sets are
independent).

> > Note that I have disabled hard links,

???
What is that supposed to mean? You removed the rsync "-H" option?
When you're talking about the pool, "disabling hard links" sounds rather
troubling ;-).

> > implemented checksum caching, increased the block size to 512 KB and
> > enabled --whole-file, to no avail.

I don't think File::RsyncP supports changing block size (and probably not
--whole-file either).

> > 1. since over 90% of the files change every day and "incremental" 
> > backups involve transferring the whole file to the BackupPC server, 
> > won't it make better sense to just run a full backup everyday?

There is probably not much difference either way.

> [...]
> You may find that trading CPU performance for network performance may not 
> be a good trade in your case.

That is true, but there are other reasons for using rsync rather than tar:
exactness of backups, for one. tar incrementals don't reflect deleted files,
for instance. Though in *this* specific case that may not make much difference
(presuming I'm correct in assuming your mbox files are never deleted or
renamed, extracted from tar-/zip-files, etc., or at least that missing such
a change until the next full backup is unproblematic).

> The number one question I have is:  is this really a problem?  If you have 
> a backup window that allows this, I would not worry about it.  If you do 
> *not*, then rsync might not be for you.

That's exactly the point. In my case, it is *not* a problem, so I prefer more
accurate backups, even if the fulls take all of the night. Thank you for
reminding me to shift the full run to the weekend, which I'll do right now :-).

> 2) Les' point about the format of the files (one monolithic file for each 
> mailbox vs. one file per e-mail) is dead on.  That allows 99% of the files 
> to remain untouched once they're backed up *once*.  That will *vastly* 
> reduce the backup times.

... and pool storage requirements, and rsync will handle small files much
better. Sadly enough, there is still enough braindead software around that
doesn't support maildir format, even in the Unix world. Open Source probably
means that I should start hacking the $#@%volution sources ...

> > 2. from Pavel's questions, he observed that BackupPC is unable to 
> > recover from interrupted tar transfer. Such interruptions simply 
> > cannot happen in my case. Should I switch to tar?

Your situation is completely different from his - you're on a local network,
aren't you? You don't need the bandwidth reductions you gain from rsync - tar
should work fine for you.
*But* you should consider whether (incremental) tar backups will be
sufficiently accurate. Since you are transferring almost everything anyway,
you could even run only full tar backups.
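
If you do go that way, running only fulls can be arranged through the
scheduling periods in config.pl. A minimal sketch, assuming standard
BackupPC 3.x scheduling semantics (the period values are illustrative):

```perl
# Sketch for config.pl: daily full tar backups, no incrementals.
$Conf{XferMethod} = 'tar';
$Conf{FullPeriod} = 0.97;   # just under one day: a full is due every night
$Conf{IncrPeriod} = 0.97;   # a full is always due first, so incrementals
                            # are effectively never scheduled
```

Because BackupPC only starts an incremental when no full is due, keeping
both periods just under a day means every nightly run is a full.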

For the sake of completeness, I should mention that full backups always
rebuild the entire tree (in BackupPC storage). In the general case, this
can raise storage requirements (for directory entries and duplicates due to
exceeding HardLinkMax), but in your case I wouldn't expect much difference.
However, with tar, running only fulls would improve backup exactness.

> > And in the 
> > unlikely event that the transferred does get interrupted, what 
> > mechanisms do I need to implement to resume/recover from the failure?
> 
> To repeat another response:  restart the backup...

To expand on that: BackupPC does that automatically at the next wakeup, so you
don't really need to do anything (apart from having a reasonable
WakeupSchedule).
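
For instance, a schedule like the following (the hours are illustrative)
gives BackupPC several chances per night to retry an interrupted backup:

```perl
# Sketch for config.pl: hourly overnight wakeups (hours illustrative).
# At each wakeup BackupPC checks for due -- and interrupted -- backups.
$Conf{WakeupSchedule} = [22, 23, 0, 1, 2, 3, 4, 5];
```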

> > 3. What is the recommended process for switching from rsync to tar -

Change $Conf{XferMethod} to 'tar' :-). Add/rename the other settings as
needed.
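
A minimal sketch of that switch (the share path is hypothetical; reuse
whatever your RsyncShareName currently points at, and note the setting
names are from BackupPC 3.x):

```perl
# Sketch: switch a host from rsync to tar over ssh.
$Conf{XferMethod}   = 'tar';
$Conf{TarShareName} = ['/var/spool/mail'];   # hypothetical share path
```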

> > since the format/attributes are reportedly incompatible?

They're only slightly different. The only "problem" is that when switching
*from tar to rsync* (which you're not doing), rsync will re-transfer
everything, because the stored file attributes appear to have changed.

Re: [BackupPC-users] Linux backups with rsync vs tar

2011-09-02 Thread Timothy J Massey
"Jeffrey J. Kosowsky" wrote on 09/02/2011 02:37:31 PM:

> Timothy J Massey wrote at about 10:43:37 -0400 on Friday, September 2, 2011:
> 
>  > Your old backups should be 100% fine.  They will remain in the pool just
>  > fine, etc.  I do not believe that files transferred by rsync will pool
>  > with files transferred by tar (due to the attribute issue you mention);
>  > however, for you that's a moot point:  90% of your files don't pool,
>  > anyway.
> 
> Why do you think they won't pool?

You ignored the part where I said to take what I wrote with a grain of 
salt.  I was extrapolating from what the questioner said.  The point was 
that it DOES NOT MATTER in this case.

Having said that, thank you for the details, and for correcting the 
record.

>  > This is not a *bad* thing.  Every single one of my backup servers is based
>  > on BackupPC, and all but maybe 2 shares are backed up using rsync.  (The
>  > only exceptions I can think of are where I'm backing up data on a NAS, and
>  > I can't or won't run rsyncd on the NAS so I have to use SMB).  Whether
>  > it's an advantage or disadvantage, that's the setup I use.  I vastly
>  > prefer consistency over performance.  But I can live with 8 hour backup
>  > windows.
> 
> Why not run rsyncd on a NAS? It works fine and is reasonably fast even
> on low end arm-based devices with minimal memory (e.g., 64MB).

Because I do not control the firmware on the NAS devices, rooting the
device and adding random software does not appeal to me, there is (for my
purposes) zero downside to using SMB, and if I wanted to deal with a
random collection of vendor-supplied and custom-added code, I probably
would not have selected a NAS in the first place.

Timothy J. Massey

 
Out of the Box Solutions, Inc. 
Creative IT Solutions Made Simple!
http://www.OutOfTheBoxSolutions.com
tmas...@obscorp.com 
 
22108 Harper Ave.
St. Clair Shores, MI 48080
Office: (800)750-4OBS (4627)
Cell: (586)945-8796 
--
Special Offer -- Download ArcSight Logger for FREE!
Finally, a world-class log management solution at an even better 
price-free! And you'll get a free "Love Thy Logs" t-shirt when you
download Logger. Secure your free ArcSight Logger TODAY!
http://p.sf.net/sfu/arcsisghtdev2dev
___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/


Re: [BackupPC-users] Linux backups with rsync vs tar

2011-09-02 Thread Jeffrey J. Kosowsky
Timothy J Massey wrote at about 10:43:37 -0400 on Friday, September 2, 2011:
 
 > Your old backups should be 100% fine.  They will remain in the pool just 
 > fine, etc.  I do not believe that files transferred by rsync will pool 
 > with files transferred by tar (due to the attribute issue you mention); 
 > however, for you that's a moot point:  90% of your files don't pool, 
 > anyway.

Why do you think they won't pool?
Pooling is based on file *content*. "Attributes" are stored in
separate 'attrib' files. Even so, I'm not sure why the basic file
attributes would be different between rsync and tar -- but even if
they were, it would only mean that the new attrib files wouldn't pool
with old attrib files, and that's typically a small proportion of the
pool by volume.

The only issue I can imagine is with rsync checksums. I'm not sure
what happens with such files when you move from rsync to tar. I would
hope that it would still pool them properly either by ignoring or
deleting the checksums at the end of the file. Again, the actual file
contents (which don't include the checksums obviously) are the same
between rsync and tar.
 > 
 > This is not a *bad* thing.  Every single one of my backup servers is based 
 > on BackupPC, and all but maybe 2 shares are backed up using rsync.  (The 
 > only exceptions I can think of are where I'm backing up data on a NAS, and 
 > I can't or won't run rsyncd on the NAS so I have to use SMB).  Whether 
 > it's an advantage or disadvantage, that's the setup I use.  I vastly 
 > prefer consistency over performance.  But I can live with 8 hour backup 
 > windows.

Why not run rsyncd on a NAS? It works fine and is reasonably fast even
on low end arm-based devices with minimal memory (e.g., 64MB).



Re: [BackupPC-users] Linux backups with rsync vs tar

2011-09-02 Thread Timothy J Massey
charlesboyo wrote on 08/31/2011 05:53:43 AM:

> I'm using BackupPC to take daily backups of a maildir totaling 250 
> GB with average file sizes of 500 MB (text mailboxes, these files 
> change everyday).
> Currently, my setup take full backups once a week and incremental 
> backups every day between the full backups. The servers are directly
> connected with a cross-cable, allowing 100 Mbps.

I have a very similar setup with several servers.  They are often 
connected using 100Mb/s just because the clients haven't upgraded to Gb 
switches.  Also, they back up IBM Lotus Domino servers.  In Domino, each
mail user has their own mail database which is typically gigabytes in size
(except with this thing called DAOS, but even then they're still hundreds
of MB big).  This is pretty comparable to your environment, though my
*total* size is not usually 250GB of just mail data...  I have file
servers that are bigger, but not mail servers.

(I have some servers that back up Microsoft Exchange servers.  This is 
even worse:  one monolithic file for the *ENTIRE* mailstore.  U G L Y... 
And incrementals *ARE* fulls!  :) )

> However, these backups take about 8 hours to complete, averaging 8 
> Mbps and the BackupPC server is CPU-bound through-out the entire 
> process.

Fulls or incrementals or both?  If truly 90% of your files are changing 
daily, I'm going to assume both.  There will be *very* little difference 
between a full backup and an incremental.

> Thus I have reason to suspect the rsync overhead as being guilty.
> Note that I have disabled hard links, implemented checksum caching, 
> increased the block size to 512 KB and enabled --whole-file, to no avail.

I have done zero tuning of the rsync command:  I use 100% stock BackupPC 
command line for it.

> 1. since over 90% of the files change every day and "incremental" 
> backups involve transferring the whole file to the BackupPC server, 
> won't it make better sense to just run a full backup everyday?

Incremental backups end up with a whole new file, but when using rsync it 
does not do it by transferring the whole file.  The rsync protocol works 
on sending just the changed parts of the file.  HOWEVER, the whole file is 
read on *BOTH* ends of the connection, so it doesn't save you a *BIT* of 
disk I/O:  it only saves you NETWORK I/O.  Seeing as you have only 100Mb/s
between them, that will improve performance, but not dramatically, and as
you have found, it exacts a CPU hit in order to do this.
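
For reference, the checksum caching the original poster mentions is, in
BackupPC 3.x, enabled by appending a fixed checksum seed to the rsync
argument lists in config.pl. A sketch, using the seed value the BackupPC
documentation gives:

```perl
# Sketch for config.pl: enable rsync checksum caching (BackupPC 3.x).
# 32761 is the fixed seed value from the BackupPC documentation.
push @{$Conf{RsyncArgs}},        '--checksum-seed=32761';
push @{$Conf{RsyncRestoreArgs}}, '--checksum-seed=32761';
```

With the seed in place, block and file checksums are cached in the pool
files after the second full, saving server-side CPU on later runs.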

You may find that trading CPU performance for network performance may not 
be a good trade in your case.  Having said that, I run BackupPC on about 
the slowest systems you can actually buy new today:  VIA EPIA EN 1500
system boards with 512MB RAM.  Terrible performance, but they meet my
BackupPC needs just *fine*.

Hard numbers on the nearest Domino server to me:  60GB total backed up for 
full, 18GB for incremental (this is a DAOS server).  Fulls take about 150 
minutes, incrementals take about 40.  1/4 the data, 1/4 the time.  And 
that's on the miserable hardware I described.

Scaling that up to your sizes, that would take about 600 minutes, or 10 
hours.  So, the 8 hours that you're seeing sounds reasonable.

The number one question I have is:  is this really a problem?  If you have 
a backup window that allows this, I would not worry about it.  If you do 
*not*, then rsync might not be for you.

To address a couple of things said in other replies:

1) Avoiding building a file list is pointless.  It takes my servers just a 
couple of minutes.  It may certainly use RAM, but that is only an issue if 
you have millions of files.  And in that case, simply add more RAM.  I'm a 
glutton for punishment running with 512MB of RAM (and actually, I use 2GB 
in new servers now:  I just like to twist Les' tail!  :) ).

2) Les' point about the format of the files (one monolithic file for each 
mailbox vs. one file per e-mail) is dead on.  That allows 99% of the files 
to remain untouched once they're backed up *once*.  That will *vastly* 
reduce the backup times.  (That DAOS thing does a similar thing for Domino 
by breaking out attachments into individual files, and hashing and pooling 
them in a manner very similarly to a BackupPC pool, BTW.  Before DAOS, my 
fulls and incrementals were indistinguishable, now they're 4:1 size-wise. 
Plus a 50% reduction in total disk usage.  But I digress.)

However, be aware that now you substitute the "my backups are taking a 
long time and don't pool" problem with a "now I have to manage several 
*MILLION* files!" problem.  fsck can become a major issue in that 
case--with 250GB of e-mail, even ls can be a major issue!  Both have 
advantages and disadvantages.  Just be aware that it's not a clear win 
either way.

And you might not have a choice, making the argument moot.


Now, for tar.  Take my information with a grain of salt:  I have *never* 
run tar with BackupPC...

> 2. from Pavel's questions, he observed that BackupPC is unable to recover 
> from interrupted tar transfer. Such interruptions simply cannot happen in 
> my case. Should I switch to tar?

To repeat another response:  restart the backup...

Re: [BackupPC-users] Linux backups with rsync vs tar

2011-08-31 Thread Les Mikesell
On Wed, Aug 31, 2011 at 4:53 AM, charlesboyo wrote:
>
> I'm using BackupPC to take daily backups of a maildir totaling 250 GB with 
> average file sizes of 500 MB (text mailboxes, these files change everyday).

'Maildir' usually refers to a format where each message is in its own
file.  However, this sounds like a directory of mailbox format files
where the file consists of many messages appended together and
modified for every change.   Maildir format is much more 'backup
friendly' because older messages don't change that often.

> However, these backups take about 8 hours to complete, averaging 8 Mbps and 
> the BackupPC server is CPU-bound through-out the entire process. Thus I have 
> reason to suspect the rsync overhead as being guilty.
> Note that I have disabled hard links, implemented checksum caching, increased 
> the block size to 512 KB and enabled --whole-file, to no avail.

Rsync isn't that great with large files that change.  Normally it will
copy parts of the file from the previous version to merge with the
changes being sent, resulting in a lot of extra disk traffic (and
Linux normally reports disk wait time as cpu time).   The --whole-file
option should change that behavior, but I'm not sure how it is
implemented in BackupPC's version of rsync.

> With this background, I will appreciate answers to the following questions:
>
> 1. since over 90% of the files change every day and "incremental" backups 
> involve transferring the whole file to the BackupPC server, won't it make 
> better sense to just run a full backup everyday?

Changing to maildir format storage might change that.   'Full' backups
with rsync tend to be slow because it is still going to read the whole
directory on the target and it will do a full read on all of the files
that are still in common with the previous run to do a checksum
verification.

> 2. from Pavel's questions, he observed that BackupPC is unable to recover 
> from interrupted tar transfer. Such interruptions simply cannot happen in my 
> case. Should I switch to tar? And in the unlikely event that the transferred 
> does get interrupted, what mechanisms do I need to implement to 
> resume/recover from the failure?

Yes, tar will be faster for mailbox format where the files are all
going to be changed.  Recovery is only an issue where bandwidth limits
make the time a problem.  If you have a problem, your recovery is to
just do another run.

> 3. What is the recommended process for switching from rsync to tar - since 
> the format/attributes are reportedly incompatible? I would like to preserve 
> existing compressed backups as much as possible.

Not sure about that.  With mailbox format you aren't going to have
much pooling anyway.

-- 
  Les Mikesell
lesmikes...@gmail.com



Re: [BackupPC-users] Linux backups with rsync vs tar

2011-08-31 Thread Sabuj Pattanayek
tar is faster since it doesn't spend hours building a file list when
there are thousands or millions of files involved.



Re: [BackupPC-users] Linux backups with rsync vs tar

2011-08-31 Thread Carl Wilhelm Soderstrom
On 08/31 02:53, charlesboyo wrote:
> 1. since over 90% of the files change every day and "incremental" backups 
> involve transferring the whole file to the BackupPC server, won't it make 
> better sense to just run a full backup everyday?
> 2. from Pavel's questions, he observed that BackupPC is unable to recover 
> from interrupted tar transfer. Such interruptions simply cannot happen in my 
> case. Should I switch to tar? And in the unlikely event that the transferred 
> does get interrupted, what mechanisms do I need to implement to 
> resume/recover from the failure?


Tar can often give several times greater performance than rsync, if most of
the files are being transferred. In your case, I would suggest experimenting
with it and seeing how much of a difference it makes.

As for recovering from tar errors: I would think it just involves re-running
the backup.

As for how to make the change: just edit your per-host config file
(/etc/backuppc/<hostname>.pl). I don't know how extensive the compatibility
problems are; some experimentation may be in order.
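
A hedged sketch of such a per-host override (the hostname and share path
are hypothetical; settings in this file shadow the global config.pl for
this host only):

```perl
# Sketch: /etc/backuppc/mailhost.pl -- per-host override (name hypothetical).
$Conf{XferMethod}   = 'tar';
$Conf{TarShareName} = ['/var/spool/mail'];   # hypothetical share path
```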

-- 
Carl Soderstrom
Systems Administrator
Real-Time Enterprises
www.real-time.com
