On 13/03/2011, at 02:10, "Jeffrey J. Kosowsky" <backu...@kosowsky.org> wrote:

> Cesar Kawar wrote at about 23:07:53 +0100 on Friday, March 11, 2011:
>> On 11/03/2011, at 21:13, Jeffrey J. Kosowsky wrote:
>>> Cesar Kawar wrote at about 18:27:34 +0100 on Friday, March 11, 2011:
>>>> On 11/03/2011, at 14:59, Jeffrey J. Kosowsky wrote:
>>>>> Cesar Kawar wrote at about 10:08:10 +0100 on Friday, March 11, 2011:
>> A 100Mbps NIC on a P4 is not going to be the problem here. In the case
>> presented, the second filesystem was attached through a USB interface.
>> I've always used SATA hard drives, and never had I/O constraints.
>> Someone ran the test himself; please check the link:
>> http://lwn.net/Articles/400489/
> Very interesting article -- though there they are talking about
> slowdown (and cpu consumption) that has nothing to do with hardlinks
> and seems to be caused by the block checksums that check to make sure
> the copy is accurate -- however, after the author optimized some stuff
> the penalty of rsync (without hardlinks) over straight cp was about
> 50% and he was able to get copy speeds of 85MB/sec.
> However, in the BackupPC case we are talking about effective copy
> speeds that are several orders of magnitude slower so the source of
> the problem must be something with the hard link tracking.
> Honestly, I am a bit confused because your ability to rsync a 1TB
> BackupPC archive in 2 hours seems to be at odds with the experience of
> just about everyone else that talks about rsyncs taking days or
> crashing on pools of just a few hundred gigabytes. And everybody else
> has talked about memory issues. Indeed, if a 1TB archive of 1 year of
> BackupPC data could be rsynced in 2 hours, I can almost guarantee that
> we never would have had hundreds of threads looking for better ways to
> backup a BackupPC archive. I would really love to understand why your
> experience seems to be so different from others.

As I said before, it didn't work for us either with versions prior to 3.0.2.
With 2.6.9 we had all the problems that have been described so many times: core
dumps, high memory utilization, etc. Basically, it didn't work.

I've been following this list for a long time, even though I never wrote to it
until now, and I've read about all the problems people were having replicating
the pool.

I even remember someone saying that by installing BackupPC on Nexenta or
OpenSolaris he'd been able to use the built-in ZFS block-level, rsync-like
functions to sync the pool to another ZFS-based machine on the network.

But all that was before rsync 3.0.2, which actually worked for us.

We even had 2 drives that we rotated every morning.

The most demanded resource on the machine during the sync process was CPU. I
don't have access to that machine and data anymore, as I said, but I'm sure
that most of you have the knowledge to do it with your installation and a
recent rsync version.

Could you please try it and confirm that? It will depend greatly on the machine
you are using; ours was a 4-core Xeon with 4 GB of RAM and 2 SATA HDs on a
software RAID-1. When rsyncing the pool, BackupPC was stopped, obviously, and no
other unneeded daemons or programs were running.

>>>> The amount of memory needed is much less important than the cpu
>>>> needed. Again, from rsync FAQ page:
>>>> "Rsync needs about 100 bytes to store all the relevant information
>>>> for one file, so (for example) a run with 800,000 files would
>>>> consume about 80M of memory. -H and --delete increase the memory
>>>> usage further."
>>> You need to re-read that CRITICAL last sentence. Rsyncing without hard
>>> links scales very nicely and indeed uses little memory and minimal
>>> cpu. Rsyncing with pool hard links uses *tons* of memory. Been there,
>>> done that!
>> At most, you'll need another 100 bytes per hard link. Even if you
>> have 10 hardlinks per file (that actually means 10 versions of the
>> same file) it would be 8,000,000 files to process, which makes
>> about 800 MB of memory. Still not an issue (at least, it hasn't
>> been an issue for me).
> I admit I don't understand how to reconcile your experience with
> others and with how rsync works. I mean people with several gigs of
> memory have run into problems with archives of a couple hundred gigs.
> On the other hand, 800,000 pool files with 10 links per file is a
> pretty small pool and represents also a pretty small set of
> backups. After just backing up a few home machines for a few weeks I
> had closer to a million pool files and 16 million pc files.
> When you do incrementals once a day and store backups going back
> several months (even with some type of exponential paring), you can
> quite quickly have dozens of copies of each file per machine and if
> each machine has O(500,000) files then you are quickly talking about
> 10's of millions of total pc file counts with even a small number of
> machines. Even with 100 bytes per file, you quickly get to a point
> where even 8GB is not enough of memory.
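The memory figures traded above can be checked with a rough sketch. It assumes only the FAQ's "about 100 bytes" per file-list entry; the FAQ itself says -H and --delete add more, by an amount it does not state, so these are lower bounds. The function names are made up for illustration.

```python
# Rough lower-bound estimate of rsync file-list memory, assuming the
# rsync FAQ's figure of about 100 bytes per entry. The extra cost of
# -H and --delete is not quantified in the FAQ, so it is not modeled.
BYTES_PER_ENTRY = 100


def estimated_file_list_bytes(n_entries, bytes_per_entry=BYTES_PER_ENTRY):
    """Return the estimated file-list size in bytes."""
    return n_entries * bytes_per_entry


def human(n_bytes):
    """Format a byte count using decimal units (1 MB = 10**6 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


scenarios = {
    "FAQ example (800,000 files)": 800_000,
    "800,000 files x 10 links": 8_000_000,
    "16 million pc files": 16_000_000,
}
for label, count in scenarios.items():
    print(f"{label}: {human(estimated_file_list_bytes(count))}")
```

At 100 bytes per entry, the list alone passes 8GB at around 80 million entries, which is the "10's of millions" territory described above once any per-link overhead is added.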
>>> I'm surprised you could even rsync 1Tb of massively linked files in 2
>>> hours. Unless you have just a small number of large files.
>> Not 800,000. We had at that company over 2 billion files on our
>> fileserver. Most of them were .doc, .xls and the like.
> Again your experience seems so different from everybody else. 
> Was this a BackupPC archive? Because if so, even using the math of 100
> bytes per file, copying 2 billion files would require 200GB of RAM to
> create the hard link list and I can't imagine any normal server has
> that much RAM. 
> And if you really rsynced a 1TB archive containing 2 billion files in
> 2 hours then your experience is truly 100% different from other
> people. I mean that represents a raw speed of 138MB/sec which is
> orders of magnitude faster than other people's experience.
> On the other hand, if you are just talking about rsyncing 2 billion
> non-linked files then your experience is believable since without hard
> links there are no memory issues and I can believe that the cpu might
> be the rate limiting step given the need to calculate rolling
> checksums, particularly if the data hasn't changed much so most of the
> time is spent checking checksums rather than transferring data.
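The numbers in this exchange (1TB and 2 billion files in 2 hours, 100 bytes per file) can be reproduced with a few lines of arithmetic. This is only a sketch of the sums quoted above, using decimal units; the function names are invented for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the thread.
# Decimal units are assumed: 1 TB = 10**12 bytes, 1 GB = 10**9 bytes.

def throughput_mb_per_s(total_bytes, seconds):
    """Average copy rate in MB/sec."""
    return total_bytes / seconds / 1e6


def files_per_second(n_files, seconds):
    """Average number of files handled per second."""
    return n_files / seconds


def list_memory_gb(n_files, bytes_per_entry=100):
    """File-list memory in GB at the FAQ's ~100 bytes per entry."""
    return n_files * bytes_per_entry / 1e9


TWO_HOURS = 2 * 60 * 60  # 7200 seconds

print(f"{throughput_mb_per_s(1e12, TWO_HOURS):.1f} MB/sec")
print(f"{files_per_second(2_000_000_000, TWO_HOURS):,.0f} files/sec")
print(f"{list_memory_gb(2_000_000_000):.0f} GB of RAM for the file list")
```

Besides the roughly 138MB/sec and 200GB figures, the same inputs imply on the order of 278,000 files per second, which is another way to see why the claim is hard to square with other reports.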
>> BackupPC was running on a 4-core Xeon Dell PowerEdge 2900 II, with
>> two 500GB SATA hard drives on software RAID-1 and 4GB of RAM.
>> And when replicating the pools, the CPU was almost 100% used.
> Are you saying that you rsynced a BackupPC archive of 2 billion
> files in 2 hours with only 4GB of RAM??? And before you said 1TB, now
> it looks like your disk is only 500GB?
> Again your experience seems to be 100% at odds with, and orders of
> magnitude better than, anybody else's.
>>>> rsync is a really cpu expensive process. You can always use caching
>>>> for the md5 checksum process, but I wouldn't recommend that on an
>>>> off-site replicated backup. Caching introduces a small probability
>>>> of losing data, and that technique is already used when doing a
>>>> normal BackupPC backup with rsync transfer, so, if you then resync
>>>> that data to another drive, disk or filesystem of any kind, your
>>>> probability of losing data is a power of the original one.
>>> First, the cpu consumption (for BackupPC archives) is *not* in the
>>> md5sums but is in the hard linking (you can verify this by doing an
>>> rsync on the pool alone or rsyncing TopDir without the -H
>>> flag). Moreover, the cpu requirements for the rolling md5sum checksums
>>> are actually much less for BackupPC archives than for normal files
>>> since you actually rarely need to do the "rolling" part which is the
>>> actual cpu-intensive part. This is because you only do rolling when
>>> files change and pool files only change in the relatively rare event
>>> of chain renumbering plus in the case of the rsync method with checksum
>>> caching in the one-time-only event when digests are added (but this
>>> only affects the first and last blocks).
>> I did not talk about what BackupPC does. I was just saying that
>> replicating a BackupPC pool to another filesystem is a very CPU
>> intensive task.
> I wasn't talking about what BackupPC does either. I was saying that by
> nature of the structure of BackupPC archives, rsync should require a
> lot less cpu power to copy them than if they were files that changed a
> lot. Much of the cpu consumption of at least non-hard link rsync is
> due to aligning rolling checksums but if the files don't change then
> there is no need for that cpu power and if the timestamps, perms, size
> etc. don't change then even regular checksums aren't done.
>>> So, to the extent that you are cpu-limited, the problem is not with
>>> md5sums but with hard links which requires both memory to store the
>>> hard link list (which is limiting on many machines) plus some cpu
>>> intensity to search the list -- specifically rsync requires that for
>>> each hard linked file (which for BackupPC is *every* file), you need
>>> to do a binary search of the hard link list (which in BackupPC is
>>> every file). Also, I imagine that rsync was not optimized for the
>>> extreme edge case represented by BackupPC archives where (just about)
>>> *every* non-zero length file is hard linked. 
>>> The bottom line is that checksum caching is unlikely to have any
>>> significant effect.
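To make the hard-link bookkeeping discussed above concrete, here is a minimal sketch of what any tool must track in order to preserve hard links: every path whose inode has more than one link. It is not rsync's actual implementation (the thread describes rsync as using a list with binary search; this sketch uses a dictionary keyed by device and inode instead), and hardlink_groups is a made-up name.

```python
import os
import stat
from collections import defaultdict


def hardlink_groups(top):
    """Group regular files under `top` by (device, inode).

    Illustration only, not rsync's code: it shows that one entry must be
    held in memory for every multiply-linked file, which in a BackupPC
    archive means nearly every non-empty file.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only regular files with more than one link need tracking.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    return dict(groups)
```

Running it against a small test tree (not a production TopDir) and counting the entries gives a feel for how quickly the table grows with the number of pc files.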
>> So, if checksum caching has no impact, or only a small impact, on
>> performance, what's the reason to use it?
>> If you are right I will never use checksum caching again.
> Are you talking about checksum caching for BackupPC or checksum
> caching for rsync itself (which I think requires a patched version of
> rsync and is non-standard)?
> If you are talking about checksum caching within BackupPC itself, it
> definitely can have significant benefits for full backups where the
> actual files would otherwise have to be compared which would require
> decompressing cpool files (slow) and then calculating block checksums
> to compare the new file against the pool. So, checksum caching should
> be quite beneficial since then you avoid decompressing or even reading
> the full pool file but instead just need to read the appended
> digest. I have not however seen any specific benchmarks. It would seem
> that this benefit applies really only to files that are unchanged
> because if the file changes then you will have to decompress it anyway
> to find & align the changes and calculate the deltas (plus the
> blocksize might very well be different if the filesize changes).
> If you are talking about using checksum caching when manually rsyncing
> a BackupPC archive, then I don't think it will have much of an effect
> since the rate limiting step is tracking and resolving hard links (at
> least in the experience of seemingly everybody else).
>>> Second, regarding your concern of compounding checksum errors, a power
>>> of a small error is still small.  However, that is not even really the
>>> case here since the only thing one would need to worry about here is
>>> the false negative of having matching checksums but corrupted file
>>> data. But this error is not directly compounded by the BackupPC
>>> checksumming since it is an error in the data itself. (Note the other
>>> potential false negative of md5sum collisions in the block data is
>>> vanishingly small particularly given both block checksums and file
>>> checksums). False positives only at worst cause an extra rsync copy
>>> operation. 
>>> More generally, if you are truly worried about the compounding of
>>> small errors then by extension you should never be backing up
>>> archive backups. I mean any backup has some probability of error (due
>>> to disk errors, ram errors, etc.) so a backup of a backup then has a
>>> power of that original error.
>>>> Not recommended, I think.  I prefer to spend a little more money on
>>>> the machine once and not have surprises later on when the big boss
>>>> asks you to recover his files....
>>> If you worry about compounding of errors in backups then probably
>>> better to have two parallel BackupPC servers rather than backing up a
>>> backup -- since all errors compound and as above, I think a faulty
>>> checksum is not your most likely error source.
>> Well, it would be nice to be able to double every single hardware
>> resource in every company but most of the time you have a budget
>> and a boss...  Of course the safest option is to have 2 backup
>> systems. It is even safer to have two totally different approaches
>> to backing up your data. BackupPC is great but there could be a bug
>> that destroyed your pool while you were sleeping at home, and next
>> morning, your boss needs to recover a file and... voilà, there's no
>> backup!
> All I was saying is that the compounded errors of checksums is no
> different from the compounded errors inherent in taking backups of
> backups. So, as long as the probability of error introduced by
> checksums is the same order of magnitude (or less) than other errors,
> there is no reason to single out checksums as being an issue.
>> You'd better have 2 backup solutions, if you can... but in the
>> meanwhile, it's better to back up your backup than nothing, don't
>> you think so? And if I can do it with a lower probability of
>> failure, I will, all the more if that just means buying an extra
>> $50 hard drive.
> I'm not sure how big your setup is but on my SOHO setup, I use a $50
> external hard drive and a $100 plugcomputer as my second backup. Now
> that is way low end. But I'm sure for a few hundred total dollars you
> could get a second parallel backup going using a bare bones computer
> and a cheap drive. Frankly, you could use just about any old obsolete
> Pentium class PC. It doesn't need to be powerful since other than
> compressing and computing some md5sums, it doesn't really require much
> cpu power. And even if it is slower than your primary, you can run
> your secondary backups just once or twice a week if needed. So you
> could probably set that up for free with just scavenged hardware.
>>>> I don't have graphs, but the amount of memory available to any
>>>> recent computer is more than enough for rsync. Disk I/O is somewhat
>>>> important, and disk bandwidth is a constraint, but, cpu speed is
>>>> the more important thing in my tests.
>>> Interesting, based on my experience and the experience of most reports
>>> on this mailing list, memory is the main problem encountered. But
>>> perhaps if you have enough memory, the repeated binary search of the
>>> hard link list is the issue. Maybe rsync could be written better for
>>> this case to presort the file list by inode number or something like that.
>> Ok, hard link sorting and searching is important, very important,
>> and takes a lot of cpu time; I will not argue this. In earlier
>> versions of rsync, before 3.0.0, rsync even broke in the process.
>> But, in my experience, it didn't happen again after we moved to
>> 3.0.2. Maybe we were lucky, I don't know, but the truth is that
>> over the last year (2010), I used this technique to maintain 2
>> offsite copies of the backup.  I'm not working for that company
>> anymore since January, now I have my own company, so I don't have
>> access to those systems to take any measurements on them, I wish I
>> could.
>> As a bottom line, I don't care if it is because of checksums or
>> because of hardlinks, but rsync is a really CPU intensive task.
> My only interest is in figuring out how you have been successful with
> just a couple of GB of memory in situations where others have
> floundered. I would love it if we could all rsync BackupPC
> archives of 2 billion files in 2 hours. That would be awesome!
