Hi,

Jeffrey J. Kosowsky wrote on 2009-06-11 00:25:37 -0400 [Re: [BackupPC-users]
backup the backuppc pool with bacula]:
> Holger Parplies wrote at about 04:22:03 +0200 on Thursday, June 11, 2009:
> > Les Mikesell wrote on 2009-06-10 15:45:22 -0500 [Re: [BackupPC-users]
> > backup the backuppc pool with bacula]:
> > [...]
> > the file list [...] can and has been [optimized] in 3.0 (probably meaning
> > protocol version 30, i.e. rsync 3.x on both sides).
>
> Holger, I may be wrong here, but I think that you get the more
> efficient memory usage as long as both client & server are version >= 3.0
> even if protocol version is set to < 30 (which is true for BackupPC
> where it defaults back to version 28).
Firstly, it's *not* true: BackupPC (as the client-side rsync) is not version
>= 3.0. It's not even really rsync at all, and I doubt File::RsyncP is more
memory efficient than rsync, even if the core code is in C and copied from
rsync. Secondly, I'm *guessing* that an incremental file list would need a
protocol modification. As I understand it, instead of one big file list
comparison done before the transfer, 3.0 does partial file list comparisons
during the transfer (otherwise it would need to traverse the file tree at
least twice, which is something you'd normally avoid). That would clearly
require a protocol change, wouldn't it? Actually, I would think that rsync
< 3.0 *does* need to traverse the file tree twice, so the change might even
have been made out of the wish to speed up the transfer rather than to
decrease the file list size (it does both, of course, as well as making
better use of network bandwidth by starting the transfer earlier and
allowing more parallelism between network I/O and disk I/O - presuming my
assumptions are correct).

> But I'm not an expert and my understanding is that the protocols themselves
> are not well documented other than looking through the source code.

Neither am I. I admit that I haven't even looked for documentation (or at
the source code). It just seems logical to implement it that way. I can't
rule out that the optimization would be possible with the older protocol
versions, but then, why wouldn't rsync always have operated that way?

> > > > and how the rest of the community deals with getting pools of
> > > > 100+GB offsite in less than a week of transfer time.
> > >
> > > 100 Gigs might be feasible - it depends more on the file sizes and how
> > > many directory entries you have, though. And you might have to make the
> > > first copy on-site so subsequently you only have to transfer the
> > > changes.
> >
> > Does anyone actually have experience with rsyncing an existing pool to an
> > existing copy (as in: verification of obtaining a correct result)? I'm
> > kind of sceptical that pool chain renumbering will be handled correctly.
> > At least, it seems extremely complicated to get right.
>
> Why wouldn't rsync -H handle this correctly?

I'm not saying it doesn't. I'm saying it's complicated, and I'm asking
whether anyone has actually verified that it does. I'm asking because it's
an extremely rare corner case that the developers may not have had in mind
and thus may not have tested. The massive use of hardlinks in a BackupPC
pool is clearly something they did not anticipate (or, at least, feel the
need to implement a solution for). There might be problems that appear only
in conjunction with massive counts of inodes with nlinks > 1. In another
thread, an issue was described that *could* have been caused by this *not*
working as expected (maybe crashing rather than doing something wrong, I'm
not sure). It's unclear at the moment, and I'd like to be able to rule it
out on the basis of something more than "it should work, so it probably
does". I'm also saying that pool backups are important enough to verify the
contents by looking closely at the corner cases we are aware of.

> And the renumbering will change the timestamps which should alert rsync to
> all the changes even without the --checksum flag.

This part I'm not sure about. Is it actually *guaranteed* that a rename(2)
must be implemented in terms of unlink(2) and link(2) (but atomically),
i.e. that it must modify the inode change time?
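It's easy enough to check what any particular filesystem actually does,
though. A minimal Python sketch (file names are arbitrary; it tests whether
renaming one link of an inode with nlinks > 1 - which is essentially what
pool chain renumbering does - updates the inode change time):

    import os, tempfile, time

    d = tempfile.mkdtemp()
    a = os.path.join(d, "a")
    b = os.path.join(d, "b")
    with open(a, "w") as f:
        f.write("pool data\n")
    os.link(a, b)                       # second hardlink, as in a pool chain

    before = os.stat(a).st_ctime
    time.sleep(1.1)                     # so a ctime bump is visible even at
                                        # one-second timestamp resolution
    os.rename(b, os.path.join(d, "c"))  # simulates pool chain renumbering
    after = os.stat(a).st_ctime

    print("ctime changed" if after != before else "ctime unchanged")

Of course, whatever that prints only describes the filesystem it ran on; it
doesn't answer the question of what is *guaranteed*.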
The inode is not really changed, except for the side effect of (atomically)
decrementing and re-incrementing the link count. By virtue of the operation
being atomic, the link count is *guaranteed* not to change, so I, were I to
implement a file system, would feel free to optimize the inode change away
(or simply not implement rename() in terms of unlink() and link()), unless
it is documented somewhere that updating the inode change time is mandatory
(though it really is *not* an inode change, so I don't see why it should
be).

Does rsync even act on the inode change time? File modification time will
be unchanged, obviously. rsync's focus is on the file contents and,
optionally, on keeping the attributes in sync (as far as it can). ctime is
an indication that attributes have been changed (which may mask a content
change), but attributes are compared "in full" anyway (if requested),
aren't they?

Either way, if rsync is aware of the change, it will work (rsync should
simply need to delete the target and re-link according to its inode map,
just as if the link had not been there in the first place). If not, rsync
would need to keep and check a mapping {source inode number -> dest inode
number} (for all files with nlinks > 1) to find out whether all links still
reference the same inode. That is a closer examination than is done for
single-link files without --checksum, and a rather expensive one. I'm not
saying this doesn't happen - I didn't check the source code. It would make
sense for '-H' to add this check.

> Or are you saying it would be difficult to do this manually with a
> special purpose algorithm that tries to just track changes to the pool
> and pc files?

I haven't given that topic much thought. The advantage of a special-purpose
algorithm is that we can make assumptions about the data we are dealing
with. We shouldn't do that unnecessarily, but if it has notable advantages,
then why not? "Difficult" isn't really the point. The question is whether
it can be done efficiently.

> More generally, I think we really need to find a guinea pig to spend
> some time testing the methods that you and I have discussed of
> creating a sorted inode database of the pool.

Yes, and we need to think about how to *verify* such a copy. A verification
tool would also answer my question above (see the sketch below). The
algorithm for creating the initial copy is not complex, so testing some
sample cases might be sufficient. I expect incremental updates to make the
situation far more difficult. It could be difficult to even imagine which
cases could go wrong, so it would be nice to have a tool that fully
verifies that content and hardlink relationships in a pool copy match the
original.

> Then it would be
> instructive to compare execution times vs the straight rsync -H method
> and vs. the tar method. For small pools, I imagine rsync -H would be
> faster, but at some point the database would presumably be
> faster. Presumably the tar method would be slowest of all. The devil
> of course is in the details.

I agree. But the important point is scalability rather than speed. We need
something that will continue to work regardless of pool size. You can still
use rsync on small pools and switch at an appropriate time (i.e. before a
failing rsync update breaks your copy, even if the "database version", as
you call it, is still somewhat slower).
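To make the verification idea concrete, here is a rough sketch of what such
a tool could look like (the names and structure are made up, not an
existing tool, and the in-memory dict would have to become the sorted
on-disk inode database you mention before it would cope with a pool of any
real size):

    import os, sys
    from collections import defaultdict

    def inode_groups(top):
        """Map inode number -> list of paths relative to top."""
        groups = defaultdict(list)
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                groups[os.lstat(path).st_ino].append(
                    os.path.relpath(path, top))
        return groups

    def verify(orig, copy):
        ok = True
        for ino, paths in inode_groups(orig).items():
            # All paths sharing one inode in the original must resolve to
            # exactly one inode in the copy, or a hardlink group was split.
            copy_inos = set()
            for rel in paths:
                target = os.path.join(copy, rel)
                if not os.path.lexists(target):
                    print("missing in copy: %s" % rel)
                    ok = False
                    continue
                copy_inos.add(os.lstat(target).st_ino)
            if len(copy_inos) > 1:
                print("hardlink group split in copy: %s" % ", ".join(paths))
                ok = False
        return ok

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            sys.exit("usage: verify_pool_copy.py ORIGINAL COPY")
        sys.exit(0 if verify(sys.argv[1], sys.argv[2]) else 1)

A complete tool would also have to check the reverse direction (two
distinct inodes in the original must not collapse into one in the copy) and
compare file contents; the above only demonstrates the hardlink-preservation
half.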
> Either way, this issue seems to be becoming a true FAQ for this list --

Always has been ;-).

> so we should probably agree on some definitive answer (or set of
> answers) so that we can put this one to rest.

Definitely. Somehow I still see people giving different answers and
restarting the discussion all over again ;-).

> My personal belief is that while disk images or ZFS may be the "ideal"
> answer, there still is a need for an alternative even if slower method
> for reliably backing up (and ideally incrementally synching) just
> $topdir for those who don't/can't back up the whole partition or who
> can't run ZFS. My understanding is that the simple answer of "rsync -H"
> seems to not be reliable enough on large pools at least for some
> people.

In addition, there are cases where the "copy" is to be stored on something
that doesn't support hardlinks. As long as the "copy" doesn't have to be
functional (but rather allow re-creating a functional pool), that is no
problem. It is not difficult to accommodate this case - at least for the
initial copy - if we have it in mind from the start. We just need to split
the copy operation into a "send" and a "receive" part (like 'tar -c' and
'tar -x') which can be plugged together for a straight copy or generate an
easily storable intermediate result. Incrementals might be harder, but we
should at least look into it.

Furthermore, I'd like to keep pool merging in mind. If we had a way to copy
a pool into a pre-existing *different* pool, that would be great. And it
really doesn't seem hard either, if we use PoolWrite() instead of
File::Copy (well, there might be some details to figure out, and it might
be easier to make use of the already known BackupPC hash and simply handle
collisions the way PoolWrite() would). It may completely conflict with
incremental updates, though. Or incrementals might be new pc/ file trees
(based on timestamps) that are merged into a pre-existing pool copy? Hmm
... there's potential there. Generate a list of pool files and *some* pc/
directories, based on timestamp, instead of attempting to handle the whole
structure. That would miss changes to existing backups (like deleting
individual files), but BackupPC doesn't really endorse changing existing
backups, does it? ;-)

Regards,
Holger

P.S.: I won't find any more time until at least Sunday, so please excuse me
for not responding until then.