Pieter Wuille wrote at about 02:52:14 +0200 on Thursday, June 4, 2009:
 > On Wed, Jun 03, 2009 at 07:36:22PM -0400, Jeffrey J. Kosowsky wrote:
 > > Holger Parplies wrote at about 23:45:35 +0200 on Wednesday, June 3, 2009:
 > > > Hi,
 > > >
 > > > Peter Walter wrote on 2009-06-03 16:15:37 -0400 [Re: [BackupPC-users] Backing up a BackupPC server]:
 > > > > [...]
 > > > > My understanding is that, if it were not for the hardlinks, rsync transfers to another server would be more feasible;
 > > >
 > > > right.
 > > >
 > > > > that processing the hardlinks requires significant cpu resources, memory resources, and that access times are very slow,
 > > >
 > > > Memory: yes. CPU: I don't think so. Access times very slow? Well, the inodes referenced from one directory are probably scattered all over the place, so traversing the file tree (e.g. "find $TopDir -ls") is probably slower than in "normal" directories. Or do you mean swapping slows down memory accesses by several orders of magnitude?
 > > >
 > > > > compared to processing ordinary files. Is my understanding correct? If so, then what I would think of doing is (a) shutting down backuppc (b) creating a "dump" file containing the hardlink metadata (c) backing up the pooled files and the dump file using rsync (d) restarting backuppc. I really don't need a live, working copy of the backuppc file system - just a way to recreate it from a backup if necessary, using an "undump" program that recreated the hardlinks from the dump file. Is this approach feasible?
 > > >
 > > > Yes. I'm just not certain how you would test it. You can undoubtedly restore your pool to a new location, but apart from browsing a few random files, how would you verify it? Maybe create a new "dump" and compare the two ...
 > > >
 > > > Have you got the resources to try this? I believe I've got most of the code we'd need. I'd just need to take it apart ...
 > > >
 > > Holger, one thing I don't understand is that if you create a dump table associating inodes with pool file hashes, aren't we back in the same situation as using rsync -H? I.e., for large pool sizes, the table ends up using up all memory and bleeding into swap, which means that lookups start taking forever, causing the system to thrash. Specifically, I would assume that rsync -H basically is constructing a similar table when it deals with hard links, though perhaps there are some savings in this case since we know something about the structure of the BackupPC file data -- i.e., we know that all the hard links have as one of their links a link to a pool file.
 > [...]
 > > This would allow the entire above algorithm to be done in O(m log m) time with the only memory-intensive steps being those required to sort the pool and pc tables. However, since sorting is a well-studied problem, we should be able to use memory-efficient algorithms for that.
 >
 > You didn't use the knowledge that the files in the pool have names that correspond (apart from a few hashchains) to the partial md5sums of the data in them, like BackupPC_tarPCcopy does. I've never used/tested this tool, but if I understand it correctly, it builds a tar file that contains symbolic hardlinks to the pool directory, instead of the actual data.
 > This, combined with a verbatim copy of the pool directory itself, should suffice to copy the entire topdir in O(m+n) time and O(1) memory (since a lookup of which pool file a certain hardlinked file in a pc/ dir points to can be done in O(1) time and space, except for a sporadic hash chain). In practice, however, doing the copy at the block level will be significantly faster still, because no continuous seeking is required.
Yeah, but that would require computing the partial md5sums, which is not a cheap operation: you need to read in and decompress the first 1MB of each non-zero-length file in the pc hierarchy and compute the md5sum over two 128KB blocks (and then, if there is a hash chain, compare inode numbers to find the right correspondence). Though I guess for a large enough pool and pc hierarchy, computing this in O(m+n) time would be faster than sorting in O(m log m + n log n) time. It would be interesting to know where the crossover is. (Rough sketches of the digest computation and of the sorted-table dump/undump idea are at the bottom of this message.)

 > > I would be curious to know how, in the real world, the time (and memory usage) to copy over a large (say multi-terabyte) BackupPC topdir varies for the following methods:
 > >
 > > 1. cp -ad
 > > 2. rsync -H
 > > 3. Copy using a single table of pool inode numbers
 > > 4. Copy using a sorted table of pool inode numbers and pc hierarchy inode numbers
 > Add:
 > 5. copy the pooldir and use tarPCcopy for the rest
 > 6. copy the blockdevice
 >
 > --
 > Pieter
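
For reference (and so people can check my understanding), here is roughly what that partial-md5sum computation looks like. This is an untested sketch from memory of the 3.x pooling scheme, not code lifted from BackupPC::Lib, so treat the details (the block sizes, the length being hashed in, the _0/_1 chain suffixes) as assumptions:

import hashlib
import os

def pool_digest(path):
    """Sketch of the partial-file MD5 digest that names pool files (BackupPC 3.x
    scheme, as I understand it): hash the uncompressed length, the first 128KB,
    and, for files over 256KB, a second 128KB block ending at the 1MB boundary
    (or at EOF for files under 1MB).  Real pool/pc files are compressed on disk,
    so you would decompress before hashing; colliding files get a _0, _1, ...
    suffix (a hash chain), which is where the inode comparison comes in."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    md5.update(str(size).encode())           # the length is part of the digest
    with open(path, "rb") as f:
        if size > 262144:
            md5.update(f.read(131072))       # first 128KB
            f.seek(min(size, 1048576) - 131072)
            md5.update(f.read(131072))       # 128KB ending at 1MB (or at EOF)
        else:
            md5.update(f.read())             # small file: hash the whole thing
    return md5.hexdigest()                   # pool file name, modulo hash chains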
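
And for anyone who wants to experiment with methods 3/4, here is a rough, untested sketch of the sort of dump I have in mind: walk the pool and the pc tree, sort both (inode, path) tables, and merge-join them so every pc file gets paired with its pool file. The cpool/pc layout and the /var/lib/backuppc default are assumptions about a stock $TopDir; for a multi-terabyte pool you would spill the tables to disk and use sort(1) rather than sorting in memory:

#!/usr/bin/env python3
# Untested sketch of the "sorted inode table" dump (method 4 above).
# Output: one "pc_path <TAB> pool_path" line per hardlinked pc file,
# which an "undump" step can replay with link() on the restored copy.

import os
import sys

def inode_table(topdir, subdir):
    """Collect (inode, path relative to $TopDir) for every file under topdir/subdir."""
    table = []
    root = os.path.join(topdir, subdir)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            table.append((os.lstat(path).st_ino, os.path.relpath(path, topdir)))
    # For a multi-TB pool you would write these out and use sort(1)
    # instead of sorting in memory.
    table.sort()
    return table

def main(topdir):
    pool = inode_table(topdir, "cpool")   # "pool" if compression is off
    pc = inode_table(topdir, "pc")
    # Merge-join the two sorted tables on inode number: O(m + n) after the sorts.
    i = 0
    for ino, pc_path in pc:
        while i < len(pool) and pool[i][0] < ino:
            i += 1
        if i < len(pool) and pool[i][0] == ino:
            print(pc_path + "\t" + pool[i][1])
        # pc files that are not pooled (link count 1, e.g. log files) simply
        # find no pool entry and are skipped.

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/var/lib/backuppc")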
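
The matching "undump" step on the destination (what Peter was asking about) would then be little more than replaying those pairs with link(), something like:

#!/usr/bin/env python3
# Hypothetical "undump" step: replay the dump above on a restored $TopDir,
# recreating the hardlinks that a plain (non -H) rsync did not preserve.

import os
import sys

def undump(topdir, dumpfile):
    with open(dumpfile) as f:
        for line in f:
            pc_rel, pool_rel = line.rstrip("\n").split("\t")
            pc_path = os.path.join(topdir, pc_rel)
            pool_path = os.path.join(topdir, pool_rel)
            os.makedirs(os.path.dirname(pc_path), exist_ok=True)
            if os.path.lexists(pc_path):   # replace any copied-over file
                os.unlink(pc_path)         # with a link to the pool file
            os.link(pool_path, pc_path)

if __name__ == "__main__":
    undump(sys.argv[1], sys.argv[2])       # e.g. undump /var/lib/backuppc links.dump

Again, completely untested, but something along these lines would make it easy to measure where the O(m+n) vs. O(m log m + n log n) crossover actually falls.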