I've heard that there's a new rsyncd client in the works, which might
alleviate the frequent and severe problems encountered with rsync and high
memory consumption on clients with many files. The "solution" now seems to
be to split the backup into numerous rsync sets, with each one consuming
less memory.

I've got another suggestion for dealing with the problem, which would
work with the stock rsync client (and potentially be applicable to tar
and samba backups as well). The suggestion is to dynamically build
"include" and "exclude" lists to pass to rsync, and to make multiple
(serial) calls to rsync to back up all the files.
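
To make the serial-call idea concrete, here's a minimal sketch of the
driver loop (in Python rather than BackupPC's Perl; the function name
and the choice of rsync options are mine, purely for illustration).
The per-set filter lists it consumes would come from the steps
described below:

        import subprocess

        def run_backup_sets(host, dest_dir, filter_sets):
            """Run one rsync per backup set, serially.  Each entry
            in filter_sets is the list of --include/--exclude
            options that selects that set's directories."""
            for filters in filter_sets:
                cmd = (["rsync", "-a", "--numeric-ids"]
                       + filters + [host + ":/", dest_dir])
                subprocess.check_call(cmd)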

In detail:

        Before doing the backup, build a data structure representing all
        the directories in the filesystem, and the number of files per
        directory. In building the data structure, any directories
        specified in the configuration file "exclude" list would be
        skipped. Then, apply an algorithm to the data, taking into
        account:
                the amount of memory on the server
                the amount of free memory on the server
                the number of directories
                the number of files
        to try to roughly split all the directories into sets, sized
        based on the amount of memory in the server. The algorithm
        should be weighted to group directories as high "up" in the
        tree as possible. For example, it's better to back up all of
        "/var" than to combine "/var/spool/cron" and "/usr/local/src"
        in one set and the remainder of "/var" in another backup set,
        even if doing all of "/var" means slightly more files (and more
        memory usage) than the alternative.
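
        As a concrete (if simplistic) illustration, here's how the
        scan-and-split step might look. This is Python rather than
        BackupPC's Perl, the names are made up, and the per-set file
        budget would come from the memory heuristics above (rsync's
        per-file memory cost is often quoted at roughly 100 bytes,
        but that's exactly the kind of constant the logging discussed
        below should pin down):

            import os

            def count_files(root, excludes=()):
                """Walk the tree once; return {dir: recursive
                file count}, skipping configured excludes."""
                local = {}
                for dirpath, dirnames, files in os.walk(root):
                    dirnames[:] = [
                        d for d in dirnames
                        if os.path.join(dirpath, d) not in excludes]
                    local[dirpath] = len(files)
                # roll counts up so each directory's total
                # includes all of its subdirectories
                total = dict(local)
                for d in sorted(local, key=len, reverse=True):
                    parent = os.path.dirname(d)
                    if parent != d and parent in total:
                        total[parent] += total[d]
                return total

            def build_backup_sets(total, root, files_per_set):
                """Greedy split, keeping subtrees whole (i.e.
                grouped as high up the tree as possible) unless
                they exceed the per-set budget."""
                children = {}
                for d in total:
                    if d != root:
                        children.setdefault(
                            os.path.dirname(d), []).append(d)
                sets, cur, cur_n = [], [], 0
                stack = [root]
                while stack:
                    d = stack.pop()
                    if total[d] > files_per_set and children.get(d):
                        # too big to keep whole: split into children
                        # (files directly inside d are ignored in
                        # this sketch; a real version must add them)
                        stack.extend(children[d])
                        continue
                    if cur and cur_n + total[d] > files_per_set:
                        sets.append(cur)       # start a new set
                        cur, cur_n = [], 0
                    cur.append(d)
                    cur_n += total[d]
                if cur:
                    sets.append(cur)
                return sets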

        In addition, the algorithm could be weighted to give a small
        preference to combining directory trees from different physical
        devices, in order to improve performance by reducing I/O wait
        times.
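
        Finding the physical device for each tree is cheap: st_dev
        from a stat() call distinguishes filesystems. For example,
        the set builder could round-robin across devices when
        ordering candidate trees (a made-up helper, purely to show
        the idea):

            import os
            from collections import defaultdict

            def interleave_by_device(dirs):
                """Order trees so consecutive entries tend to sit
                on different devices, spreading the I/O load."""
                by_dev = defaultdict(list)
                for d in dirs:
                    by_dev[os.stat(d).st_dev].append(d)
                ordered, queues = [], list(by_dev.values())
                while queues:
                    for q in list(queues):
                        ordered.append(q.pop(0))
                        if not q:
                            queues.remove(q)
                return ordered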

        The algorithm for determining the ideal number of files per
        backup (and, by implication, which directories will be grouped
        together) doesn't need to be very sophisticated. There's no
        need to turn this into a true "knapsack problem" and attempt
        to reach optimal backup sets, as long as there's a real
        improvement over backing up all files in a single rsync
        session. I think that putting fairly simple logging into
        BackupPC, to record available memory before a backup begins,
        the number of files backed up, and the time the backup takes
        (possibly scaling for the speed of the network interface),
        would generate enough data (across the diverse configurations
        of clients where BackupPC is installed) to get some very good
        empirical constants for the algorithm.
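
        The logging itself could be tiny. A sketch, assuming a Linux
        client where free memory can be read from /proc/meminfo (the
        log format is invented for illustration):

            import time

            def free_memory_kb():
                """Free + cache memory from /proc/meminfo."""
                kb = 0
                with open("/proc/meminfo") as f:
                    for line in f:
                        if line.split(":")[0] in ("MemFree",
                                                  "Cached"):
                            kb += int(line.split()[1])  # in kB
                return kb

            def log_backup_run(logfile, host, nfiles, run):
                """Record free memory, file count, and elapsed
                time for one rsync run."""
                free_kb = free_memory_kb()
                t0 = time.time()
                run()                  # execute the rsync call
                with open(logfile, "a") as f:
                    f.write("%s host=%s free_kb=%d files=%d "
                            "secs=%.1f\n"
                            % (time.strftime("%Y-%m-%d %H:%M:%S"),
                               host, free_kb, nfiles,
                               time.time() - t0))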

The time saved by doing smaller backups, which also reduce the impact
on both the backup client and server, should be far greater than the
time required to gather the filesystem data and build the set of
individual rsync commands (excludes and includes).
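
Building each rsync command is mostly a matter of emitting filter rules.
A sketch of that step; note that the "/***" pattern (a directory plus
everything beneath it) needs a reasonably recent rsync (roughly 2.6.7 or
later), and older versions would want the two-rule "/dir/" plus "/dir/**"
form instead:

        import os

        def rsync_filter_args(set_dirs):
            """Turn one backup set's directory list into rsync
            --include/--exclude options.  Every ancestor directory
            must be included (without its contents) or rsync will
            never descend far enough to reach the tree."""
            args, seen = [], set()
            for d in set_dirs:
                parts = d.strip("/").split("/")
                for i in range(1, len(parts)):
                    anc = "/" + "/".join(parts[:i]) + "/"
                    if anc not in seen:
                        seen.add(anc)
                        args.append("--include=" + anc)
                args.append("--include=/" + "/".join(parts) + "/***")
            args.append("--exclude=*")
            return args

        # rsync_filter_args(["/var", "/usr/local/src"]) yields:
        #   --include=/var/***  --include=/usr/
        #   --include=/usr/local/  --include=/usr/local/src/***
        #   --exclude=*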

Since BackupPC already does an excellent job of "filling" individual backups
into a single "virtual full" backup that can easily be browsed for restoring, 
it shouldn't matter to the users that the backup sets are dynamically 
generated, as long as the user-specified includes and excludes are obeyed.

This scheme offers several advantages:

        It's dynamic: it will automatically adjust for changes in the
        filesystem layout or the number of files, for the amount of
        physical memory, and even for the load (free memory) on the
        backup client and server.

        It's maintenance-free on the part of users. There's no need to create 
        multiple rsync "targets", making sure that they are identically named 
        on the client and server, and trying to balance backup sizes per target.

        It would work with existing implementations of rsync, so any issues 
        with the future rsync daemon that's supposed to be embedded in BackupPC
        would be avoided. Similarly, the dynamic backup set partitioning could
        also be applied to the embedded rsync daemon, when it's ready.

I'm very interested in hearing your feedback about this proposal.
         
Mark


----
Mark Bergman
[EMAIL PROTECTED]
Seeking a Unix or Linux sysadmin position local to Philadelphia or via 
telecommuting

http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40merctech.com



