I'm considering using rsync in our data center, but I'm worried about whether it will scale to the numbers and sizes we deal with. A typical sync would move up to a terabyte, consisting of about a million files. Our data mover machines are RedHat Linux Advanced Server 2.1, and all the sources and destinations are NFS mounts; the data lives on big NFS file servers.

The destination will typically be empty, so rsync will have to copy everything. The copy takes many hours and often gets interrupted by an outage. When that happens, the operator should be able to restart the process and have it resume where it left off. The current, less-than-desirable, method uses tar; after an outage, everything has to be copied again. I'm hoping rsync can avoid this and pick up where it left off.

There are really two scaling problems here:

1) Number and size of files - What are the theoretical limits in rsync? What are the demonstrated maxima?

2) Performance - The current tar-based method breaks the mount points down into (a few dozen) subdirectories and runs multiple tar processes. This does a much better job of keeping the GigE pipes full than a single process and lets the load spread over the 4 CPUs in the Linux box. Is there a better way to do this with rsync, or would we do the same thing and generate one rsync call for each subdirectory (roughly along the lines of the sketch below)? A major drawback of the subdirectory approach is that tuning to find the optimum number of copy processes is almost impossible. Is anyone looking at multithreading rsync so it can copy many files at once and get more CPU utilization out of a multi-CPU machine?

We're moving about 10 terabytes a week (and rising), so whatever we use has to keep those GigE pipes full.
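For concreteness, here is roughly what I'm imagining for the per-subdirectory approach - just a rough Python sketch, untested at our volumes; the paths, the worker count, and the particular rsync flags are placeholders for whatever turns out to be right:

  #!/usr/bin/env python
  # Rough sketch: run one rsync per top-level subdirectory, with a
  # bounded number of workers so the number of simultaneous copies
  # can be tuned.  SRC, DST, and NUM_WORKERS are placeholders.
  import os
  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  SRC = "/mnt/source"        # NFS-mounted source tree (placeholder)
  DST = "/mnt/destination"   # NFS-mounted destination (placeholder)
  NUM_WORKERS = 4            # the knob to tune: rsyncs running at once

  def copy_subdir(name):
      # -a preserves permissions, times, etc.
      # --partial keeps partially transferred files so a restart can
      # resume them instead of copying from scratch.
      src = os.path.join(SRC, name) + "/"
      dst = os.path.join(DST, name) + "/"
      return subprocess.call(["rsync", "-a", "--partial", src, dst])

  def main():
      subdirs = sorted(
          d for d in os.listdir(SRC)
          if os.path.isdir(os.path.join(SRC, d))
      )
      with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
          results = list(pool.map(copy_subdir, subdirs))
      failed = [d for d, rc in zip(subdirs, results) if rc != 0]
      if failed:
          print("rsync failed for:", ", ".join(failed))

  if __name__ == "__main__":
      main()

The attraction is that NUM_WORKERS becomes a single knob, but I still don't see how to pick a good value short of trial and error, which is why a multithreaded rsync would be more appealing.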
Thanks,
Bret