[Bug 3099] Please parallelize filesystem scan
https://bugzilla.samba.org/show_bug.cgi?id=3099

--- Comment #8 from Chip Schweiss c...@innovates.com ---

I would argue that, optionally, all directory scanning should be made parallel. Modern file systems perform best when their request queues are kept full. The way rsync currently scans directories does nothing to take advantage of this.

I currently use scripts to split a couple dozen or so rsync jobs into literally hundreds of jobs. This reduces execution time from what would be days to a couple of hours every night. Lots of scripts like this are appearing on the net because the current state of rsync is inadequate.

This ticket could reasonably be combined with 5124.

--
You are receiving this mail because: You are the QA Contact for the bug.
--
Please use reply-all for most replies to avoid omitting the mailing list. To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
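The job-splitting approach described in Comment #8 might look roughly like the sketch below. Chip's actual scripts are not shown in the thread, so everything here is an assumption: the function names `build_jobs` and `run_jobs`, the split by top-level subdirectory, and the worker count are all hypothetical. It only uses the standard `rsync -a SRC DEST` form.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_jobs(src_root, dest_root, subdirs):
    """Return one rsync argv list per subdirectory (hypothetical split)."""
    jobs = []
    for name in subdirs:
        # Trailing slashes: copy the contents of src/name into dest/name.
        src = f"{src_root.rstrip('/')}/{name}/"
        dst = f"{dest_root.rstrip('/')}/{name}/"
        jobs.append(["rsync", "-a", src, dst])
    return jobs


def run_jobs(jobs, max_workers=8, runner=subprocess.run):
    """Run the jobs concurrently; return the runner's results in job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, jobs))


# Example: two jobs that would run side by side.
jobs = build_jobs("/data", "backup:/data", ["projects", "home"])
```

Passing a different `runner` (for example `print`) gives a dry run. Note that a split like this does not cover files sitting directly in the top-level directory; a real script would need one extra non-recursive job for those.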
Re: [Bug 3099] Please parallelize filesystem scan
I don't understand: scanning metadata is sped up by thrashing the head all over the disk instead of scanning through mostly sequentially? How does that work out?

/kc

On Fri, Jul 17, 2015 at 02:37:21PM +, samba-b...@samba.org said:
> https://bugzilla.samba.org/show_bug.cgi?id=3099
> --- Comment #8 from Chip Schweiss c...@innovates.com ---
> [...]

--
Ken Chase - k...@heavycomputing.ca skype:kenchase23 Toronto Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
[Bug 3099] Please parallelize filesystem scan
https://bugzilla.samba.org/show_bug.cgi?id=3099

--- Comment #7 from Rainer rai...@voigt-home.net ---

Hi,

I'm experiencing the very same problem: I'm trying to sync a set of VMware disk files (about 2.5TB) with not too many changes, and direct copying is still faster than checksumming by quite a large margin, because checksumming the source and then the target sequentially doubles the time needed.

I think the point is that the GigE link between the PC and the NAS achieves about 80MB/s, and the HDD read rate is not much higher (approx. 130MB/s). If the checksumming ran on source and target in parallel, we could ideally (if nothing changed) reach the read rate of the HDDs as 'transfer' bandwidth, because that is the speed at which we can verify that the data is the same on both sides. The sequential approach used now reduces the initial check to half the HDD read rate, so transferring unchanged files will only yield about 65MB/s in my case, which is slower than simple copying.

Is the patch you proposed some years ago something I can apply to and try on a current rsync version? If not, could you update it to the 3.1.x version so I can benchmark the parallel checksumming in my situation?

Best Regards
Rainer
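The arithmetic behind Rainer's figures can be written out as a small model. This is only a back-of-envelope sketch: the function names are made up for illustration, and it assumes both disks read at the quoted 130MB/s and that nothing else limits throughput.

```python
def sequential_verify_rate(src_mb_s, dst_mb_s):
    """Effective rate when the source is checksummed first, then the target."""
    # Time per MB is 1/src + 1/dst, so the combined rate is src*dst/(src+dst).
    return (src_mb_s * dst_mb_s) / (src_mb_s + dst_mb_s)


def parallel_verify_rate(src_mb_s, dst_mb_s):
    """Effective rate when both ends checksum at the same time."""
    # The slower side sets the pace.
    return min(src_mb_s, dst_mb_s)


def plain_copy_rate(link_mb_s, disk_mb_s):
    """Rate of a straight copy, limited by the slower of link and disk."""
    return min(link_mb_s, disk_mb_s)


seq = sequential_verify_rate(130, 130)   # half the disk rate
par = parallel_verify_rate(130, 130)     # full disk rate
copy = plain_copy_rate(80, 130)          # limited by the GigE link
```

With the numbers from the comment, the sequential check comes out below the plain copy rate, while the parallel check would come out above it, which is the point being made.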
Re: [Bug 3099] Please parallelize filesystem scan
Sounds to me like maintaining the metadata cache is what matters, and tuning the filesystem to do so would be more beneficial than caching writes, especially on a backup target, where data once written will likely never be read again (and it isn't a big deal if it is, since so few files change compared to the total number of inodes to scan). Your report that the re-sync takes minutes shows that an unthrashed cache is highly valuable.

So all we need to do is tune the backup target (and even the operational servers themselves) to keep more metadata cached. I don't know how much RAM is used per inode, but I'd throw in another 4-8GB per box just for metadata caching, or even more, if it meant scanning was sped up.

Really, one only needs it on the backup target. If you can run all the backups in parallel, and there are N servers to back up, they can all run at 1/N speed, as long as scanning metadata on the backup target is fast enough to keep up with it all. My total data written is only 20-30GB, for example, which even at a slow 20-30MB/s is only about 17 minutes of writing. Even 200-300GB changed would be about 170 minutes at that rate, and the rate could easily be 4x faster.

So, tuning caches to prefer metadata seems to be key. How? As we've discussed before, letting the filesystem at it throws away precious metadata cache, so tracking your own changes (since the backup system will never be used for anything else, right? :) would be beneficial. Of course, the danger is using the backup system for anything else and changing any of the target info: inconsistencies would crop up and make the backup worthless very quickly.

/kc

On Fri, Jul 17, 2015 at 03:18:02PM +, Schweiss, Chip said:
> Modern file systems have many internal queues and service many clients simultaneously. They arrange their work to maximize throughput in both read and write operations. This is the norm on any enterprise file system, be it Hitachi, Oracle, Dell, HP, Isilon, etc.
>
> You will get significantly higher throughput if you hit it with multiple threads. These systems have elaborate predictive read-ahead caches and perform best when multiple threads hit them.
>
> In the test case of a single server with a simple file system such as ext3/4 or xfs, no gains will be seen from multithreading rsync. Use an enterprise file system with hundreds of TBs, and the more threads you use the faster you will go. Metadata and data on these systems end up spread across hundreds of disks. Single threads end up severely bound by latency.
>
> This is why multi-threading should be optional. It doesn't help everyone.
>
> For example, one of my rsync jobs, moving from a ZFS system in St. Louis, Missouri to a Hitachi HNAS in Minneapolis, Minnesota, has over 100 million files. Each day 50 to 100 thousand files get added or updated. A single rsync job would take weeks to parse this job and send the changes. I split it into 120 jobs and it typically completes in 2 hours when no humans are using the systems. A re-sync immediately afterwards, again with 120 jobs, scans both ends in minutes.
>
> -Chip
>
> -----Original Message-----
> From: rsync [mailto:rsync-boun...@lists.samba.org] On Behalf Of Ken Chase
> Sent: Friday, July 17, 2015 9:51 AM
> To: samba-b...@samba.org
> Cc: rsync...@samba.org
> Subject: Re: [Bug 3099] Please parallelize filesystem scan
>
> [...]
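Ken's write-time estimate is simple division, sketched below. The helper name `transfer_minutes` is made up for illustration, and decimal units (1 GB = 1000 MB) are an assumption; the thread does not say which convention was meant.

```python
def transfer_minutes(changed_gb, rate_mb_s):
    """Minutes needed to write changed_gb at rate_mb_s (assumes 1 GB = 1000 MB)."""
    seconds = changed_gb * 1000 / rate_mb_s
    return seconds / 60


small = transfer_minutes(20, 20)     # nightly change set at a slow rate
large = transfer_minutes(200, 20)    # ten times the data, same rate
faster = transfer_minutes(200, 80)   # same data at a 4x faster rate
```

Under these assumptions 20GB at 20MB/s takes a little under 17 minutes and 200GB takes ten times that, so the writing itself is a small part of a nightly window compared with a metadata scan that runs for hours.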
Re: clone a disk
Hi TG,

You can keep an up-to-date copy of the files/folders/pipes/etc. in the 100GB space using rsync, but not a true clone of the partition.

To get a true clone of the boot partition, you'd need to boot from a rescue CD, mount the other machine's 100GB space and dd the boot partition device to a file on the 100GB space. You'd also probably want to get the Master Boot Record by grabbing the first 2K of the raw boot device into a separate file, i.e. something like:

    dd bs=512 if=/dev/sda1 of=/mnt/u/backup/jessie_sda1
    dd bs=512 count=4 if=/dev/sda of=/mnt/u/backup/jessie_mbr_sda

The device names might be different, the mount folder may be different, etc., but the idea works.

With a new, blank drive, you could recreate the partitions using fdisk, reverse the dd commands, boot, and then work on getting the second partition back up and running.

--
Larry Irwin

On 07/17/2015 01:40 PM, Thierry Granier wrote:
> [...]
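For readers less familiar with dd, the second command above (`bs=512 count=4`) simply copies the first 4 x 512 = 2048 bytes of the device. A minimal Python sketch of the same idea follows; `copy_prefix` and `has_mbr_signature` are hypothetical helper names, not part of any tool mentioned in the thread. The signature check relies on the standard MBR layout, where bytes 510 and 511 of the first sector hold 0x55 0xAA.

```python
def copy_prefix(src_path, dst_path, bs=512, count=4):
    """Copy the first bs*count bytes of src_path to dst_path; return bytes written."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        data = src.read(bs * count)
        dst.write(data)
    return len(data)


def has_mbr_signature(image_path):
    """True if bytes 510-511 of the first sector hold the 0x55 0xAA boot signature."""
    with open(image_path, "rb") as f:
        sector = f.read(512)
    return len(sector) == 512 and sector[510:512] == b"\x55\xaa"
```

Run as root against a raw device, for example `copy_prefix("/dev/sda", "/mnt/u/backup/jessie_mbr_sda")`, this would produce the same file as the dd command; checking the signature afterwards is a cheap sanity test that the right device was read.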
Re: clone a disk
Thierry Granier th.gran...@free.fr wrote:
> i have a machine A with 2 disks 1 et 2 running Debian Jessie
> on 1 is the system and the boot and the swap
> on 2 different partitions like /home /opt ETC.
> i have a machine B with 1 disk running kali-linux and 100G free
> Can i clone the disk 1 of machine A on the 100G free on machine B with rsync?
> If it is possible, how to do that?

Yes, it's easy to do; I do that for the primary backup on all my systems.

Let's say you are doing it from machine a, and backing up to directory /backup_a on b. Logged in as root, you could do it with:

    rsync -avH --delete --exclude-from=/etc/rsync_excludes / root@b:/backup_a/

-a means archive and sets several parameters, v simply makes things verbose, and H means correctly handle hard-linked files. --delete means delete files from the destination that have been removed from the source, and --exclude-from specifies a file containing a list of exclusions to omit.

You need to exclude a bunch of stuff, things like /dev/*, /proc/*, /sys/*, and so on. You can also exclude things you don't want to copy, such as log files.

However, this is interactive and also needs permission to log in as root on the destination (which I block for security). For regular backups it is far better to use rsync as a service on the destination, which only needs a few more steps.

Also note that trailing /s on source and destination are significant. root@b:/backup_a/ will produce different results to root@b:/backup_a !
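To make the pieces of that command concrete, here is a small sketch that generates the exclude file contents and assembles the same rsync invocation as an argument list. The helper names are hypothetical, and the exclude list only contains the three patterns named in the message; Simon's real /etc/rsync_excludes is not shown in the thread and presumably holds more.

```python
def exclude_file_text(patterns):
    """Render exclude patterns one per line, as read by --exclude-from."""
    return "\n".join(patterns) + "\n"


def build_backup_command(exclude_file, dest):
    """Assemble the rsync command from the message as an argv list."""
    return [
        "rsync",
        "-avH",                            # archive, verbose, preserve hard links
        "--delete",                        # remove files that vanished from the source
        f"--exclude-from={exclude_file}",  # file listing patterns to skip
        "/",                               # source: the whole root tree
        dest,
    ]


excludes = exclude_file_text(["/dev/*", "/proc/*", "/sys/*"])
command = build_backup_command("/etc/rsync_excludes", "root@b:/backup_a/")
```

The patterns end in `/*` so that the directories themselves are still created on the destination while their contents are skipped. The resulting list can be handed to `subprocess.run(command)` once the exclude file has been written.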
clone a disk
Hello,

I have a machine A with two disks, 1 and 2, running Debian Jessie. Disk 1 holds the system, the boot partition and the swap; disk 2 holds other partitions such as /home, /opt, etc.

I have a machine B with one disk running kali-linux and *100G free*.

Can I clone disk 1 of machine A onto the 100G free on machine B with rsync? If it is possible, how do I do that?

Many thanks
TG
Re: [Bug 3099] Please parallelize filesystem scan
Ken, this just happens to be a special case where your configuration has a huge number of spindles.

If you have multiple threads reading the same spindle, you'll just be thrashing the heads back and forth. If there is one thread reading at the front of the disk and another thread reading at the end of the disk, it will be *slower* than if you have just one thread reading first the front of the disk and then the end of the disk. Two threads will just have the head whipping back and forth.

> one of my rsync jobs moving from a ZFS system ... has over 100 million files

Spread over how many spindles?

The problem is, the optimum way to access the disks depends on how the data lies on the disks. And that's something that a mere program cannot know. Only the filesystem can know that information. Whether it's ext4, md, btrfs, zfs, or whatever, a program like rsync cannot possibly know how best to access the disk(s) and with how many simultaneous threads.
Re: clone a disk
This is good info for backing up the MBR (which includes the partition table). However, if you are going to image a partition, use either ddrescue (does the same thing but has a status screen, can resume, and works around read errors) or partimage (has a status screen, and it understands (most) filesystems so it can leave the empty space sparse).

On 07/17/2015 02:48 PM, Larry Irwin (gmail) wrote:
> [...]

--
Kevin Korb                      Phone: (407) 252-6853
Systems Administrator           Internet:
FutureQuest, Inc.               ke...@futurequest.net (work)
Orlando, Florida                k...@sanitarium.net (personal)
Web page: http://www.sanitarium.net/
PGP public key available on web site.
Re: clone a disk
I would add --numeric-ids and --itemize-changes. It's up to you whether you need --xattrs or --acls.

Also, I prefer to do backups by filesystem, so I would add --one-file-system and run one rsync per filesystem. This means you don't have to exclude things like /proc and /dev, or any random thing that isn't normally connected but sometimes is, but it also means you have to list all the filesystems that you do want to back up.

On 07/17/2015 02:21 PM, Simon Hobson wrote:
> [...]

--
Kevin Korb, Systems Administrator, FutureQuest, Inc.
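The per-filesystem approach could be scripted along these lines. This is a sketch under stated assumptions, not Kevin's actual setup: the helper name, the destination layout (each mount point mirrored under the backup root) and the inclusion of --delete from the earlier command are all illustrative choices. The rsync options themselves are the ones named in the thread.

```python
def per_filesystem_commands(mount_points, dest_root, xattrs=False, acls=False):
    """Build one rsync argv list per mounted filesystem to be backed up."""
    base = [
        "rsync", "-aH",
        "--numeric-ids",        # keep uid/gid numbers instead of mapping by name
        "--itemize-changes",    # log what changed for each file
        "--one-file-system",    # do not cross into other mounts (/proc, /dev, ...)
        "--delete",
    ]
    if xattrs:
        base.append("--xattrs")
    if acls:
        base.append("--acls")
    commands = []
    for mount_point in mount_points:
        src = mount_point.rstrip("/") + "/"   # "/" stays "/", "/home" becomes "/home/"
        dst = dest_root.rstrip("/") + src     # mirror the path under the backup root
        commands.append(base + [src, dst])
    return commands


cmds = per_filesystem_commands(["/", "/home"], "root@b:/backup_a")
```

Listing "/" first matters in a layout like this, because the root filesystem's run creates the empty mount-point directories that the later runs fill in.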