Re: Nice little performance improvement
On Sat, 2009-10-17 at 12:13 -0700, Mike Connell wrote:
> > Interesting.  If you're not using incremental recursion (the default in
> > rsync >= 3.0.0), I can see that the "du" would help by forcing the
> > destination I/O to overlap the file-list building in time.  But with
> > incremental recursion, the "du" shouldn't be necessary because rsync
> > actually overlaps the checking of destination files with the file-list
> > building on the source.
>
> Ignoring incremental recursion for a moment.

Don't ignore it; it makes a difference.

> It seems to me that anything that can warm up the file cache before it
> is needed would be beneficial?

I didn't reason it out carefully enough; let's try again...

Warming up the destination file cache decreases the amount of time the
generator spends blocked on I/O.  So the answer is yes, provided that the
generator is the bottleneck.  If incremental recursion is not used, that's
almost certainly the case during the main phase of the rsync run, since the
generator is checking all the destination files while the sender is only
processing the small number of source files that need a transfer.  But with
incremental recursion, the sender and generator check files in parallel, so
the sender may be the bottleneck, depending on the relative speeds and disk
configurations of the two machines.  (I take it that your rsync run is
local; for remote runs, the network could be the bottleneck.)

-- 
Matt

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
Re: Nice little performance improvement
Mike Connell wrote:
> Hi,
>
> > Interesting.  If you're not using incremental recursion (the default in
> > rsync >= 3.0.0), I can see that the "du" would help by forcing the
> > destination I/O to overlap the file-list building in time.  But with
> > incremental recursion, the "du" shouldn't be necessary because rsync
> > actually overlaps the checking of destination files with the file-list
> > building on the source.
>
> Ignoring incremental recursion for a moment.  It seems to me that anything
> that can warm up the file cache before it is needed would be beneficial?

No, not if the file cache isn't large enough for the number of files.
E.g. if you have 20 million files and only 256MB RAM, it's likely a bad
idea.

Personally I use a program that I wrote about 11 years ago, called
treescan, which pulls the inodes into cache about twice as fast as du by
sorting on inode number.

-- 
Jamie
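(treescan itself isn't shown on the list, but the inode-sorting idea it describes can be sketched roughly like this — the function name and two-pass structure below are my own illustration, not treescan's actual code.  Pass 1 collects inode numbers straight from the directory entries, which is cheap; pass 2 stats the files in ascending inode order, so the reads sweep the on-disk inode table roughly sequentially instead of seeking in directory order.)

```python
import os

def warm_inode_cache(root):
    # Pass 1: walk the tree collecting (inode, path) pairs.  With
    # os.scandir the inode number comes from the directory entry
    # itself, so no per-file stat is needed yet.
    pairs = []
    stack = [root]
    while stack:
        d = stack.pop()
        with os.scandir(d) as it:
            for entry in it:
                pairs.append((entry.inode(), entry.path))
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    # Pass 2: stat in ascending inode order, so the inode reads are
    # roughly sequential on disk rather than random.
    pairs.sort()
    for _, path in pairs:
        os.lstat(path)
    return len(pairs)
```

On a cold cache this ordering is where the ~2x speedup over du would come from; on a warm cache it makes no difference.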
Re: Nice little performance improvement
> No, not if the file cache isn't large enough for the number of files.
> E.g. if you have 20 million files and only 256MB RAM, it's likely a bad
> idea.

Splitting down to the subsub (two-levels-down) directory level lets a
single subsub rsync fit for me.  Warming the cache is beneficial here; I
didn't say it was in every situation.
Re: Nice little performance improvement
> Hi,
>
> > In order to expeditiously move these new files offsite, we use a
> > modified version of pyinotify to log all added/altered files across
> > the entire filesystem(s) and then every five minutes feed the list to
> > rsync with the --files-from option.  This works very effectively and
> > quickly.
>
> Interesting...
>
> How do you tell rsync to delete files that were deleted from the source,
> or is that not part of your use case?

For us, that is not a necessary part of our use case.  It would certainly
be possible, however, to capture the delete events and remove the files
with some other helper script rather than use rsync directly (rsync gives
no advantage in that scenario except the ability to reuse the existing
network transport mechanism).

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com
Re: Nice little performance improvement
Hi,

> Interesting.  If you're not using incremental recursion (the default in
> rsync >= 3.0.0), I can see that the "du" would help by forcing the
> destination I/O to overlap the file-list building in time.  But with
> incremental recursion, the "du" shouldn't be necessary because rsync
> actually overlaps the checking of destination files with the file-list
> building on the source.

Ignoring incremental recursion for a moment: it seems to me that anything
that can warm up the file cache before it is needed would be beneficial?
Re: Nice little performance improvement
Hi,

> In order to expeditiously move these new files offsite, we use a modified
> version of pyinotify to log all added/altered files across the entire
> filesystem(s) and then every five minutes feed the list to rsync with the
> --files-from option.  This works very effectively and quickly.

Interesting...

How do you tell rsync to delete files that were deleted from the source,
or is that not part of your use case?

Thanks,
Mike
Re: Nice little performance improvement
On Thu, 2009-10-15 at 19:07 -0700, Mike Connell wrote:
> Today I tried the following:
>
> For all subsub directories:
>   a) Fork a "du -s subsubdirectory" on the destination subsubdirectory
>   b) Run rsync on the subsubdirectory
>   c) Repeat until done
>
> Seems to have improved the time it takes by about 25-30%.  It looks like
> the du can run ahead of the rsync... so that while rsync is building its
> file list, the du is warming up the file cache on the destination.  Then
> when rsync looks to see what it needs to do on the destination, it can
> do this more efficiently.

Interesting.  If you're not using incremental recursion (the default in
rsync >= 3.0.0), I can see that the "du" would help by forcing the
destination I/O to overlap the file-list building in time.  But with
incremental recursion, the "du" shouldn't be necessary because rsync
actually overlaps the checking of destination files with the file-list
building on the source.

-- 
Matt
Re: Nice little performance improvement
> Hi,
>
> In my situation I'm using rsync to back up a server with (currently)
> about 570,000 files.  These are all little files, and maybe 0.1% of them
> change or new ones are added in any 15-minute period.

Hi Mike,

We have three filesystems that between them hold approximately 22 million
files, with around 10-20,000 new or changed files every business day.  In
order to expeditiously move these new files offsite, we use a modified
version of pyinotify to log all added/altered files across the entire
filesystem(s) and then every five minutes feed the list to rsync with the
--files-from option.  This works very effectively and quickly.

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com
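(The watcher itself isn't shown on the list, but the hand-off from an accumulated change list to rsync can be sketched like this.  This is my own illustration, not Darryl's script: the function name, list-file handling, and pluggable `run` hook are assumptions; the real rsync pieces are just `-a` and `--files-from`, which reads source-relative paths one per line.)

```python
import os
import subprocess
import tempfile

def flush_batch(changed_paths, src_root, dest, run=subprocess.run):
    """Write the accumulated change list and hand it to rsync.

    changed_paths: absolute paths logged by the inotify watcher since
    the last flush.  --files-from wants them relative to the source
    root, one per line.
    """
    if not changed_paths:
        return None
    rel = sorted(os.path.relpath(p, src_root) for p in set(changed_paths))
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".list") as f:
        f.write("\n".join(rel) + "\n")
        listfile = f.name
    cmd = ["rsync", "-a", "--files-from=" + listfile, src_root + "/", dest]
    run(cmd, check=True)
    return cmd
```

A scheduler (cron, or a five-minute timer in the watcher process) would call `flush_batch` with whatever paths accumulated since the previous run.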
Nice little performance improvement
Hi,

In my situation I'm using rsync to back up a server with (currently) about
570,000 files.  These are all little files, and maybe 0.1% of them change
or new ones are added in any 15-minute period.

I've split the main tree up so rsync can run on subsub directories of the
main tree.  It does each of these subsub directories sequentially.  I would
have liked to run some of these in parallel, but that seems to increase I/O
on the main server too much.

Today I tried the following:

For all subsub directories:
  a) Fork a "du -s subsubdirectory" on the destination subsubdirectory
  b) Run rsync on the subsubdirectory
  c) Repeat until done

Seems to have improved the time it takes by about 25-30%.  It looks like
the du can run ahead of the rsync... so that while rsync is building its
file list, the du is warming up the file cache on the destination.  Then
when rsync looks to see what it needs to do on the destination, it can do
this more efficiently.

Looks like a keeper so far.  Any other suggestions?  (I was thinking of a
previous suggestion of setting /proc/sys/vm/vfs_cache_pressure to a low
value.)

Thanks,
Mike
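(The a/b/c loop above can be sketched roughly as follows.  The function name, path layout, and pluggable `popen`/`run` hooks are my own illustration; the real commands are just `du -s` and `rsync`.  The point is that du is forked and left running while rsync starts, so the two overlap.)

```python
import subprocess

def sync_with_prefetch(subdirs, src_root, dst_root,
                       popen=subprocess.Popen, run=subprocess.run):
    for sub in subdirs:
        # a) Fork "du -s" on the destination copy; while rsync is still
        #    building its file list, du walks the destination tree and
        #    pulls its dentries/inodes into the cache.
        du = popen(["du", "-s", "%s/%s" % (dst_root, sub)],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        # b) Run rsync on the same subdirectory; by the time its
        #    generator stats the destination, much of it is cached.
        run(["rsync", "-a", "%s/%s/" % (src_root, sub),
             "%s/%s/" % (dst_root, sub)], check=True)
        du.wait()
```

Note that du is deliberately not waited on before rsync starts; the `wait()` at the end of each iteration only reaps it.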