Re: High memory usage - any way around it other than splitting jobs?
Andy Smith via rsync wrote:

> I have a virtual machine with 2G of memory. On this VM there is a
> directory tree with 33.3 million files in it. When attempting to
> rsync (rsync -PSHav --delete /source /dest) this tree from one
> directory to another on the same host, rsync uses all the memory and
> is killed by oom-killer.
>
> This host is Debian oldstable so has
>
> $ rsync --version
> rsync version 3.1.2 protocol version 31

Since this is all taking place on a single VM (thus there is no
network involved), it's possible that rsync is not the best tool for
the job. Had you considered something like:

  $ ( cd /source && find . -depth -print0 | cpio -p -0l /dest ) && rm -rf /source

One advantage of "cpio -p -l" is that it avoids copying any of the
files -- it just makes a new directory tree containing hardlinks to
the original files. (I am guessing that, since this is a VM, it is
likely to have been set up with a single, large filesystem on a
single (virtual) drive rather than the older approach of creating
multiple partitions -- so that /source and /dest are in the same
filesystem.)

For that matter, what about:

  # rm -rf /dest
  # mv /source /dest

i.e. just rename the source tree?
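If GNU coreutils is available, "cp -al" builds the same kind of
hardlink farm, and stat makes it easy to confirm that no file data
was actually duplicated. A rough, untested sketch (the file path is
just an example):

  $ mkdir /dest
  $ cp -al /source/. /dest      # -a archive mode, -l hardlink instead of copying
  $ stat -c '%i %h %n' /source/path/to/file /dest/path/to/file
                                # expect: same inode number on both sides,
                                # and the link count bumped up by one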
Re: High memory usage - any way around it other than splitting jobs?
Kevin Korb via rsync wrote:

> Unfortunately the hard links are the problem. In order to keep
> them straight rsync has to remember the details of every file it
> finds with a link count >1 making it grow and grow.

I _hope_ it is only remembering the source and destination inode
numbers, and pruning that list when possible[*], as opposed to
storing all of "the details" for the duration of the operation.

[*] If the source file has (say) 3 hardlinks, it can be deleted from
the list as soon as the 3rd hardlink has been copied.
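That bookkeeping can be sketched in a pipeline, if only to gauge how
big such a table would have to get on this tree. With GNU find and
gawk (untested; -xdev assumes the whole tree is on one filesystem):

  $ find /source -xdev -type f -links +1 -printf '%i %n %p\n' |
        gawk '{ seen[$1]++                # one table entry per inode
                if (seen[$1] == $2)       # all of its links visited...
                  delete seen[$1] }       # ...so the entry can be pruned
              END { print length(seen), "inodes never pruned" }'

Anything still in the table at the end has hardlinks outside /source;
everything else could have been dropped mid-run, which is the saving
I am hoping rsync makes.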
Re: High memory usage - any way around it other than splitting jobs?
Unfortunately the hard links are the problem. In order to keep them
straight rsync has to remember the details of every file it finds
with a link count >1, making it grow and grow. Of course, without -H
rsync will end up duplicating them.

On 6/25/20 10:30 AM, Andy Smith via rsync wrote:
> Hi,
>
> I have a virtual machine with 2G of memory. On this VM there is a
> directory tree with 33.3 million files in it. When attempting to
> rsync (rsync -PSHav --delete /source /dest) this tree from one
> directory to another on the same host, rsync uses all the memory and
> is killed by oom-killer.
>
> This host is Debian oldstable so has
>
> $ rsync --version
> rsync version 3.1.2 protocol version 31
>
> The normal operation of this VM does not require more than 2G of
> memory, but I doubled it to 4G anyway. Unfortunately rsync still
> uses all the memory and is killed.
>
> Most advice I can find on decreasing rsync memory usage advises to
> split the job up into batches. By issuing one rsync for each
> directory within /source I was able to make this work.
>
> The interesting thing is though, the split of file numbers between
> sub-directories is very uneven with the majority of them (31.5
> million of the 33.3 million) being in just one of the sub-directory
> trees. I am kind of surprised that rsync has such a problem going
> just that little bit further with the last 2 million. Is there any
> scope for improvement with the incremental recursion code?
>
> If I upgraded the version of rsync could I expect this to work any
> better?
>
> I could also give the host a massive swap file. It currently has
> just 1G of swap, which all gets used in the failure case. I could
> add more but I fear that the job will go so slow it will not
> complete in a reasonable time.
>
> I don't know if the -H option is causing extra memory usage here;
> unfortunately it is necessary as there are hardlinks in there.
>
> Some years old advice says to disable incremental recursion with
> --no-i-r. As incremental recursion was added to reduce memory usage
> this seems counter-intuitive to me, but this advice is all over the
> Internet…
>
> These are all things I will investigate before settling for the
> "split into multiple jobs" approach; just wondered if anyone has any
> shortcuts for me.
>
> Thanks,
> Andy

-- 
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,
	Kevin Korb			Phone:    (407) 252-6853
	Systems Administrator		Internet:
	FutureQuest, Inc.		ke...@futurequest.net  (work)
	Orlando, Florida		k...@sanitarium.net (personal)
	Web page:			https://sanitarium.net/
	PGP public key available on web site.
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,
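A quick way to see the duplication Kevin describes, on a scratch
directory (the /tmp paths here are just examples):

  $ mkdir -p /tmp/src && echo data > /tmp/src/a && ln /tmp/src/a /tmp/src/b
  $ rsync -a  /tmp/src/ /tmp/no_h/        # without -H
  $ rsync -aH /tmp/src/ /tmp/with_h/      # with -H
  $ stat -c '%h %i %n' /tmp/no_h/* /tmp/with_h/*
      # expect: no_h/a and no_h/b each show link count 1 and different
      # inodes (the link was broken); with_h/a and with_h/b share one
      # inode with link count 2 (the link was preserved)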
High memory usage - any way around it other than splitting jobs?
Hi,

I have a virtual machine with 2G of memory. On this VM there is a
directory tree with 33.3 million files in it. When attempting to
rsync (rsync -PSHav --delete /source /dest) this tree from one
directory to another on the same host, rsync uses all the memory and
is killed by oom-killer.

This host is Debian oldstable so has

$ rsync --version
rsync version 3.1.2 protocol version 31

The normal operation of this VM does not require more than 2G of
memory, but I doubled it to 4G anyway. Unfortunately rsync still
uses all the memory and is killed.

Most advice I can find on decreasing rsync memory usage advises to
split the job up into batches. By issuing one rsync for each
directory within /source I was able to make this work.

The interesting thing is though, the split of file numbers between
sub-directories is very uneven with the majority of them (31.5
million of the 33.3 million) being in just one of the sub-directory
trees. I am kind of surprised that rsync has such a problem going
just that little bit further with the last 2 million. Is there any
scope for improvement with the incremental recursion code?

If I upgraded the version of rsync could I expect this to work any
better?

I could also give the host a massive swap file. It currently has
just 1G of swap, which all gets used in the failure case. I could
add more but I fear that the job will go so slow it will not
complete in a reasonable time.

I don't know if the -H option is causing extra memory usage here;
unfortunately it is necessary as there are hardlinks in there.

Some years old advice says to disable incremental recursion with
--no-i-r. As incremental recursion was added to reduce memory usage
this seems counter-intuitive to me, but this advice is all over the
Internet…

These are all things I will investigate before settling for the
"split into multiple jobs" approach; just wondered if anyone has any
shortcuts for me.

Thanks,
Andy
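P.S. In case it helps anyone searching later, the per-directory
batching described above looks roughly like this (untested sketch;
note that -H can only preserve hardlinks within a single run, so any
links spanning two batches come out as separate files):

  $ for d in /source/*/; do
        rsync -PSHav --delete "$d" "/dest/${d#/source/}"
    done
    # plus a separate pass for any files sitting directly in /source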