Re: High memory usage - any way around it other than splitting jobs?

2020-06-25 Thread Perry Hutchison via rsync
Andy Smith via rsync wrote:

> I have a virtual machine with 2G of memory. On this VM there is a
> directory tree with 33.3 million files in it. When attempting to
> rsync (rsync -PSHav --delete /source /dest) this tree from one
> directory to another on the same host, rsync uses all the memory and
> is killed by oom-killer.
>
> This host is Debian oldstable so has
>
> $ rsync --version
> rsync  version 3.1.2  protocol version 31

Since this is all taking place on a single VM (so there is no
network involved), it's possible that rsync is not the best tool
for the job.  Have you considered something like:

$ ( cd /source && find . -depth -print0 | cpio -p -0dl /dest ) && rm -rf /source

One advantage of "cpio -p -l" is that it avoids copying any of the
files -- it just makes a new directory tree containing hardlinks to
the original files.
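
If GNU coreutils is available (a safe bet on Debian), the same
hardlink-tree copy can be done with a single command -- I have not
tested it on a tree of your size, but the idiom is well known:

$ cp -al /source /dest

-a recurses and preserves attributes; -l makes hard links instead
of copying file contents.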

(I am guessing that, since this is a VM, it is likely to have been
set up with a single, large filesystem on a single (virtual) drive
rather than the older approach of creating multiple partitions --
so that /source and /dest are in the same filesystem.)
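
A quick way to check that guess (GNU stat; %d prints the device
number, which must match on both sides for hard links to work):

$ stat -c %d /source /dest

If /dest does not exist yet, run it against /dest's parent
directory instead.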

For that matter, what about:

# rm -rf /dest
# mv /source /dest

i.e. just rename the source tree?



Re: High memory usage - any way around it other than splitting jobs?

2020-06-25 Thread Perry Hutchison via rsync
Kevin Korb via rsync wrote:

> Unfortunately the hard links are the problem.  In order to keep
> them straight, rsync has to remember the details of every file it
> finds with a link count >1, which makes its memory use grow and grow.

I _hope_ it is only remembering the source and destination inode
numbers, and pruning that list when possible[*], as opposed to
storing all of "the details" for the duration of the operation.

[*] If the source file has (say) 3 hardlinks, it can be deleted
from the list as soon as the 3rd hardlink has been copied.
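
That pruning is easy enough to mock up outside of rsync.  As a toy
illustration (GNU find and gawk assumed -- this is my sketch of the
idea, not rsync's actual code), the following walks the tree in copy
order, counts the links seen per inode, and drops an inode from the
table as soon as its last link has been seen:

$ find /source -xdev -type f -links +1 -printf '%i %n\n' \
    | gawk '{ if (++seen[$1] == $2) delete seen[$1] }
            END { print length(seen), "entries could never be pruned" }'

Whatever remains at the end is an inode with links outside the tree,
which is exactly the case where pruning cannot help.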



Re: High memory usage - any way around it other than splitting jobs?

2020-06-25 Thread Kevin Korb via rsync
Unfortunately the hard links are the problem.  In order to keep them
straight, rsync has to remember the details of every file it finds with
a link count >1, which makes its memory use grow and grow.  Of course
without -H rsync will end up duplicating them.
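
If you want a rough idea of how big that list gets, you can count
the distinct multiply-linked inodes up front (GNU find syntax; this
estimates the number of entries, not rsync's actual memory use):

$ find /source -xdev -type f -links +1 -printf '%i\n' | sort -u | wc -l

Every inode that command counts is something rsync has to hold onto
for the duration of the run.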

On 6/25/20 10:30 AM, Andy Smith via rsync wrote:
> I have a virtual machine with 2G of memory. On this VM there is a
> directory tree with 33.3 million files in it. When attempting to
> rsync (rsync -PSHav --delete /source /dest) this tree from one
> directory to another on the same host, rsync uses all the memory and
> is killed by oom-killer.
[...]
> I don't know if the -H option is causing extra memory usage here;
> unfortunately it is necessary as there are hardlinks in there.

-- 
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,
Kevin Korb                      Phone:    (407) 252-6853
Systems Administrator           Internet:
FutureQuest, Inc.               ke...@futurequest.net  (work)
Orlando, Florida                k...@sanitarium.net  (personal)
Web page:                       https://sanitarium.net/
PGP public key available on web site.
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,





High memory usage - any way around it other than splitting jobs?

2020-06-25 Thread Andy Smith via rsync
Hi,

I have a virtual machine with 2G of memory. On this VM there is a
directory tree with 33.3 million files in it. When attempting to
rsync (rsync -PSHav --delete /source /dest) this tree from one
directory to another on the same host, rsync uses all the memory and
is killed by oom-killer.

This host is Debian oldstable so has

$ rsync --version
rsync  version 3.1.2  protocol version 31

The normal operation of this VM does not require more than 2G of
memory, but I doubled it to 4G anyway. Unfortunately rsync still
uses all the memory and is killed.

Most advice I can find on decreasing rsync memory usage suggests
splitting the job up into batches. By issuing one rsync for each
directory within /source I was able to make this work.
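
For anyone curious, that means a loop along these lines (illustrative
only -- note that --delete then only prunes within each sub-tree,
top-level files in /source need a separate pass, and hard links that
span two sub-directories get duplicated):

$ for d in /source/*/; do rsync -PSHav --delete "$d" "/dest/${d#/source/}"; done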

The interesting thing, though, is that the split of file numbers
between sub-directories is very uneven, with the majority of them
(31.5 million of the 33.3 million) in just one of the sub-directory
trees. I am somewhat surprised that rsync has such a problem going
just that little bit further with the last 2 million. Is there any
scope for improvement in the incremental recursion code?

If I upgraded the version of rsync could I expect this to work any
better?

I could also give the host a massive swap file. It currently has
just 1G of swap, all of which gets used in the failure case. I could
add more, but I fear that the job would then run so slowly that it
would not complete in a reasonable time.
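
If I do try it, the usual Linux recipe for adding a swap file, with
the size picked out of the air:

# fallocate -l 8G /swapfile && chmod 600 /swapfile
# mkswap /swapfile && swapon /swapfile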

I don't know if the -H option is causing extra memory usage here;
unfortunately it is necessary as there are hardlinks in there.

Some years-old advice says to disable incremental recursion with
--no-i-r. As incremental recursion was added to reduce memory usage,
this seems counter-intuitive to me, but the advice is all over the
Internet…

These are all things I will investigate before settling for the
"split into multiple jobs" approach; just wondered if anyone has any
shortcuts for me.
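
One shortcut I did think of: since a --dry-run should still build the
full file list, its peak memory ought to show whether a given option
combination can fit, without writing anything (assuming GNU time is
installed as /usr/bin/time):

$ /usr/bin/time -v rsync --dry-run -PSHav --delete /source /dest >/dev/null

The "Maximum resident set size" line in the report is the number to
watch. It still has to walk all 33.3 million files, so it is not quick.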

Thanks,
Andy

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html