Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Hi Harald,

Thanks for the comprehensive account of the job flow - everything works as expected now. Interestingly, I originally assumed it was a bug because of an observed discrepancy with bash...

On 29/03/2020 23:07, Jilles Tjoelker wrote:
> I agree that the change is incorrect, but I do not agree that this kind
> of script must leak memory. Per POSIX.1-2008 XCU 2.9.3.1 Asynchronous
> Lists, an implementation has additional ways to forget about jobs than
> just an appropriate completion of the wait utility: if another
> asynchronous job is started when $! was not referenced or if the number
> of known process IDs would exceed {CHILD_MAX} (which tends to be rather
> big, though).

I can see in the bash man pages:

    CHILD_MAX
        Set the number of exited child status values for the shell to
        remember. Bash will not allow this value to be decreased below a
        POSIX-mandated minimum, and there is a maximum value (currently
        8192) that this may not exceed. The minimum value is
        system-dependent.

Running the above scripts under bash shows RSS growing only initially; with sleep .1 that growth ends after some 800 seconds, as one can deduce from the man page. That is, RSS growth is bounded.

I understand the dash philosophy may be focussed on the most minimalistic solution possible. Such a solution is inherently less forgiving of mistakes. On the other hand, the mistake in question can be difficult to detect and leads all the way to OOM, negatively affecting user experience. It is up to you and the community where that balance should rest. And if you decide the change is worth making - I will be more than happy to contribute.

Cheers,
Vitaly
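[Editorial note: the "some 800 seconds" figure follows from a back-of-envelope calculation - the 8192 ceiling is bash's documented maximum quoted above, and 10 jobs per second corresponds to the script's sleep .1. A minimal sketch (variable names are illustrative):

```shell
# Rough bound on when bash's RSS growth stops: the shell remembers at
# most CHILD_MAX exited statuses (documented maximum 8192); spawning
# one background job every 0.1 s fills that table in this many seconds:
child_max=8192
jobs_per_sec=10
echo $((child_max / jobs_per_sec))   # prints 819
```

After roughly that many seconds the status table is full and bash starts discarding the oldest entries, so resident memory stops growing.]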
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Hi Vitaly,

On 31/03/2020 20:07, Vitaly Zuevsky wrote:
> I must have confused two concepts: a waited process in the OS -vs- a
> waited job inside the shell interpreter. I am trying to see how it works
> in practice:
>
> # true & false &
> #
> [2] + Done(1)    false
> [1] + Done       true
> # wait 2
> # echo $?
> 127
>
> As we preserve job exit codes, I would expect the wait command to read
> them and free the associated jobtab slots. In the above example I expect
> to see 1 in place of 127. What do I miss here?
>
> Many thanks for your help. Best, Vitaly

There are two problems here.

The first is that waiting for jobs involves the % symbol. "wait 2" always means "wait for process 2", never "wait for job 2". To wait for job 2, the syntax is "wait %2".

The second problem is that as soon as you see that "[2] + Done(1)", the job has already been removed from the job table, so "wait %2" no longer works after that. The blank line (except for the #) indicates that you pressed Return at that point. In interactive shells, jobs are checked for completion and removed from the job table automatically when prompting for input. Since it was practically impossible to press Return before the "true" and "false" commands had finished, the jobs were cleaned up when the next prompt was shown. If you had typed "wait %2" before pressing Return, it would have worked and the subsequent "echo $?" would have printed "1".
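[Editorial note: the two points above can be seen in a short non-interactive sketch. In a script there are no prompts, so finished jobs are not reaped before wait runs; the thread's seq script confirms that %N jobspecs work in dash scripts:

```shell
#!/bin/sh
# %1 refers to job 1 in the shell's job table; a bare number would be
# taken as a process ID instead.
false &       # becomes job %1, exits with status 1
wait %1       # wait by job ID, not by PID
echo $?       # prints the job's exit status: 1
```

Run interactively, the same commands can fail with 127 exactly as described, because the job is dropped from the table at the next prompt.]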
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Hi Harald,

> set -- $(seq 1 100)
> for i
> do
>         : &
>         sleep .1
> done
> for i
> do
>         wait %$i
> done
>
> This is a valid script and works fine in dash. Your change breaks this
> by not keeping the jobs around long enough, and I hope this test script
> shows that there is no way to keep the jobs around long enough but by
> allocating ever more memory.

I must have confused two concepts: a waited process in the OS -vs- a waited job inside the shell interpreter. I am trying to see how it works in practice:

# true & false &
#
[2] + Done(1)    false
[1] + Done       true
# wait 2
# echo $?
127

As we preserve job exit codes, I would expect the wait command to read them and free the associated jobtab slots. In the above example I expect to see 1 in place of 127. What do I miss here?

Many thanks for your help.

Best,
Vitaly
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
On 29/03/2020 23:07, Jilles Tjoelker wrote:
> On Sun, Mar 29, 2020 at 08:06:31PM +0100, Harald van Dijk wrote:
> > On 29/03/2020 18:54, Vitaly Zuevsky wrote:
> > > I have now fixed this bug locally.
> > > The leak is in the jobtab array (jobs.c). I concluded that the most
> > > logical approach would be eliminating the inconsistency between the
> > > makejob() and dowait() functions. My fix in a forked repo:
> > > https://salsa.debian.org/psvz-guest/dash/-/commit/5e3ea90cb3355d1308c482661a471883d36af5e7
>
> > This change is incorrect. The reason dash keeps on allocating memory
> > is because dash needs to keep on allocating memory. Consider this
> > script:
>
> > set -- $(seq 1 100)
> > for i
> > do
> >         : &
> >         sleep .1
> > done
> > for i
> > do
> >         wait %$i
> > done
>
> > This is a valid script and works fine in dash. Your change breaks
> > this by not keeping the jobs around long enough, and I hope this test
> > script shows that there is no way to keep the jobs around long enough
> > but by allocating ever more memory.
>
> I agree that the change is incorrect, but I do not agree that this kind
> of script must leak memory. Per POSIX.1-2008 XCU 2.9.3.1 Asynchronous
> Lists, an implementation has additional ways to forget about jobs than
> just an appropriate completion of the wait utility: if another
> asynchronous job is started when $! was not referenced or if the number
> of known process IDs would exceed {CHILD_MAX} (which tends to be rather
> big, though).

That says the process ID need not be remembered, which makes sense: process IDs are a very limited resource on some systems. This reasoning does not apply to job IDs, and I do not see any corresponding wording for those. Am I overlooking something?

Regardless, it does not leak memory. It continually uses more memory, but it is not a leak if the memory is still in use and can/will be freed when it is no longer needed. The memory used for jobs is released when the job is removed from the job table.

> POSIX does not seem to expect using % in scripts like this; it seems
> highly fragile to me anyway (although $! has problems with process ID
> reuse).

In what way does it not seem to expect this? POSIX says "The job control job ID type of pid is only available on systems supporting the User Portability Utilities option" but does not appear to have any other relevant restrictions. As for fragile, sure, but if a script is correct, it should be processed correctly by a shell.

> FreeBSD sh implements forgetting when $! was not referenced (and the
> job has terminated), but not the {CHILD_MAX} limit. This avoids the
> increasing memory usage in the example script.

See above for the note on process ID vs job ID.

> > Your change makes it impossible to keep track of the background
> > process's status, but if you do not care about that anyway, you can
> > avoid the increasing memory use without modifying dash by launching a
> > background process without including it in the current shell's job
> > table, by launching it from a subshell:
>
> > while true
> > do
> >         (true &)
> >         sleep .1
> > done
>
> Certainly a good idea. Another option may be to include regular
> invocations of the wait utility without parameters, although this is
> not suitable for all scripts.

A third option is invoking the 'jobs' command, possibly redirecting the output to /dev/null. When the 'jobs' command reports that any background job is done, that job is removed from the list of jobs the shell has to remember.

Cheers,
Harald van Dijk
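[Editorial note: a bounded sketch of the 'jobs' option described above; the loop count of 50 is arbitrary, chosen only so the example terminates:

```shell
#!/bin/sh
# Each 'jobs > /dev/null' reports any finished background jobs, which
# lets the shell drop them from its job table, so the table stays small
# no matter how long the loop runs.
i=0
while [ "$i" -lt 50 ]
do
        true &
        jobs > /dev/null
        i=$((i + 1))
done
wait
echo ok
```

Unlike the subshell trick, this keeps each job visible and waitable until the next 'jobs' invocation after it finishes.]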
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
On Sun, Mar 29, 2020 at 08:06:31PM +0100, Harald van Dijk wrote:
> On 29/03/2020 18:54, Vitaly Zuevsky wrote:
> > I have now fixed this bug locally.
> > The leak is in jobtab array (jobs.c). I concluded that the most
> > logical approach would be eliminating inconsistency between
> > makejob() and dowait() functions. My fix in a forked repo:
> > https://salsa.debian.org/psvz-guest/dash/-/commit/5e3ea90cb3355d1308c482661a471883d36af5e7

> This change is incorrect. The reason dash keeps on allocating memory is
> because dash needs to keep on allocating memory. Consider this script:

> set -- $(seq 1 100)
> for i
> do
>         : &
>         sleep .1
> done
> for i
> do
>         wait %$i
> done

> This is a valid script and works fine in dash. Your change breaks this
> by not keeping the jobs around long enough, and I hope this test script
> shows that there is no way to keep the jobs around long enough but by
> allocating ever more memory.

I agree that the change is incorrect, but I do not agree that this kind of script must leak memory. Per POSIX.1-2008 XCU 2.9.3.1 Asynchronous Lists, an implementation has additional ways to forget about jobs than just an appropriate completion of the wait utility: if another asynchronous job is started when $! was not referenced or if the number of known process IDs would exceed {CHILD_MAX} (which tends to be rather big, though).

POSIX does not seem to expect using % in scripts like this; it seems highly fragile to me anyway (although $! has problems with process ID reuse). FreeBSD sh implements forgetting when $! was not referenced (and the job has terminated), but not the {CHILD_MAX} limit. This avoids the increasing memory usage in the example script.
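[Editorial note: the "$! was not referenced" clause can be seen in a short sketch. Referencing $! obliges the shell to remember the child until it is waited for; a shell implementing the POSIX allowance may forget a finished job whose $! was never read. The variable name is illustrative:

```shell
#!/bin/sh
# Reading $! pins the asynchronous job's status so a later wait can
# still retrieve it, even in a shell that forgets unreferenced jobs.
false &
pid=$!        # reference $! before starting any further jobs
wait "$pid"
echo $?       # prints the job's exit status: 1
```

This is the pattern FreeBSD sh's behaviour is designed around: jobs whose PID was captured survive; the rest can be dropped once they terminate.]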
> Your change makes it impossible to keep track of the background
> process's status, but if you do not care about that anyway, you can
> avoid the increasing memory use without modifying dash by launching a
> background process without including it in the current shell's job
> table, by launching it from a subshell:

> while true
> do
>         (true &)
>         sleep .1
> done

Certainly a good idea. Another option may be to include regular invocations of the wait utility without parameters, although this is not suitable for all scripts.

-- 
Jilles Tjoelker
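[Editorial note: the "wait without parameters" option could look like this bounded sketch; the loop count of 100 and the batch size of 10 are arbitrary, chosen only for illustration:

```shell
#!/bin/sh
# Every 10 iterations, a parameterless 'wait' reaps all finished
# background jobs and empties the job table before it can grow far.
n=0
while [ "$n" -lt 100 ]
do
        true &
        n=$((n + 1))
        if [ $((n % 10)) -eq 0 ]; then
                wait
        fi
done
echo done
```

The caveat in the message applies: a parameterless wait blocks until every outstanding job finishes, so this only suits scripts whose background jobs are known to terminate promptly.]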
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Hi Vitaly,

On 29/03/2020 18:54, Vitaly Zuevsky wrote:
> I have now fixed this bug locally.
> The leak is in jobtab array (jobs.c). I concluded that the most logical
> approach would be eliminating inconsistency between makejob() and
> dowait() functions. My fix in a forked repo:
> https://salsa.debian.org/psvz-guest/dash/-/commit/5e3ea90cb3355d1308c482661a471883d36af5e7

This change is incorrect. The reason dash keeps on allocating memory is because dash needs to keep on allocating memory. Consider this script:

set -- $(seq 1 100)
for i
do
        : &
        sleep .1
done
for i
do
        wait %$i
done

This is a valid script and works fine in dash. Your change breaks this by not keeping the jobs around long enough, and I hope this test script shows that there is no way to keep the jobs around long enough but by allocating ever more memory.

Your change makes it impossible to keep track of the background process's status, but if you do not care about that anyway, you can avoid the increasing memory use without modifying dash by launching a background process without including it in the current shell's job table, by launching it from a subshell:

while true
do
        (true &)
        sleep .1
done

Cheers,
Harald van Dijk
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
I have now fixed this bug locally.

The leak is in the jobtab array (jobs.c). I concluded that the most logical approach would be eliminating the inconsistency between the makejob() and dowait() functions. My fix in a forked repo:
https://salsa.debian.org/psvz-guest/dash/-/commit/5e3ea90cb3355d1308c482661a471883d36af5e7

I suspect that the imbalance between allocating and freeing jobtab slots was introduced while evolving the SIGCHLD handling, because I can see two waitforjob() calls, wherein the first call could wait for both done jobs while gotsigchld is set. At the second call, gotsigchld is cleared and waitforjob() does nothing. So 1 of 2 jobtab slots goes astray.

Simplified bug repro steps:

Script
---
while true
do
        true &
        sleep .1
done
---

Execution:
# dash Script &

You can now observe that the resident set size (RSS) is bumped every several seconds:
# ps aux | grep 'RSS\|dash'

What do I do to have it peer-reviewed and, potentially, merged into mainline?

Cheers,
Vitaly

--
Fix details: makejob() calls freejob() when appropriate. Specifically, it detects a free slot in the jobtab array by this condition:

        (jp->state != JOBDONE || !jp->waited)

The problem is that dowait() never sets the waited flag together with the JOBDONE state. Consequently, jobtab may leak: repro steps are provided in bug report #953421. One solution could be not relying on the waited flag at all, i.e. converting makejob()'s condition into (jp->state != JOBDONE). As an alternative, I am setting the waited flag in dowait().
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Hi Andrej,

I cloned and compiled dash (0.5.10.2-6) from https://salsa.debian.org/debian/dash

I did further testing and found my initial repro steps were incorrect. The offending script showing RSS growth is:

--
#!/bin/sh
while true
do
        true || echo &
        sleep .1
done
--

RSS will not grow if the precedence is addressed like so:

        true || (echo &)

I interpret it as the start/terminate cycle of a background job being leaky.

Br,
Vitaly

-----Original Message-----
From: Andrej Shadura [mailto:and...@shadura.me]
Sent: Sunday, March 15, 2020 6:47 PM
To: Vitaly Zuevsky; 953...@bugs.debian.org
Cc: Debian Bug Tracking System
Subject: Re: Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop

Thanks for your report; this most likely is a bug in the upstream package. While I doubt it’s any of our patches that introduced this undesired behaviour, it would be great if you could verify this still happens with the latest upstream Git version of dash, and if it also does, if you followed up with the upstream at d...@vger.kernel.org. If that’s too much work for you right now (it’s okay if it is), I can do that for you, but, unfortunately, I cannot provide a timeline at the moment.

-- 
Cheers,
Andrej
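[Editorial note: the precedence difference between the two forms in Vitaly's message is worth spelling out. '&' terminates the whole AND-OR list, so 'true || echo &' backgrounds the entire 'true || echo' as one job in the shell's job table, while 'true || (echo &)' runs the list in the foreground and, at most, backgrounds echo inside a subshell whose job table is discarded. A sketch (the word 'leaked' is illustrative and is never printed, since true succeeds):

```shell
#!/bin/sh
# '&' applies to the whole AND-OR list: this creates one background job
# running 'true || echo leaked' (echo is skipped because true succeeds).
true || echo leaked &
wait
# With the subshell form, nothing enters this shell's job table at all:
true || (echo leaked &)
echo ok
```

This is why the first form exercises dash's job-table growth on every iteration and the second does not.]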
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Hi,

On Mon, 9 Mar 2020 at 17:09, Vitaly Zuevsky wrote:
> [VZ] I use a shell script to supervise processes in a docker/kubernetes
> container. I noticed steady growth in the cgroup's CPU utilization from
> 15 to 35 millicores within 17 days in absence of any external stimuli
> (dry run). If the container got restarted, the pattern reappeared from
> the base level of 15 millicores.
> [VZ] I reproduced the issue in local docker, sampled the cgroup's
> threads with perf, and ran perf-diff on the reports. It emerged that the
> CPU cost grew at dash's freejob() and the kernel's copy_page() at
> page_fault(). The findings were consistent with dash growing a data
> structure, possibly the one indexed by nprocs at freejob(), on every
> iteration of the loop of the shell script. That data structure would
> have to be cloned at fork() happening along the loop, hence kernel
> page_faults.

Thanks for your report; this most likely is a bug in the upstream package. While I doubt it’s any of our patches that introduced this undesired behaviour, it would be great if you could verify this still happens with the latest upstream Git version of dash, and if it also does, if you followed up with the upstream at d...@vger.kernel.org. If that’s too much work for you right now (it’s okay if it is), I can do that for you, but, unfortunately, I cannot provide a timeline at the moment.

-- 
Cheers,
Andrej
Bug#953421: dash: Resident Set Size growth is unbound (memory leak) on an infinite shell loop
Package: dash
Version: 0.5.8-2.1ubuntu2
Severity: important

Dear Maintainer,

*** Reporter, please consider answering these questions, where appropriate ***

* What led up to the situation?

[VZ] I use a shell script to supervise processes in a docker/kubernetes container. I noticed steady growth in the cgroup's CPU utilization from 15 to 35 millicores within 17 days in the absence of any external stimuli (dry run). If the container got restarted, the pattern reappeared from the base level of 15 millicores.

* What exactly did you do (or not do) that was effective (or ineffective)?

[VZ] I reproduced the issue in local docker, sampled the cgroup's threads with perf, and ran perf-diff on the reports. It emerged that the CPU cost grew at dash's freejob() and the kernel's copy_page() at page_fault(). The findings were consistent with dash growing a data structure, possibly the one indexed by nprocs at freejob(), on every iteration of the loop of the shell script. That data structure would have to be cloned at fork() happening along the loop, hence the kernel page faults. To verify my suspicions I reduced the sleep interval in the loop and monitored the Resident Set Size.

Reproduction steps without containers. The shell script:
-
#!/bin/sh
cd /tmp
while true
do
        [ "`pgrep rsyslogd`" ] || service rsyslog start
        [ "`pgrep cron`" ] || service cron start
        [ "`pgrep sshd`" ] || service ssh start
        env >/root/env
        sleep .5
done
-

Save it as a file called woof, chmod it executable and run it as:

# nohup ./woof &

Monitoring with ps unveils unbounded growth, e.g. (note the first figure, the RSS):

1628  wait    S  ?   0:00  /bin/sh ./woof
3456  wait    S  ?   1:02  /bin/sh ./woof
5188  wait    S  ?   2:59  /bin/sh ./woof
6584  wait    S  ?   5:19  /bin/sh ./woof
7076  wait    S  ?   6:17  /bin/sh ./woof
9300  wait    S  ?  23:54  /bin/sh ./woof

* What was the outcome of this action?

[VZ] I had to resort to the #!/bin/bash shebang as it rendered the expected behaviour:

2684  wait    S  ?   0:00  /bin/bash ./woof
3896  wait    S  ?   1:32  /bin/bash ./woof
3896  wait    S  ?   3:10  /bin/bash ./woof
3896  wait    S  ?   3:45  /bin/bash ./woof
3896  pipe_w  S  ?  15:20  /bin/bash ./woof

RSS remains constant at 3896.

* What outcome did you expect instead?

[VZ] I expect the RSS of dash to remain bounded in the case of an infinite shell loop the same way as the RSS of bash does.

*** End of the template - remove these template lines ***

-- System Information:
Debian Release: stretch/sid
APT prefers xenial-updates
APT policy: (500, 'xenial-updates'), (500, 'xenial-security'), (500, 'xenial'), (100, 'xenial-backports')
Architecture: amd64 (x86_64)
Kernel: Linux 4.4.0-1074-aws (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=ANSI_X3.4-1968) (ignored: LC_ALL set to C)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages dash depends on:
ii  debianutils  4.7
ii  dpkg         1.18.4ubuntu1.4
ii  libc6        2.23-0ubuntu10

dash recommends no packages.
dash suggests no packages.

-- debconf-show failed