On Sun, Mar 15, 2020 at 03:36:39PM +0700, Robert Elz wrote:
> The NetBSD shell - and I suspect many others, perhaps all others - waits
> for any terminated children (reaps them from the kernel) more or less as
> soon as they exit - then remembers the info in the internal jobs table
> for later reporting status via "wait $pid" or "jobs"  (or just an
> interactive prompt) at the appropriate time.

> This has the advantage that the kernel's process table has zombie
> processes removed quickly, and isn't cluttered with trash lying around
> because some script is running lots of background processes without waiting
> for any of them - the only cost is (or seems to be) some memory in the
> shell's jobs table (which the standard allows us to bound, if we desire).

> However, I have been pondering a somewhat weird case (or more
> correctly, possibility, as I have never actually seen it happen)

> Consider

>       bg-process-1 & PID1=$!
>       long-running-monster-fg-process
>       bg-process-2 & PID2=$!

> "long-running-monster-fg-orocess" is something like a complete system
> build, including lots of add-on utilities (imagine, gnome and all that
> goes with it, and kde, and all associated with that ...) - it doesn't
> really matter, except that there are lots of processes being run.
> It is irrelevant whether that is lots of children from the current
> shell, or whether that is a script (or "make" or something) that simply
> takes a long time to complete.

> In this case, and with the shell strategy above, it is possible that
> PID1 and PID2 contain the same value.

> In that case, if both background processes have exited, and
> the script then does

>       wait $PID1

> what are we supposed to do?   How are we to distinguish that from

>       wait $PID2

> ?

> Does anyone know of a shell that correctly handles this now?

About six years ago I committed something to FreeBSD sh that fixed the
storage of the exit status of the second process:
https://svnweb.freebsd.org/base?view=revision&revision=263453

The commit message also included a (slow) test:
  exit 7 & p1=$!; until exit 8 & p2=$!; [ "$p1" = "$p2" ]; do wait "$p2";
  done; sleep 0.1; wait %1; echo $?

I guess I did not fix the problem of $! not being a unique identifier
because a fix seemed very hard for a very improbable situation.

> The only solutions I can see are to:

> Only ever use waitpid() with an explicit pid for the particular
> process of which we actually want the exit status, and leave all
> other completed processes as zombies until they are wanted (any of
> the other newer wait*() sys calls with similar functionality would
> do as well of course).

> This would mean that while a pipeline is running, we would be unable
> to report status of earlier completed elements of the pipe when the
> final (rightmost) process is still yet to complete, which would be annoying
> (but not actually fatal to anything).

> It would also mean that there would be no way to retain the

>       wait -p PID -n $PID1 $PID2 ...

> command option that the NetBSD shell has, which waits for any one of the
> specified jobs to finish (any that has already finished, in which case
> there is no actual wait and a random one of the completed jobs is selected),
> or the next of them which happens to finish, if none were already done.
> (The "-p PID" option names a variable in which the ID of the job that
> finished is placed - the same as the arg string if there is one, with no
> pid args, the pid of the job (what $! was when the job started), the exit
> status of the wait command is the status of that job).   That relies on
> being able to wait for any child to exit.

> Or:

> We always use wait*() with the WNOWAIT flag when waiting for any random
> child to complete, and then wait() again (without WNOWAIT, but with the
> explicit pid) when we want to clean up the jobs table entry for that job.

> The problem with this (aside from WNOWAIT in the standard only applying
> to waitid() - in practice I suspect that all of the wait*() sys calls
> that take a flags arg implement the same set of flags - certainly NetBSD
> does) is that I see no way to prevent that child process being returned
> again and again every time we do an anonymous wait*() system call.  That
> is, I see no way to wait for something not previously ever waited upon,
> which is what we would need here - the kernel would need a bunch more
> mechanism, and a new WXXXXX flag would be required.    NetBSD has
> WNOZOMBIE ("Ignore zombies") which only waits for some running process
> to change status - but that's no use, we want to get status from processes
> that have already exited (ie: zombies) if there are any - just only once.

I agree that WNOWAIT is not helpful here because the same child process
may be returned over and over.

What is necessary here is a notification mechanism for process
termination that is not a wait*() function. Such a mechanism does not
seem to exist in POSIX but exists on various operating systems:

* fully queued SIGCHLD with siginfo (works on FreeBSD but not Linux)
* kqueue with EVFILT_PROC (various BSD systems)
* proc connector (Linux)
* whatever pwait(1) uses (Solaris and related systems)

By the way, most of these mechanisms also allow waiting for an unrelated
process to terminate.

> Of course, both of these "solutions" mean keeping zombies in the kernel
> process table - that's the point, as that prevents the kernel from
> re-using the process ID.

Yes, although keeping the zombie is only necessary for the most recent
background process, or for one whose $! has been referenced.

> Or:

> Every time the shell forks, before running any of the subshell code,
> it could check whether the PID it was assigned is a PID that is still "live"
> in the jobs table, and if so, it simply exits without doing anything.
> Simultaneously the parent is doing the same check using the new child's
> PID.   Since the two are simply forks() of the parent, the data structures
> they see are identical - both child and parent will answer that check the
> same way.   When the check reports "still in use" the child simply exits
> (as mentioned); the parent simply does a waitpid(PID, ...) to clean up
> that child (without ever having entered it into any data structs) and
> then forks again, and the whole process repeats.

> This is the solution I see with most promise, but relies upon the kernel
> not simply assigning the same pid over and over again (even if there happens
> to only be one available unused pid to assign).   To deal with this the
> parent shell would need something like a counter of attempts, and if we
> fail to get a new pid after a few attempts, give up, and signal a fork error.

Reuse can be prevented by delaying the waitpid() on the unwanted
duplicates until a process with a unique PID has been created or a limit
has been reached. In the general case this information can be stored as
a flag in the previous job structure, so it does not allocate unbounded
memory in userspace.

> This looks kind of cumbersome and ugly to me - even though I don't
> currently see any other plausible solution to this, that meets our goals.

> I'd love to hear from anyone who has (or can even imagine, regardless of
> whether it is currently implemented anywhere) a better solution for this
> issue.   Or if, for some reason I am not understanding, this isn't even a
> potential (certainly it is extremely unlikely) problem, then why.

> ps: note that we don't currently have a problem with the kernel assigning
> the pid of a previously exited process, which is still alive in the
> jobs table, the shell can cope with that - the issue only arises when that
> pid is communicated to the script, and then used by the script.   A similar
> problem would be if the script attempted

>       kill $PID1

> after bg-process-1 has finished (without the script realising that)
> which then ends up signalling $PID2 (the same thing) which is still
> running.   Of course, a similar problem can happen here, without PID2
> being involved - with the script simply signalling some unintended process.
> The only way of avoiding that would be to keep the zombies until the
> script has been made aware that the process is completed, after which it
> is simply a script bug if it tries to kill a process it knows is already
> complete.

A possible fix would be to add a magic variable that returns the most
recent background job's identifier in %<number> form, somewhat like $!
in that referencing it causes the job to be remembered. Scripts would
need to use the new variable instead of $!.

-- 
Jilles Tjoelker
