Re: wait -n misses signaled subprocess

Chet Ramey Thu, 08 Feb 2024 13:05:27 -0800

On 1/31/24 2:35 PM, Robert Elz wrote:

   | Not quite. `new' in this sense is the opposite of `anything in the past'
   | as Dale described it -- already notified and removed from the jobs list.


I guess the part about bash that I am not understanding here is how the
"already notified" works.   To me there are just two ways for that, either
the user has done a "wait" which has collected that pid already (either
without -n, and no pid args, or with pid args and one of those is the pid
in question) or with -n and the pid in question was the one whose status
was returned, or the user/script did the jobs command (or jobs -l) and the
job in question was shown as completed.

Is there some other way?


Notification after a job terminates due to a signal in a non-interactive
shell -- that runs the equivalent of `jobs'. As it turns out, this was the
problem with Steven Pelley's original report. I fixed one issue, but that
kind of notification will leave jobs marked as notified and eligible to
be removed from the jobs list.
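
For illustration only (a minimal sketch, not the script from the original
report): a background job killed by a signal, with its status collected
through `wait' by pid. The `sleep 30' workload and the variable names are
assumptions made for the example. In a non-interactive bash, the shell may
also print a termination notice to stderr when it reaps the job.

```shell
# Hypothetical long-running background job; any command would do.
sleep 30 &
pid=$!

# Terminate it with a signal rather than letting it exit normally.
kill -TERM "$pid"

# Collect the status by pid. The "|| status=$?" form also works under set -e.
status=0
wait "$pid" || status=$?

# For a signal death the status is 128 plus the signal number (TERM is 15).
echo "status=$status signal=$((status - 128))"
```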


   | Half the problem here is that bash aggressively marks dead jobs as being
   | notified in non-interactive shells without job control enabled, and moves
   | them out of the jobs table.

That might be more than half the problem, it might be the entire problem.


It seems to be in this case. It's a good thing it's limited to processes
that terminate due to signals; a bad thing that processes terminating due
to signals was the entire subject of the original report.


   | but if you
   | do, or if you use wait -n with pid/job arguments (which you've presumably
   | saved yourself) you're going to need slightly different semantics than we
   | have now to answer that reliably. And that will probably need a new option.

That's a pity, particularly since the current semantics don't seem to
be useful in general.


Shoehorning pid/job arguments into the previous behavior, which only dealt
with running jobs, resulted in the current semantics. I should probably
have made `wait -n' with pid arguments look at terminated and notified
processes, but I didn't change the `running job' semantics. Hindsight.

Since the sole issue provoking that seems to be
the wait over and over policy,


It's not a policy, per se, it's behavior that has historically worked that
way.

rather than "wait once, and remove completely"


POSIX semantics.

perhaps rather than a new, but different, -n like option, a better idea would
be a "only once" option (ie: if the option (-r (remove) or -c (cleanup) or -o
(once only)) is set, then when the wait with that option returns status,
or waits until termination without returning status (in the not -n case, with
no pid args, or many pid args) then the processes are completely deleted from
everywhere in the shell).


Or you could use posix mode with the recent change, already in devel, since
POSIX requires this behavior (but see below).

 Using that option would make a changed -n safe
to use in loops.   If you do that, also add an option (maybe the upper case
version of whatever is selected for that one, or just some other letter) to
mean "don't wait" (kind of like wait(2) WNOWAIT) - which in default bash would
just be a no-op (except in posix mode, apparently - whereas the -[cor] option
would be a no-op in posix mode).


You're not the only one to suggest some new option(s). Only one really
matters for this discussion.


If you were to do that, other shells could add the same (except in probably
all of them, -[cor] would always be the default, and the other one would be
the one which changes behaviour).


That's always hit or miss.

   | > The one change that should be made is
   | > to allow wait -n to collect processes/jobs that have already terminated.
   |
   | Yes, that's one of the things we're talking about. I don't have any problem
   | with it, but should it take a new option to change those semantics?

Good, though I think some more thought should go into that.   In another
thread you said (paraphrasing) correctly, that scripts should not be
relying upon bugs, and the current wait -n behaviour is a bug - that it
might have been intentionally coded that way doesn't make it any less so.


Trust me, there are people on the other side of that question.

It isn't as if it was ever documented to work the way it does, or everyone
would have known about it already.


You mean the behavior of `wait -n' with pid arguments, I presume. The
problem with your statement is that people do know about it. The question,
as above, is whether or not to avoid changing the behavior because they do.

There are two things that we could change:

1. wait -n needs to get access to the list of terminated pids (the ones
   that satisfy POSIX's "CHILD_MAX processes known in the current shell
   environment"), like wait without -n does. This can happen via a wait
   option, a shell option, or a change in behavior controlled by the
   compatibility level.

2. Some option to implement the posix-mode semantics of removing a pid
   from this list of "known processes" that has finer granularity than
   `set -o posix'. This can happen in the same way(s).
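
As a rough sketch of the behavior #1 refers to ("like wait without -n
does"): `wait' with an explicit pid still reports the status of a child
that terminated before anyone waited for it. The exit codes, the one-second
delay, and the variable names are assumptions chosen for the example.

```shell
# Hypothetical workload: three background jobs with distinct exit statuses.
pids=""
for code in 0 3 5; do
    ( exit "$code" ) &
    pids="${pids:+$pids }$!"
done

# Give the children time to terminate before anything waits for them.
sleep 1

# Waiting by pid consults the remembered statuses, so a child that has
# already exited is still reported.
statuses=""
for pid in $pids; do
    rc=0
    wait "$pid" || rc=$?
    statuses="${statuses:+$statuses }$rc"
done

echo "$statuses"
```

A loop built on `wait -n' instead would depend on the semantics under
discussion here, which is why the pids are saved and waited for by number.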

message was unclear about what "more like wait without -n" meant.


#1 above.

   | Yeah, but we're talking about bash here. It doesn't really matter what
   | the Bourne shell did; there are likely plenty of scripts that assume
   | the historical bash behavior.

Really?   Why?   What's the point of collecting the status twice?


Who can say? But is it reasonable for wait to return a status for a pid
that terminated due to a signal and displayed a status message? Or, since
`jobs' lets you see the status but not capture or do anything with it,
is it reasonable to allow wait to collect the status of those, too?

So you have this second list, which you need anyway to keep track of the
last CHILD_MAX exit statuses. Back in 2005 I didn't want to use that
much storage in the jobs list to save all these exited process statuses,
and I didn't want to spend time traversing a huge jobs list to add a new
one. Let's just say it was a less capable device world. Hell, the script
in the original bug report that resulted in this took 1-2 hours to run.
The information is there if you need it, but saving it doesn't slow normal
operation down.

And bash only lazily removes pids from that second list (hash table,
really), when you exceed CHILD_MAX (or the RLIMIT_NPROC limit, or the max
upper bound), so you can wait for them more than once. I'm sure there are
scripts that take advantage of it for some reason I can't think of.
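
A small sketch of what "wait for them more than once" looks like in a
script. The exit code 7 and the variable names are assumptions. The first
status is reliable; what the second `wait' returns depends on the shell and
mode (default-mode bash, as described above, versus posix mode or another
shell), so the example prints it rather than assuming a value.

```shell
# Hypothetical child with a recognizable exit status.
( exit 7 ) &
pid=$!

# First wait: collects the child's status.
first=0
wait "$pid" || first=$?

# Second wait for the same pid: the shell either still remembers the
# status or reports the pid as unknown.
second=0
wait "$pid" 2>/dev/null || second=$?

echo "first=$first second=$second"
```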

Maybe a better discussion, and potential change, would be to whatever
other than the use of the wait, or jobs, commands can result in a job
moving out of the jobs list.   If there were nothing other than those,
(and jobs list overflow or similar) then we'd be fine, and it seems to
me now, no change to the -n operation would be needed.


See above.


   | That hasn't actually been true with bash running in default mode for a
   | very long time now. Bash has allowed multiple waits for the same pid for
   | many years, whether or not you or I think it's a good idea or the correct
   | semantics. Even if it was an accident of the implementation, and maybe you
   | could say it was, we are stuck with it.

Which is why I suggested an option (just above) to turn that misfeature off.
Even better perhaps might be a bash shopt.


See #2 above.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [email protected]    http://tiswww.cwru.edu/~chet/
