On 7/15/25 1:26 PM, Zachary Santer wrote:
On Tue, Jul 15, 2025 at 11:39 AM Chet Ramey <chet.ra...@case.edu> wrote:

On 7/15/25 11:35 AM, Zachary Santer wrote:
On Tue, Jul 15, 2025 at 10:29 AM Chet Ramey <chet.ra...@case.edu> wrote:

With respect to not consulting the list of saved pids if given a list
of pid arguments, solving this problem.

I'd like to think that's just a stopgap, but yeah, I can play around with that.

Why do you think it would be a stopgap?

We had that super-long discussion, and then you just throw away a
pretty big part of those changes immediately after there's any kind of
issue?

No.

Maybe I was being optimistic.

OK, if you haven't tested or read the patch, let's see if I can
describe the issue precisely enough and describe the solution well
enough so you won't have to.

I defined the problem like this:

"The short story is that `wait -n' now returns the status of any process
that's completed and hasn't been waited for yet, just like `wait'. It's
not restricted to processes that terminates after it's invoked. It does
this even when it's given a list of pid or job arguments, which is
where the problem here arises."

This means that in posix mode, even when it's given a list of pids to
operate on, bash-5.3 can return the pid of a process from the terminated
background pids list (the `bgpids list') that is not one of the pids
supplied as an argument.

It will do this if it doesn't find one of the pids it's looking for in the
bgpids list, the list of terminated procsubs (before they get moved to the
bgpids list), or a terminated job. This is what it would do if it were not
supplied any pid arguments, and, as several people noted, is a real
problem.

In this case, `wait' is returning the status of the process substitution,
which is saved on the bgpids list, even though its pid is not one of the
ones it should look for.

The fix is to pick and return a terminated pid from the bgpids list only if
we are *not* restricting action to a list of pids.

If we have such a list, we check as above, then wait for a process to
terminate if all the processes of interest are still running, then do the
checks again, repeating until we find a terminated process we're interested in.

That's why the patch is so short -- it adds a missing test.

The workaround I suggested works for the example script because the extra
`wait' discards the contents of the bgpids list, leaving nothing to return
even if the code checks for it. It's something you can put in a script
immediately, if it's appropriate, while you wait for me to release a patch
the distros will accept.


It doesn't seem to me like consulting the list of saved pids or not
would even be particularly relevant to the issue here.

This is a bad assumption. `wait' consults the bgpids list to see if it
has one of the pids its interested in. The issue here is that it's
returning one that's not one of those pids.

'wait -n' is
mistakenly returning the termination status of a procsub, which
wouldn't show up in that list to begin with, if I remember correctly.

You don't.

There is a list of active process substitutions, and some time after a procsub terminates, its pid gets saved on the bgpids list. Process
substitutions set $!, which means you can potentially save that pid and
wait for it later, and saving its pid to the bgpids list is how you make
that happen. In this case, the procsub terminated when it finished
writing the jobs output and was reaped when mapfile finished and the shell
went back to read the next command. At that point, it got moved to the
bgpids list.

This instead of waiting for and returning the termination status of
the single regular background process whose pid was given to it
explicitly, even though 'wait -n' was called before that background
process had terminated, and thus it also wouldn't show up in that
list.

Pretty much, as detailed above.


If we *are* looking at a race condition thing, your change could have
affected timing.

We're not. The patch was already there to verify that when you replied.


Not sure if I can realistically dive into and understand the source
code myself right away.

OK. If it's going to be a black box -- not everyone wants to get into the
code -- you can apply the patch, rebuild, and observe the changed behavior.
It's easier if you make all of the asynchronous processes have distinct
exit statuses and inspect $? after `wait' returns so you can tell which
one it grabbed.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature

Reply via email to