On 7/15/25 1:26 PM, Zachary Santer wrote:
On Tue, Jul 15, 2025 at 11:39 AM Chet Ramey <chet.ra...@case.edu> wrote:On 7/15/25 11:35 AM, Zachary Santer wrote:On Tue, Jul 15, 2025 at 10:29 AM Chet Ramey <chet.ra...@case.edu> wrote:With respect to not consulting the list of saved pids if given a list of pid arguments, solving this problem.I'd like to think that's just a stopgap, but yeah, I can play around with that.Why do you think it would be a stopgap?We had that super-long discussion, and then you just throw away a pretty big part of those changes immediately after there's any kind ofissue?
No.
Maybe I was being optimistic.
OK, if you haven't tested or read the patch, let's see if I can describe the issue precisely enough and describe the solution well enough so you won't have to. I defined the problem like this: "The short story is that `wait -n' now returns the status of any process that's completed and hasn't been waited for yet, just like `wait'. It's not restricted to processes that terminates after it's invoked. It does this even when it's given a list of pid or job arguments, which is where the problem here arises." This means that in posix mode, even when it's given a list of pids to operate on, bash-5.3 can return the pid of a process from the terminated background pids list (the `bgpids list') that is not one of the pids supplied as an argument. It will do this if it doesn't find one of the pids it's looking for in the bgpids list, the list of terminated procsubs (before they get moved to the bgpids list), or a terminated job. This is what it would do if it were not supplied any pid arguments, and, as several people noted, is a real problem. In this case, `wait' is returning the status of the process substitution, which is saved on the bgpids list, even though its pid is not one of the ones it should look for. The fix is to pick and return a terminated pid from the bgpids list only if we are *not* restricting action to a list of pids. If we have such a list, we check as above, then wait for a process to terminate if all the processes of interest are still running, then do the checks again, repeating until we find a terminated process we're interested in. That's why the patch is so short -- it adds a missing test. The workaround I suggested works for the example script because the extra `wait' discards the contents of the bgpids list, leaving nothing to return even if the code checks for it. It's something you can put in a script immediately, if it's appropriate, while you wait for me to release a patch the distros will accept.
It doesn't seem to me like consulting the list of saved pids or not would even be particularly relevant to the issue here.
This is a bad assumption. `wait' consults the bgpids list to see if it has one of the pids its interested in. The issue here is that it's returning one that's not one of those pids.
'wait -n' is mistakenly returning the termination status of a procsub, which wouldn't show up in that list to begin with, if I remember correctly.
You don't.There is a list of active process substitutions, and some time after a procsub terminates, its pid gets saved on the bgpids list. Process
substitutions set $!, which means you can potentially save that pid and wait for it later, and saving its pid to the bgpids list is how you make that happen. In this case, the procsub terminated when it finished writing the jobs output and was reaped when mapfile finished and the shell went back to read the next command. At that point, it got moved to the bgpids list.
This instead of waiting for and returning the termination status of the single regular background process whose pid was given to it explicitly, even though 'wait -n' was called before that background process had terminated, and thus it also wouldn't show up in that list.
Pretty much, as detailed above.
If we *are* looking at a race condition thing, your change could have affected timing.
We're not. The patch was already there to verify that when you replied.
Not sure if I can realistically dive into and understand the source code myself right away.
OK. If it's going to be a black box -- not everyone wants to get into the code -- you can apply the patch, rebuild, and observe the changed behavior. It's easier if you make all of the asynchronous processes have distinct exit statuses and inspect $? after `wait' returns so you can tell which one it grabbed. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRU c...@case.edu http://tiswww.cwru.edu/~chet/
OpenPGP_signature.asc
Description: OpenPGP digital signature