Hi Chet, thanks for taking the time to review this :D
[My apologies again upfront for another lengthy (comprehensive?) email.]
On Wed, 3 Apr 2024, Chet Ramey wrote:
> On 4/2/24 12:22 PM, Carl Edquist wrote:
>> the forked coproc has to close its fds to/from _all_ other existing
>> coprocs (as there can be several).
>
> And there is the issue. Without multi-coproc support, the shell only
> keeps track of one coproc at a time, so there's only one set of pipe
> file descriptors to close.
Right, exactly. The example with the default build (showing the essential
case that causes deadlock) was to highlight that your multi-coproc support
code apparently does indeed correctly track and close all these fds, and
thus prevents the deadlock issue.
On Wed, 3 Apr 2024, Chet Ramey wrote:
> It's straightforward: the coproc process terminates, the shell reaps it,
> marks it as dead, notifies the user that the process exited, and reaps
> it before printing the next prompt. I don't observe any different
> behavior between the default and when compiled for multiple coprocs.
>
> It depends on when the process terminates as to whether you get a prompt
> back and need to run an additional command before reaping the coproc
> (macOS, RHEL), which gives you the opportunity to run the `read'
> command:
Ah, my mistake then - thanks for explaining. I must have been thrown off
by the timing when running it with and without an intervening interactive
prompt before the read command. When run interactively, an extra 'Enter'
(or not) before the read command changes the behavior.
So in that case, this issue (that the shell closes its read-end of the
pipe from a reaped coproc, potentially before being able to read the final
output) was already there and is not specific to the multi-coproc code.
But in any case, it seems like this is a race then? That is, whether the
child process terminates before or after the prompt in question.
> $ coproc WC { wc; }
> [1] 48057
> $ exec {WC[1]}>&-
> $ read -u ${WC[0]} X
> [1]+  Done                    coproc WC { wc; }
> bash: DEBUG warning: cpl_reap: deleting 48057
> $ echo $X
> 0 0 0
>
> (I put in a trace statement to show exactly when the coproc gets reaped and
> deallocated.)
Thanks! (for taking the time to play with this)
Though apparently it's still a race here. If you diagram the shell and
coproc (child) processes, I think you'll see that your DEBUG statement can
also happen _before_ the read command, which would then fail. You can
contrive this by adding a small sleep (eg, 0.1s) at the end of
execute_builtin_or_function (in execute_cmd.c), just before it returns.
Eg:
diff --git a/execute_cmd.c b/execute_cmd.c
index ed1063e..c72f322 100644
--- a/execute_cmd.c
+++ b/execute_cmd.c
@@ -5535,6 +5535,7 @@ execute_builtin_or_function (words, builtin, var, redirects,
discard_unwind_frame ("saved_fifos");
#endif
+ usleep(100000);
return (result);
}
If I do this, I consistently see "read: X: invalid file descriptor
specification" running the above 4-line "coproc WC" example in a script,
demonstrating that there is no guarantee that the read command will start
before the WC coproc is reaped and {WC[0]} is closed, even though it's the
next statement after 'exec {WC[1]}>&-'.
But (as I'll try to show) you can trip up on this race even without
slowing down bash itself artificially.
> I can't reproduce your results with non-interactive shells, either, with
> job control enabled or disabled.
That's fair; let's try it with a script:
$ cat cope.sh
#!/bin/bash
coproc WC { wc; }
jobs
exec {WC[1]}>&-
[[ $1 ]] && sleep "$1"
jobs
read -u ${WC[0]} X
echo $X
Run without sleep, the wc output is seen:
$ ./cope.sh
[1]+ Running coproc WC { wc; } &
[1]+ Running coproc WC { wc; } &
0 0 0
Run with a brief sleep after closing the write end, and it breaks:
$ ./cope.sh .1
[1]+ Running coproc WC { wc; } &
[1]+ Done coproc WC { wc; }
./cope.sh: line 8: read: X: invalid file descriptor specification
And, if I run with "0" for a sleep time, it intermittently behaves like
either of the above. Racy!
>> This is a bug. The shell should not automatically close its read pipe
>> to a coprocess that has terminated -- it should stay open to read the
>> final output, and the user should be responsible for closing the read
>> end explicitly.
>
> How long should the shell defer deallocating the coproc after the
> process terminates?
I only offer my opinion here, but it strikes me that it definitely should
_not_ be based on an amount of _time_. That's inherently racy.
In the above example, there is only a single line to read; but the general
case there may be many 'records' sitting in the pipe waiting to be
processed, and processing each record may take an arbitrary amount of
time. (Consider a coproc containing a sort command, for example, that
produces all its output lines at once after it sees EOF, and then
terminates.)
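To make the sort case concrete, here is a self-contained sketch (the names
and the trailing sleep are mine, not from the thread; the sleep only keeps
the child alive while the pipe is drained, precisely to dodge the
deallocation race this email is about):

```shell
#!/bin/bash
# A sort coproc emits ALL of its output only after it sees EOF on stdin,
# so many records can be sitting in the pipe after sort itself is done.
coproc SORT { sort; sleep 1; }   # sleep keeps the child alive during reads

printf '%s\n' banana apple cherry >&"${SORT[1]}"
exec {SORT[1]}>&-                # send EOF; sort now writes everything

out=()
while IFS= read -r line <&"${SORT[0]}"; do
  out+=("$line")
done
printf '%s\n' "${out[@]}"
```

Without the artificial sleep, whether those reads succeed depends on
whether the shell reaps the coproc (and closes SORT[0]) first.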
Zack illustrated basically the same point with his example:
exec {fd}< <( some command )
while IFS='' read -r line <&"${fd}"; do
# do stuff
done
exec {fd}<&-
A process-substitution open to the shell like this is effectively a
one-ended coproc (though not in the jobs list), and it behaves reliably
here because the user can count on {fd} to remain open even after the
child process terminates.
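For instance, here is a runnable instance of Zack's pattern, with `seq 5`
standing in for "some command" (my substitution):

```shell
#!/bin/bash
# Process substitution behaves like a one-ended coproc: the shell does NOT
# close {fd} behind your back when the child exits.
exec {fd}< <(seq 5)

lines=()
while IFS='' read -r line <&"${fd}"; do
  lines+=("$line")      # "do stuff"
done

# {fd} stayed open and readable even though the child may have exited long
# before the loop finished; the user closes it when done:
exec {fd}<&-
echo "read ${#lines[@]} lines"
```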
So, the user can determine when the coproc fds are no longer needed,
whether that's when EOF is hit trying to read from the coproc, or whatever
other condition.
Personally I like the idea of 'closing' a coproc explicitly, but if it's a
bother to add options to the coproc keyword, then I would say just let the
user be responsible for closing the fds. Once the coproc has terminated
_and_ the coproc's fds are closed, then the coproc can be deallocated.
Apparently there is already some detection in there for when the coproc
fds get closed, as the {NAME[@]} fd array members get set to -1
automatically when you do, eg, 'exec {NAME[0]}<&-'. So perhaps this
won't be a radical change.
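A quick way to check that observation (variable names mine; the sleep again
just keeps the child alive past the close so the coproc isn't reaped first):

```shell
#!/bin/bash
# After the user closes a coproc fd with exec, bash rewrites the
# corresponding array member to -1, leaving the other member intact.
coproc CAT { cat; sleep 1; }
rfd=${CAT[0]} wfd=${CAT[1]}
exec {CAT[0]}<&-
echo "was: $rfd $wfd  now: ${CAT[0]} ${CAT[1]}"
```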
Alternatively (or, additionally), you could interpret 'unset NAME' for a
coproc to mean "deallocate the coproc." That is, close the {NAME[@]} fds,
unset the NAME variable, and remove any coproc bookkeeping for NAME.
(Though if the coproc child process hasn't terminated on its own yet,
still it shouldn't be killed, and perhaps it should remain in the jobs
list as a background process until it's done.)
...
[And if you _really_ don't want to defer deallocating a coproc after it
terminates, I suppose you can go ahead and deallocate it in terms of
removing it from the jobs list and dropping any bookkeeping for it - as
long as you leave the fds and fd variables intact for the user.
It's a little dicey, but in theory it should not lead to deadlock, even if
copies of these (now untracked) reaped-coproc pipe fds end up in other
coprocs. Why not? Because (1) this coproc is dead and thus won't be
trying to read anymore from its (now closed) end, and (2) reads from the
shell's end will not block since there are no copies of the coproc's write
end open anymore.
Still, it'd be cleaner to defer deallocation, to avoid these stray (albeit
harmless) copies of fds making their way into new coprocs.]
> What should it do to make sure that the variables don't hang around with
> invalid file descriptors?
First, just to be clear, the fds to/from the coproc pipes are not invalid
when the coproc terminates (you can still read from them); they are only
invalid after they are closed.
The surprising bit is when they become invalid unexpectedly (from the
point of view of the user) because the shell closes them automatically,
at the somewhat arbitrary timing when the coproc is reaped.
Second, why is it a problem if the variables keep their (invalid) fds
after closing them, if the user is the one that closed them anyway?
Isn't this how it works with the auto-assigned fd redirections?
Eg:
$ exec {d}<.
$ echo $d
10
$ exec {d}<&-
$ echo $d
10
But, as noted, bash apparently already ensures that the variables don't
hang around with invalid file descriptors, as once you close them the
corresponding variable gets updated to "-1".
> Or should the user be responsible for unsetting the array variable too?
> (That's never been a requirement, obviously.)
On the one hand, bash is already messing with the coproc array variables
(setting the values to -1 when the user closes the fds), so it's not
really a stretch in my mind for bash to unset the whole variable when the
coproc is deallocated. On the other hand, as mentioned above, bash leaves
automatically allocated fd variables intact after the user explicitly
closes them.
So I guess either way seems reasonable.
If the user has explicitly closed both fd ends for a coproc, it should not
be a surprise to the user either way - whether the variable gets unset
automatically, or whether it remains with (-1 -1).
Since you are already unsetting the variable when the coproc is
deallocated though, I'd say it's fine to keep doing that -- just don't
deallocate the coproc before the user has closed both fds.
>> It also invites trouble if the shell variable that holds the fds gets
>> removed unexpectedly when the coprocess terminates. (Suddenly the
>> variable expands to an empty string.) It seems to me that the proper
>> time to clear the coproc variable (if at all) is after the user has
>> explicitly closed both of the fds.
>
> That requires adding more plumbing than I want to,
Your project your call :D
> especially since the user can always save the file descriptors of
> interest into another variable if they want to use them after the coproc
> terminates.
*Except* that it's inherently a race condition whether the original
variables will still be intact to save them.
Even if you attempt to save them immediately:
coproc X { exit; }
X_BACKUP=( ${X[@]} )
it's not guaranteed that X_BACKUP=(...) will run before coproc X has been
deallocated, and the X variable cleared.
No doubt this hasn't escaped you, but in any case you can see it for
yourself if you introduce a small delay in execute_coproc, in the parent,
just after the call to make_child:
diff --git a/execute_cmd.c b/execute_cmd.c
index ed1063e..5949e3e 100644
--- a/execute_cmd.c
+++ b/execute_cmd.c
@@ -2440,6 +2440,8 @@ execute_coproc (command, pipe_in, pipe_out, fds_to_close)
exit (estat);
}
+ else
+ usleep(100000);
close (rpipe[1]);
close (wpipe[0]);
When I try this, X_BACKUP is consistently empty. Though if you add, say,
a call to "sleep 0.2" to coproc X, then X_BACKUP consistently gets a copy
of X's fds in time.
I am sorry if this sounds contrived, but I hope it demonstrates that
closing the fds and unsetting the variable for a coproc automatically when
it terminates is fundamentally flawed, because it depends on the arbitrary
race timing between the two processes.
>> *Or* else add an option to the coproc keyword to explicitly close the
>> coproc - which will close both fds and clear the variable.
>
> Not going to add any more options to reserved words; that does more
> violence to the grammar than I want.
Not sure how you'd feel about using 'unset' on the coproc variable
instead. (Though as discussed, I think the coproc terminated + fds
manually closed condition is also sufficient.)
.............
Anyway, as far as I'm concerned there's nothing urgent about all this, but
(along with the multi-coproc support that you implemented), avoiding the
current automatic deallocation behavior would seem to go a long way toward
making coproc a correct and generally useful feature.
Thanks for your time!
Carl
PS Zack you're welcome :)