Hi Chet, thanks for taking the time to review this  :D

[My apologies again upfront for another lengthy (comprehensive?) email.]


On Wed, 3 Apr 2024, Chet Ramey wrote:

On 4/2/24 12:22 PM, Carl Edquist wrote:

 the forked coproc has to close its fds to/from _all_ other existing
 coprocs (as there can be several).

And there is the issue. Without multi-coproc support, the shell only keeps track of one coproc at a time, so there's only one set of pipe file descriptors to close.

Right, exactly. The example with the default build (showing the essential case that causes deadlock) was to highlight that your multi-coproc support code apparently does indeed correctly track and close all these fds, and thus prevents the deadlock issue.


On Wed, 3 Apr 2024, Chet Ramey wrote:

It's straightforward: the coproc process terminates, the shell reaps it, marks it as dead, notifies the user that the process exited, and reaps it before printing the next prompt. I don't observe any different behavior between the default and when compiled for multiple coprocs.

It depends on when the process terminates as to whether you get a prompt back and need to run an additional command before reaping the coproc (macOS, RHEL), which gives you the opportunity to run the `read' command:

Ah, my mistake then - thanks for explaining. I must have been thrown off by the timing, running it with and without an intervening interactive prompt before the read command. When run interactively, an extra 'Enter' (or not) before the read command changes the behavior.

So in that case, this issue (that the shell closes its read-end of the pipe from a reaped coproc, potentially before being able to read the final output) was already there and is not specific to the multi-coproc code.

But in any case, it seems like this is a race then? That is, whether the child process terminates before or after the prompt in question.

$ coproc WC { wc; }
[1] 48057
$ exec {WC[1]}>&-
$ read -u ${WC[0]} X
[1]+  Done                    coproc WC { wc; }
bash: DEBUG warning: cpl_reap: deleting 48057
$ echo $X
0 0 0

(I put in a trace statement to show exactly when the coproc gets reaped and
deallocated.)

Thanks! (for taking the time to play with this)

Though apparently it's still a race here. If you diagram the shell and coproc (child) processes, I think you'll see that your DEBUG statement can also happen _before_ the read command, which would then fail. You can contrive this by adding a small sleep (eg, 0.1s) at the end of execute_builtin_or_function (in execute_cmd.c), just before it returns.

Eg:

        diff --git a/execute_cmd.c b/execute_cmd.c
        index ed1063e..c72f322 100644
        --- a/execute_cmd.c
        +++ b/execute_cmd.c
        @@ -5535,6 +5535,7 @@ execute_builtin_or_function (words, builtin, var, 
redirects,
           discard_unwind_frame ("saved_fifos");
         #endif

        +  usleep(100000);
           return (result);
         }


If I do this, I consistently see "read: X: invalid file descriptor specification" running the above 4-line "coproc WC" example in a script, demonstrating that there is no guarantee that the read command will start before the WC coproc is reaped and {WC[0]} is closed, even though it's the next statement after 'exec {WC[1]}>&-'.

But (as I'll try to show) you can trip up on this race even without slowing down bash itself artificially.


I can't reproduce your results with non-interactive shells, either, with job control enabled or disabled.

That's fair; let's try it with a script:

        $ cat cope.sh
        #!/bin/bash

        coproc WC { wc; }
        jobs
        exec {WC[1]}>&-
        [[ $1 ]] && sleep "$1"
        jobs
        read -u ${WC[0]} X
        echo $X


Run without sleep, the wc output is seen:

        $ ./cope.sh
        [1]+  Running                 coproc WC { wc; } &
        [1]+  Running                 coproc WC { wc; } &
        0 0 0


Run with a brief sleep after closing the write end, and it breaks:

        $ ./cope.sh .1
        [1]+  Running                 coproc WC { wc; } &
        [1]+  Done                    coproc WC { wc; }
        ./cope.sh: line 8: read: X: invalid file descriptor specification


And, if I run with "0" for a sleep time, it intermittently behaves like either of the above. Racy!


This is a bug. The shell should not automatically close its read pipe to a coprocess that has terminated -- it should stay open to read the final output, and the user should be responsible for closing the read end explicitly.

How long should the shell defer deallocating the coproc after the process terminates?

I only offer my opinion here, but it strikes me that it definitely should _not_ be based on an amount of _time_. That's inherently racy.

In the above example, there is only a single line to read; but the general case there may be many 'records' sitting in the pipe waiting to be processed, and processing each record may take an arbitrary amount of time. (Consider a coproc containing a sort command, for example, that produces all its output lines at once after it sees EOF, and then terminates.)


Zack illustrated basically the same point with his example:

        exec {fd}< <( some command )
        while IFS='' read -r line <&"${fd}"; do
          # do stuff
        done
        {fd}<&-

A process-substitution open to the shell like this is effectively a one-ended coproc (though not in the jobs list), and it behaves reliably here because the user can count on {fd} to remain open even after the child process terminates.


So, the user can determine when the coproc fds are no longer needed, whether that's when EOF is hit trying to read from the coproc, or whatever other condition.


Personally I like the idea of 'closing' a coproc explicitly, but if it's a bother to add options to the coproc keyword, then I would say just let the user be responsible for closing the fds. Once the coproc has terminated _and_ the coproc's fds are closed, then the coproc can be deallocated.

Apparently there is already some detection in there for when the coproc fds get closed, as the {NAME[@]} fd array members get set to -1 automatically when when you do, eg, 'exec {NAME[0]}<&-'. So perhaps this won't be a radical change.

Alternatively (or, additionally), you could interpret 'unset NAME' for a coproc to mean "deallocate the coproc." That is, close the {NAME[@]} fds, unset the NAME variable, and remove any coproc bookkeeping for NAME.

(Though if the coproc child process hasn't terminated on its own yet, still it shouldn't be killed, and perhaps it should remain in the jobs list as a background process until it's done.)

...

[And if you _really_ don't want to defer deallocating a coproc after it terminates, I suppose you can go ahead and deallocate it in terms of removing it from the jobs list and dropping any bookkeeping for it - as long as you leave the fds and fd variables intact for the user.

It's a little dicey, but in theory it should not lead to deadlock, even if copies of these (now untracked) reaped-coproc pipe fds end up in other coprocs. Why not? Because (1) this coproc is dead and thus won't be trying to read anymore from its (now closed) end, and (2) reads from the shell's end will not block since there are no copies of the coproc's write end open anymore.

Still, it'd be cleaner to defer deallocation, to avoid these stray (albeit harmless) copies of fds making their way into new coprocs.]


What should it do to make sure that the variables don't hang around with invalid file descriptors?

First, just to be clear, the fds to/from the coproc pipes are not invalid when the coproc terminates (you can still read from them); they are only invalid after they are closed.

The surprising bit is when they become invalid unexpectedly (from the point of view of the user) because the shell closes them automatically, at the somewhat arbitrary timing when the coproc is reaped.


Second, why is it a problem if the variables keep their (invalid) fds after closing them, if the user is the one that closed them anyway?

Isn't this how it works with the auto-assigned fd redirections?

Eg:

        $ exec {d}<.
        $ echo $d
        10
        $ exec {d}<&-
        $ echo $d
        10


But, as noted, bash apparently already ensures that the variables don't hang around with invalid file descriptors, as once you close them the corresponding variable gets updated to "-1".


Or should the user be responsible for unsetting the array variable too? (That's never been a requirement, obviously.)

On the one hand, bash is already messing with the coproc array variables (setting the values to -1 when the user closes the fds), so it's not really a stretch in my mind for bash to unset the whole variable when the coproc is deallocated. On the other hand, as mentioned above, bash leaves automatically allocated fd variables intact after the user explicitly closes them.

So I guess either way seems reasonable.

If the user has explicitly closed both fd ends for a coproc, it should not be a surprise to the user either way - whether the variable gets unset automatically, or whether it remains with (-1 -1).

Since you are already unsetting the variable when the coproc is deallocated though, I'd say it's fine to keep doing that -- just don't deallocate the coproc before the user has closed both fds.


It also invites trouble if the shell variable that holds the fds gets removed unexpectedly when the coprocess terminates. (Suddenly the variable expands to an empty string.) It seems to me that the proper time to clear the coproc variable (if at all) is after the user has explicitly closed both of the fds.

That requires adding more plumbing than I want to,

Your project your call :D

especially since the user can always save the file descriptors of interest into another variable if they want to use them after the coproc terminates.

*Except* that it's inherently a race condition whether the original variables will still be intact to save them.

Even if you attempt to save them immediately:

        coproc X { exit; }
        X_BACKUP=( ${X[@]} )

it's not guaranteed that X_BACKUP=(...) will run before coproc X has been deallocated, and the X variable cleared.

No doubt this hasn't escaped you, but in any case you can see it for yourself if you introduce a small delay in execute_coproc, in the parent, just after the call to make_child:


        diff --git a/execute_cmd.c b/execute_cmd.c
        index ed1063e..5949e3e 100644
        --- a/execute_cmd.c
        +++ b/execute_cmd.c
        @@ -2440,6 +2440,8 @@ execute_coproc (command, pipe_in, pipe_out,
        fds_to_close)

               exit (estat);
             }
        +  else
        +    usleep(100000);

           close (rpipe[1]);
           close (wpipe[0]);


When I try this, X_BACKUP is consistently empty. Though if you add, say, a call to "sleep 0.2" to coproc X, then X_BACKUP consistently gets a copy of X's fds in time.

I am sorry if this sounds contrived, but I hope it demonstrates that closing fds and and unsetting the variable for a coproc automatically when it terminates is fundamentally flawed, because it depends on the arbitrary race timing between the two processes.



*Or* else add an option to the coproc keyword to explicitly close the coproc - which will close both fds and clear the variable.

Not going to add any more options to reserved words; that does more violence to the grammar than I want.

Not sure how you'd feel about using 'unset' on the coproc variable instead. (Though as discussed, I think the coproc terminated + fds manually closed condition is also sufficient.)


.............


Anyway, as far as I'm concerned there's nothing urgent about all this, but (along with the multi-coproc support that you implemented), avoiding the current automatic deallocation behavior would seem to go a long way toward making coproc a correct and generally useful feature.


Thanks for your time!

Carl


PS Zack you're welcome :)

          • ... Zachary Santer
          • ... Zachary Santer
          • ... Carl Edquist
          • ... Andreas Schwab
          • ... Zachary Santer
          • ... Carl Edquist
          • ... Chet Ramey
          • ... Martin D Kealey
          • ... Chet Ramey
          • ... Carl Edquist
  • Re: Examples o... Carl Edquist
    • Re: Examp... Martin D Kealey
      • Re: E... Chet Ramey
        • R... Zachary Santer
          • ... Chet Ramey
        • R... Carl Edquist
          • ... Chet Ramey
          • ... Zachary Santer
          • ... Chet Ramey
          • ... Carl Edquist via Bug reports for the GNU Bourne Again SHell
          • ... Chet Ramey

Reply via email to