Re: "wait" loses signals

2020-02-25 Thread Daniel Colascione
> Date: Mon, 24 Feb 2020 06:44:12 -0800
> From: "Daniel Colascione"
> Message-ID:  
>
>   | That executing traps except in case you lose one rare race is painfully
>   | obvious.
>
> Maybe you misunderstand the issue, no traps are lost, if they were
> that would indeed be a bug, the trap will always be executed in the
> cases in question, the only issue is when that happens.

They're not executed before the wait as is supposed to happen though, so
we can hang when we shouldn't.

>   | This opposition to doing more than the bare minimum that the standard
>   | requires makes this task all the much harder.
>
> I am not at all opposed to doing more than the standard requires, the
> shell I maintain does more (not nearly as many addons as bash, but
> considerably more than bash - and in some areas we're ahead, we already
> have a wait command where it is possible to wait for any one of a set
> of processes (or jobs) and be told which one completed, for example).
>
> I'm also not opposed to doing less when the standard is nonsense, which
> it is in a couple of places.
>
> But "I want x" or "I think it should be y" aren't good enough reasons to
> change something, and making the shell useful for (very primitive) IPC
> isn't a good reason for making updates.

Yes, it is, because people find this style of IPC useful today, and it's
worthwhile to make this use reliable.

>   | Making people go elsewhere *on purpose* by refusing to fix bugs is not
>   | good software engineering.
>
> Of course.   I don't see a bug.

You can interpret any random bit of brokenness as a feature. Whether the
behavior is a "bug" or not is irrelevant: bash _should_ be handling these
traps as early as possible, because that simplifies the programming model
without hurting anything else.

>   | We're talking about fixing an existing shell feature, not adding a new one.
>
> OK, here's an alternative, I want the shell to be able to do arithmetic on
> arbitrarily large (and small) numbers.   All that is needed to fix it is
> to link in the bignum library and use it (and extend the parser a little to
> handle real numbers).

This situation is more like bash supporting arbitrary-precision addition
and giving the wrong answer when the number is prime. "Oh, we never
promised support for _prime_ sums. It's not a bug. It's just a thing the
shell doesn't do."


>   | This moralistic outlook is not helpful. It doesn't *matter* whether a
>   | program is right or wrong or making unjustified assumptions or not.
>
> That is unbelievable.   That is all that matters.   If the program is
> wrong, the program needs to be fixed, not the world altered so that the
> program suddenly works.

You want to increase the number of correct programs in the world.
Sometimes the fix is to declare incorrect programs broken and have people
fix them. Other times, in situations like this one, it's better to just
change the infrastructure so that the program is correct.

>
>   | Punishing programs does not make the world better.
>
> It does.   The bad ones fail, and are replaced by better ones.

Computer security was even more of a horrible nightmare than it is today
back when people had this attitude. "Why should we use stack hardening? If
a program writes beyond the end of an array, that's a bug in the program."
Nice sentiment. Doesn't work.




Re: "wait" loses signals

2020-02-25 Thread Robert Elz
Date: Mon, 24 Feb 2020 06:44:12 -0800
From: "Daniel Colascione"
Message-ID:  

  | That executing traps except in case you lose one rare race is painfully
  | obvious.

Maybe you misunderstand the issue, no traps are lost, if they were
that would indeed be a bug, the trap will always be executed in the
cases in question, the only issue is when that happens.

  | I refuse to let the standard cap the quality of a shell's implementation.

So you should.   No-one is suggesting that there is any reason that
any shell cannot do this better, if the authors feel the cost trade
off is worth the benefit.

  | Missing signals [...]

Since this appears to be based upon a misunderstanding, I will ignore that.

  | A standard is a bare minimum.

That's close enough to correct.

  | This opposition to doing more than the bare minimum that the standard
  | requires makes this task all the much harder.

I am not at all opposed to doing more than the standard requires, the
shell I maintain does more (not nearly as many addons as bash, but
considerably more than bash - and in some areas we're ahead, we already
have a wait command where it is possible to wait for any one of a set
of processes (or jobs) and be told which one completed, for example).

I'm also not opposed to doing less when the standard is nonsense, which
it is in a couple of places.

But "I want x" or "I think it should be y" aren't good enough reasons to
change something, and making the shell useful for (very primitive) IPC
isn't a good reason for making updates.

  | Making people go elsewhere *on purpose* by refusing to fix bugs is not
  | good software engineering.

Of course.   I don't see a bug.

  | We're talking about fixing an existing shell feature, not adding a new one.

OK, here's an alternative, I want the shell to be able to do arithmetic on
arbitrarily large (and small) numbers.   All that is needed to fix it is
to link in the bignum library and use it (and extend the parser a little to
handle real numbers).   Can I call it a bug that bash only does arithmetic
on integers, and has a limit on their size (64 bits I believe), and demand
that Chet fix it?   Know that I am perfectly aware that the standard doesn't
require what I want, but remember that is the bare minimum, we can do better
(bash already does, 32 bits is all that is required, as I remember).

  | This moralistic outlook is not helpful. It doesn't *matter* whether a
  | program is right or wrong or making unjustified assumptions or not.

That is unbelievable.   That is all that matters.   If the program is
wrong, the program needs to be fixed, not the world altered so that the
program suddenly works.

  | Punishing programs does not make the world better.

It does.   The bad ones fail, and are replaced by better ones.

kre




Re: "wait" loses signals

2020-02-24 Thread Harald van Dijk

On 24/02/2020 08:59, Robert Elz wrote:

har...@gigawatt.nl said:
   | In the same way, I think that except when overridden by 2.11, the "when"
   | in "Otherwise, the argument action shall be read and executed by the
   | shell when one of the corresponding conditions arises." should be
   | interpreted as "as soon as".

The only way to do that literally would be to run the trap from the signal
handler, as that is "as soon as" the condition arises.   But I think we all
know that is simply not possible.   So let's read that as "as soon as
possible after" instead.


Sure.


   That's getting more reasonable, but someone needs
to decide just what is possible - will running the trap handler mess up the
shell's internal state while a new command is parsed and executed?

Eg: what if we had
VAR=$(grep  -c some_string file*.c)
and a (trapped) signal arrives while grep is running (more correctly, while
the process running the command substitution, which runs grep, is running).
We know we cannot interrupt the wait for that foreground process to run the
trap handler, so we don't - but do we execute the trap handler before we
assign the answer to VAR ?


Although 2.11 that you referred to states "When a signal for which a 
trap has been set is received while the shell is waiting for the 
completion of a utility executing a foreground command", that is not 
what any shell implements. Instead, what shells implement is more like 
"while the shell is waiting for the completion of a foreground command". 
Consider for instance (sleep 5): the sleep command run in a subshell. 
The parent shell is not waiting for the completion of a utility 
executing a foreground command, the parent shell is waiting for the 
completion of the subshell, which is not a utility. Nevertheless, shells 
do not run any trap action until after the subshell has completed.


This is just sloppy wording in the standard. It is probably written this 
way so that it is clear that given { foo; bar; }, if a signal is 
received while foo is running, any trap action runs before bar. The 
whole compound command shouldn't be considered the foreground command, 
only foo should be.


In your example, I would expect the whole of VAR=$(...) to be considered 
the foreground command that the shell is waiting for, and that is what 
almost all shells do. A notable exception is zsh.



This kind of thing is why shells in general only normally even look to
see if there is a trap handler waiting to run after completing executing
commands, not in the middle of one.

The relevance of this is that if a signal arrives while the wait command
is executing (or as Chet suggested, while doing whatever housekeeping is
needed to prepare to run it, like looking to see what command comes next)
but before the relevant wait*() system call is running, the trap won't
be run until after the wait command completes.

That's the way shells have always worked, and the way the standard (for that
very reason) says can be relied upon by scripts - which is much of its
purpose, to tell script writers what they can expect will work, and what
will not necessarily work.


You say "have always worked", but I'd like to point out that this whole 
thing started because I was looking at code that Herbert Xu had changed 
in dash to avoid this race back in 2009. That's over 10 years ago now. 
The behaviour of dash before that, and several shells now, can not, or 
at least not now, be said to be how shells have always worked.


Cheers,
Harald van Dijk



Re: "wait" loses signals

2020-02-24 Thread Denys Vlasenko

On 2/24/20 5:18 PM, Chet Ramey wrote:

The first case is trickier: there's always going to be a window between
the time the shell checks for pending traps and the time the wait builtin
starts to run. You can't really close it unless you're willing to run the
trap out of the signal handler, which everyone agrees is a bad idea, but
you can squeeze it down to practically nothing.


dash uses something along these lines:

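sigset_t all, oldset;   /* operand names assumed; the archive dropped them */
sigfillset(&all);
sigprocmask(SIG_SETMASK, &all, &oldset);   /* block everything */
while (!pending_sig)
        sigsuspend(&oldset);               /* sleep with the old mask */
sigprocmask(SIG_SETMASK, &oldset, NULL);
if (pending_sig)
        handle_signals(pending_sig);
pid = waitpid(... WNOHANG);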

It sleeps in sigsuspend(), not in waitpid(). This way we wait for both
signals *and* children (by virtue of getting SIGCHLD for them).




Re: "wait" loses signals

2020-02-24 Thread Chet Ramey
On 2/24/20 7:58 AM, Daniel Colascione wrote:

> No, it's not that much trouble to fix the bug. The techniques for fixing
> this kind of signal race are well-known. In particular, instead of
> waitpid, you use a self-pipe and signal the pipe in the signal handler,
> and you have a signal handler for SIGCHLD. 

You've just substituted a real IPC mechanism (pipes) for the one people
are trying to make signals into.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: "wait" loses signals

2020-02-24 Thread Chet Ramey
On 2/24/20 3:59 AM, Robert Elz wrote:

> The relevance of this is that if a signal arrives while the wait command
> is executing (or as Chet suggested, while doing whatever housekeeping is
> needed to prepare to run it, like looking to see what command comes next)
> but before the relevant wait*() system call is running, the trap won't
> be run until after the wait command completes.

There are two separate cases here: if the signal arrives before the wait
command has begun executing (during `housekeeping') or if it arrives
after the wait command has begun running but before it calls whatever
system call it uses to wait for children.

The second case is relatively easy to solve; Jilles wrote a message
detailing the alternatives. Bash uses the longjmp-out-of-the-trap-signal-
handler mechanism. The trap handler only has to know that the wait builtin
is running and that there's a valid saved environment to longjmp to.

The first case is trickier: there's always going to be a window between
the time the shell checks for pending traps and the time the wait builtin
starts to run. You can't really close it unless you're willing to run the
trap out of the signal handler, which everyone agrees is a bad idea, but
you can squeeze it down to practically nothing.

I think I've got a way to close that and make signals that arrive in that
first case act as if they arrived `while the shell is waiting by means of
the wait utility'. It's not much code and not disruptive.

With that, bash runs the original test script (100,000 iterations) on RHEL7
and macOS without a `stray' sleep. It's in the git devel branch.

I'm going to defer the question of whether or not that's the `right' thing
to do -- people have been trying to make signals into an IPC mechanism
since Berkeley introduced `reliable signals'.

Can we all take a breath now?

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: "wait" loses signals

2020-02-24 Thread Daniel Colascione
> Date: Mon, 24 Feb 2020 04:58:31 -0800
> From: "Daniel Colascione"
> Message-ID:  <07d1441d41280e6f9535048d6485.squir...@dancol.org>
>
>   | That is a poor excuse for not fixing bugs.
>
> Only if they are bugs.

That executing traps except in case you lose one rare race is painfully
obvious.

>
>   | Maybe you can torture the standards into confessing that this
>   | behavior isn't a bug.
>
> No torture required.  Once again, the standard documents the way users
> can expect shells to behave.

I refuse to let the standard cap the quality of a shell's implementation.
Missing signals this way is pure negative. It doesn't add to any
capability or help any user. It can only make computing unreliable and
hurt real users trying to automate things with shell.

> That is what a standard is - a common set
> of agreed operations

A standard is a bare minimum.

> attempt to get shells to change this way of
> working, and if you can get a suitable set to agree, and implement something
> new, that more meets your needs, then perhaps that might one day become the
> standard,

This opposition to doing more than the bare minimum that the standard
requires makes this task all the much harder.

>   | This behavior nevertheless surprises people
>
> Lots of things surprise people.

Sometimes people deserve to be surprised. This isn't one of those times.

>   | and nevertheless precludes various things
>   | people want to do with a shell.
>
> That was my point, that you just labelled a poor excuse.   Not everything
> is suitable for implementation in sh.   Sometimes you simply have to go
> elsewhere.

Making people go elsewhere *on purpose* by refusing to fix bugs is not
good software engineering.

> Wanting to do it in shell doesn't make it reasonable or
> possible.

It is reasonable and possible. All that's needed is to make an existing
operation that's almost perfectly reliable in fact perfectly reliable, and
as I've mentioned, it's not that hard.

> I want the shell to feed my dog, where is the dogfood option?

We're talking about fixing an existing shell feature, not adding a new one.

>   | Don't you think it's better that programs
>   | work reliably than that they don't?
>
> Yes, when they are written correctly.

By fixing this bug, we make a class of programs correct automatically.

>
>   | Of course something working intuitively 99.9% of the time and
>   | hanging 0.1% of the time is a bug.
>
> Nonsense.   An alternative explanation is that your intuition is wrong,
> and that it often works that way is just by chance.

We're talking about a documented feature that users expect to work a
certain way and that almost always *does* work that way and that diverges
from this behavior only under rare circumstances. Not the same as spacebar
heating.

> The program is
> broken because it is making unjustified assumptions about how things are
> specified to work.

This moralistic outlook is not helpful. It doesn't *matter* whether a
program is right or wrong or making unjustified assumptions or not.
Punishing programs does not make the world better.
When a piece of infrastructure can transform these programs from incorrect
to correct at next to zero cost, it behooves that infrastructure component
to do that.

> This is the kind of common error that people who
> program (in any language) by guesswork often make "I saw Fred did this,
> and I tried it, and it worked for me like I thought it would, so it
> must do this similar thing like I think it will too".   Rubbish.

Ever hear of the "pit of success" [1]? It's the idea that software gets better
when you make the intuitive thing happen to be the correct thing. Why
should we require a degree of cleverness greater than what a domain
requires? Why *not* make it so that, to the greatest extent possible,
"I saw Fred do this" leads people to good patterns? Like I
said before, making things difficult on purpose doesn't actually achieve
anything.

[1] https://docs.microsoft.com/en-us/archive/blogs/brada/the-pit-of-success

>
>   | I've never understood the philosophy of people who want to leave
>   | bugs unfixed.
>
> Nor have I, except sometimes perhaps when it comes to costs.   But the
> issue here is whether this is a bug.  Your belief that it is does not make
> it so.

Your belief that this behavior is acceptable doesn't make it so --- except
under a pointlessly literal interpretation of the standards.

>   | No, it's not that much trouble to fix the bug.
>
> It isn't, if it needs fixing - but any fix for this will slow the shell
> (for what that matters, but some people care).  Further there are simpler
> cheaper techniques than the one described.

The fix for this issue will not meaningfully affect the speed of the
shell. Instead of waiting on waitpid directly, we wait on a pipe. Plenty
of programs do this already. Micro-optimizing for system call count will
hardly slow the shell: other factors matter a lot 

Re: "wait" loses signals

2020-02-24 Thread Robert Elz
Date: Mon, 24 Feb 2020 04:58:31 -0800
From: "Daniel Colascione"
Message-ID:  <07d1441d41280e6f9535048d6485.squir...@dancol.org>

  | That is a poor excuse for not fixing bugs.

Only if they are bugs.

  | Maybe you can torture the standards into confessing that this
  | behavior isn't a bug.

No torture required.  Once again, the standard documents the way users
can expect shells to behave.  That is what a standard is - a common set
of agreed operations (or whatever is appropriate for the object being
standardised).   It does not (or should not) ever invent new stuff and
require it.   Shells have always worked this way, so that is how the
standard is written - that is what users can expect to happen (that is why
it is called a "standard" after all).

Once again, you are free to attempt to get shells to change this way of
working, and if you can get a suitable set to agree, and implement something
new, that more meets your needs, then perhaps that might one day become the
standard, and later appear in the standards document.   New and/or changed
features do happen, especially when they don't break backwards compatibility,
which this wouldn't.

  | This behavior nevertheless surprises people

Lots of things surprise people.

  | and nevertheless precludes various things
  | people want to do with a shell.

That was my point, that you just labelled a poor excuse.   Not everything
is suitable for implementation in sh.   Sometimes you simply have to go
elsewhere.  Wanting to do it in shell doesn't make it reasonable or possible.

I want the shell to feed my dog, where is the dogfood option?

  | Don't you think it's better that programs
  | work reliably than that they don't?

Yes, when they are written correctly.

  | Of course something working intuitively 99.9% of the time and
  | hanging 0.1% of the time is a bug.

Nonsense.   An alternative explanation is that your intuition is wrong,
and that it often works that way is just by chance.   The program is
broken because it is making unjustified assumptions about how things are
specified to work.   This is the kind of common error that people who
program (in any language) by guesswork often make "I saw Fred did this,
and I tried it, and it worked for me like I thought it would, so it
must do this similar thing like I think it will too".   Rubbish.

  | I've never understood the philosophy of people who want to leave
  | bugs unfixed.

Nor have I, except sometimes perhaps when it comes to costs.   But the
issue here is whether this is a bug.  Your belief that it is does not make
it so.

  | No, it's not that much trouble to fix the bug.

It isn't, if it needs fixing - but any fix for this will slow the shell
(for what that matters, but some people care).  Further there are simpler
cheaper techniques than the one described.

  | If we had a pwaitpid (like pselect) we could use that too.

Yes, if.   If that existed a fix would be almost cost free.  If.
I suspect that before you can get bash (note: I am no authority and have
no voice in these decisions, I work on a different shell) to make use
of something like that it would need to be implemented in quite a lot
of systems, including the commercial ones, which tend to be very conservative
about adding new features for fun.

kre




Re: "wait" loses signals

2020-02-24 Thread Daniel Colascione


> There are lots of programming languages around, they each have their particular
> niche - the reason their inventors created them in the first place.  Use an
> appropriate one, rather than attempting to shoehorn some feature that is needed
> into a language that was never intended for it - just because you happen to
> be a big fan of that language.   Spread your wings, learn a new one

That is a poor excuse for not fixing bugs. Maybe you can torture the
standards into confessing that this behavior isn't a bug. This behavior
nevertheless surprises people and nevertheless precludes various things
people want to do with a shell. Don't you think it's better that programs
work reliably than that they don't? Of course something working
intuitively 99.9% of the time and hanging 0.1% of the time is a bug. It's
not appropriate to treat that 0.1% hang as some kind of cosmic punishment
for using shell in a manner you find inappropriate: remember when we
believed in mechanism, not policy? Nor is the presence of the bug in other
shells adequate justification for leaving this one in a bad state. I've
never understood the philosophy of people who want to leave bugs unfixed.

No, it's not that much trouble to fix the bug. The techniques for fixing
this kind of signal race are well-known. In particular, instead of
waitpid, you use a self-pipe and signal the pipe in the signal handler,
and you have a signal handler for SIGCHLD. If we had a pwaitpid (like
pselect) we could use that too. If I could get Chet to look at my patches,
I'd fix it myself.




Re: "wait" loses signals

2020-02-24 Thread Robert Elz
Date: Mon, 24 Feb 2020 11:50:55 +0100
From: Denys Vlasenko
Message-ID:  <47762f41-e393-30cd-50ed-43c6bdd29...@redhat.com>

  | This is racy. Even if you try to code it as tightly as possible:

Absolutely, I agree.   The question is more whether it really matters.

  | Standard does not say that. It says "when the shell is waiting for an
  | asynchronous command to complete", it does not say "when the shell is
  | waiting in a waitpid() syscall".

That's because the standard has no notion of "system calls", just functions,
but the shell is not actually waiting (it is doing something else) until
the system call causes it to pause if the desired (or any) child is not
ready for reaping.

  | Yes, you are right, you can argue that shell is minimally fulfilling
  | standard's requirement if it does something like my code example.

It doesn't even need to do that.   As I said, the standard's primary purpose
is to advise script writers what they can depend upon the shell providing.
And a race free wrt traps wait utility is not one of those things.  That's
because what scripts can rely upon is based upon what shells implement (or
implemented at the time - with some more recent additions for some more
modern functionality that has been widely adopted).

Even now, as was demonstrated, most shells have this "issue" - hence the
standard simply cannot tell users that they can rely on something else.
Any attempt to read it otherwise than that is simply wrong, and obviously
so (though sometimes it is possible to argue that the wording used does
not express the intent obviously enough - or occasionally - at all, but
when that happens, all you will ever get as the best possible result is
corrected wording that says what it intended to say in the first place).

The standard also serves to advise shell authors what they need to do to
provide a shell which should run all conformant shell applications, but it
would be grossly unfair (and improper) to require of new shells something
that old ones didn't do.  But that side of it is less relevant to this
discussion, except that it doesn't tell shell authors to make sure there
are no race conditions wrt traps in the wait utility (it would do that in
quite different language than this, but that would be the point, if it were
there).

  | I am arguing that it can be made better:

That part is arguable

  | it can be coded so that signal has no time window to arrive before
  | waitpid() but have its trap delayed to after "wait" builtin ends
  | (which might be "never", mind you).

It can be so coded, but when done (correctly, and assuming a trapped signal
has arrived) it won't be never, the signal will interrupt the sys call that
actually pauses (which will most likely not be wait*() in this case, but that's
irrelevant) and the wait would correctly exit.  A few shells have done that.

The question is whether it is worth going to that extra effort - or in other
words, is it really better.

As best I can tell, it only really matters to shell scripts attempting to
use signals/traps as an IPC mechanism, and that I simply don't believe they
should be doing - programs that need that kind of functionality should be
written in a language that provides more suitable mechanisms (and usually
not only for simple one bit message passing that a signal offers).

There are lots of programming languages around, they each have their particular
niche - the reason their inventors created them in the first place.  Use an
appropriate one, rather than attempting to shoehorn some feature that is needed
into a language that was never intended for it - just because you happen to
be a big fan of that language.   Spread your wings, learn a new one - the hard
part about any programming isn't the programming language, it is getting the
desired concepts and structures straight - do that and any competent programmer
can make a working program in any suitable language (ie: not expecting anyone
to write an operating system in COBOL) fairly quickly.   They'll make it
better after they get used to the idioms of the language, but providing
the method needed to solve the problem is known first (that's usually the
hard part, for anything non trivial) the actual coding into a working, if
not necessarily ideal, form is simple.

kre




Re: "wait" loses signals

2020-02-24 Thread Denys Vlasenko

On 2/24/20 9:59 AM, Robert Elz wrote:

And that is, when the wait/waitpid/wait3/wait4/waitid/wait6 (whatever the
shell  uses) system call returns EINTR, the wait utility exited with a
status indicating it was interrupted by that signal (status > 128 means
128+SIGno) and runs the trap.


This is racy. Even if you try to code it as tightly as possible:

   if (got_sigs) { handle signals }
   got_sigs = 0;
   pid = waitpid(...);  /* without WNOHANG */
   if (pid < 0 && errno == EINTR) { handle signals }

since signals can be delivered not only while waitpid() syscall
is in kernel, but also when we are only about to enter the kernel
- and in this case, the shell's sighandler will set the flag variable,
but then we enter the kernel *and sleep*.


Because that is what shells actually did - the alternative being to
simply restart the wait on EINTR like many other system calls that are
interrupted by signals are conventionally restarted.

Like it or not, that's what shells did, what most still do, and what
the standard says must be done.


Standard does not say that. It says "when the shell is waiting for an
asynchronous command to complete", it does not say "when the shell is
waiting in a waitpid() syscall".

Yes, you are right, you can argue that shell is minimally fulfilling
standard's requirement if it does something like my code example.

I am arguing that it can be made better: it can be coded so that
signal has no time window to arrive before waitpid() but have its
trap delayed to after "wait" builtin ends (which might be "never", mind you).




Re: "wait" loses signals

2020-02-24 Thread Robert Elz
Date: Fri, 21 Feb 2020 10:07:25 -0500
From: Chet Ramey
Message-ID:  

  | That's just not reasonable. You're saying signals that are received before
  | the wait builtin begins executing (say, while the command is being parsed,
  | or the shell is doing some other bookkeeping task) should be considered
  | to have arrived while the wait builtin is executing. I'm pretty sure that's
  | not consistent with the letter or the spirit of the standard.

It quite clearly isn't consistent, what the standard says is:

 When the shell is waiting, by means of the wait utility, for
 asynchronous commands to complete, the reception of a signal for
 which a trap has been set shall cause the wait utility to return
 immediately with an exit status >128, immediately after which the
 trap associated with that signal shall be taken.

Note: "when the shell is waiting for an asynchronous command to complete"
(when that happens as a result of the user/script executing the wait utility)
then ...

What Denys is failing to realise, is that the standard describes what shells
do (or more accurately perhaps, did, in the late 1980's or early 1990's)
not what someone might want them to do.

And that is, when the wait/waitpid/wait3/wait4/waitid/wait6 (whatever the
shell  uses) system call returns EINTR, the wait utility exited with a
status indicating it was interrupted by that signal (status > 128 means 
128+SIGno) and runs the trap.

Because that is what shells actually did - the alternative being to
simply restart the wait on EINTR like many other system calls that are
interrupted by signals are conventionally restarted.

Like it or not, that's what shells did, what most still do, and what
the standard says must be done.

Apart from that, and not interrupting a wait for a foreground process,
the standard says very little about when traps should be run, and sorry
Harald, but your "as soon as" from ...

har...@gigawatt.nl said:
  | In the same way, I think that except when overridden by 2.11, the "when"
  | in "Otherwise, the argument action shall be read and executed by the
  | shell when one of the corresponding conditions arises." should be
  | interpreted as "as soon as". 

The only way to do that literally would be to run the trap from the signal
handler, as that is "as soon as" the condition arises.   But I think we all
know that is simply not possible.   So let's read that as "as soon as
possible after" instead.   That's getting more reasonable, but someone needs
to decide just what is possible - will running the trap handler mess up the
shell's internal state while a new command is parsed and executed?

Eg: what if we had
VAR=$(grep  -c some_string file*.c)
and a (trapped) signal arrives while grep is running (more correctly, while
the process running the command substitution, which runs grep, is running).
We know we cannot interrupt the wait for that foreground process to run the
trap handler, so we don't - but do we execute the trap handler before we
assign the answer to VAR ?

This kind of thing is why shells in general normally only look to see
if there is a trap handler waiting to run after a command has finished
executing, not in the middle of one.
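
What a particular shell does in the VAR=$(...) case can be checked with a
small experiment. This is a hedged sketch, not a statement of what any
shell must do: the names (seen, the helper subshell) and the delays are
made up for illustration, and the result is exactly the shell-dependent
part under discussion.

```shell
VAR=old
seen=unset
# The trap action records the value of VAR at the moment it runs.
trap 'seen=$VAR' USR1

# Illustrative helper: signal the main shell while the command
# substitution below is still running.
( sleep 1; kill -s USR1 $$ ) &

VAR=$(sleep 2; echo new)
wait    # reap the helper

echo "trap saw VAR=$seen, VAR is now $VAR"
```

If the trap reports "new", the shell deferred the trap until after the
assignment; "old" would mean it ran the trap before assigning.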

The relevance of this is that if a signal arrives while the wait command
is executing (or as Chet suggested, while doing whatever housekeeping is
needed to prepare to run it, like looking to see what command comes next)
but before the relevant wait*() system call is running, the trap won't
be run until after the wait command completes.

That's the way shells have always worked, and the way the standard (for that
very reason) says can be relied upon by scripts - which is much of its
purpose, to tell script writers what they can expect will work, and what
will not necessarily work.

Now the standard doesn't preclude a shell from looking for pending traps
as frequently as it wants to, every second line of C code in the shell could
be
if (traps_pending) run_trap_handler();

But most shell authors (I believe) wouldn't consider that reasonable.

The standard also doesn't preclude a shell from taking extra measures to
push the arrival of a signal in the wait utility down to occur in the wait
system call (or whatever replaces it).   Old shells didn't do that, as there
simply was no mechanism for that, and using SIGCHLD was always problematic
because of its quite different implementation on different (now ancient)
systems, hence we have what we have.   The standard is not a legislature,
and does not change the rules just because what is there doesn't look
reasonable, or you don't like it.

If you want things changed, convince the major shell maintainers that this
race condition is something they should make their shell go slower to
fix (because that's really all it takes on modern systems) and wait for
them to comply.   When most major shells (perhaps all major shells, and
some of the others) have implemented what you 

Re: "wait" loses signals

2020-02-21 Thread Denys Vlasenko




On 2/21/20 4:07 PM, Chet Ramey wrote:
> On 2/21/20 9:44 AM, Denys Vlasenko wrote:
>
>>>> Yes, and here we are "after command", specifically after "{...} &" command.
>>>> Since we got a trapped signal, we must run its trap.
>>>
>>> Did you look at the scenario in my message?
>>
>> What scenario?
>
> The scenario in the message you replied to.
>
>> As I said, there are just two possibilities:
>> signal is received before the point when shell checks for received
>> signals after "{...} &" command;
>> or signal is received after that point, and thus signal is
>> considered to be received "inside wait builtin".
>
> That's just not reasonable.

Yes it is.

> You're saying signals that are received before
> the wait builtin begins executing (say, while the command is being parsed,
> or the shell is doing some other bookkeeping task) should be considered
> to have arrived while the wait builtin is executing.

OF COURSE! How else do you think this can possibly be seen?

> I'm pretty sure that's
> not consistent with the letter or the spirit of the standard.

IOW, you think that between "command 1 finished executing"
and "command 2 starts executing" there can be sort of signal black hole
time period, where signals can be simply ignored.

Now *this* is just not reasonable, since this would make traps
unreliable.




Re: "wait" loses signals

2020-02-21 Thread Chet Ramey
On 2/21/20 9:44 AM, Denys Vlasenko wrote:

>>> Yes, and here we are "after command", specifically after "{...} &" command.
>>> Since we got a trapped signal, we must run its trap.
>>
>> Did you look at the scenario in my message?
> 
> What scenario?

The scenario in the message you replied to.

> As I said, there are just two possibilities:
> signal is received before the point when shell checks for received
> signals after "{...} &" command;
> or signal is received after that point, and thus signal is
> considered to be received "inside wait builtin".

That's just not reasonable. You're saying signals that are received before
the wait builtin begins executing (say, while the command is being parsed,
or the shell is doing some other bookkeeping task) should be considered
to have arrived while the wait builtin is executing. I'm pretty sure that's
not consistent with the letter or the spirit of the standard.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: "wait" loses signals

2020-02-21 Thread Denys Vlasenko

On 2/20/20 4:27 PM, Chet Ramey wrote:
> On 2/20/20 3:02 AM, Denys Vlasenko wrote:
>> On 2/19/20 9:30 PM, Chet Ramey wrote:
>>> On 2/19/20 5:29 AM, Denys Vlasenko wrote:
>>>> A bug report from Harald van Dijk:
>>>>
>>>> test2.sh:
>>>> trap 'kill $!; exit' TERM
>>>> { kill $$; exec sleep 9; } &
>>>> wait $!
>>>>
>>>> The above script ought to exit quickly, and not leave a stray
>>>> "sleep" child:
>>>> (1) if "kill $$" signal is delivered before "wait",
>>>> then TERM trap will kill the child, and exit.
>>>
>>> This strikes me as a shaky assumption, dependent on when the shell receives
>>> the SIGTERM and when it runs traps.
>>
>> The undisputable fact is that after shell forks a child
>> to run the "{...} &" subshell, it will receive the SIGTERM signal.
>>
>> And since it has a trap for it, it should be run.
>>
>>> (There's nothing in POSIX that says
>>> when pending traps are processed. Bash runs them after commands.)
>>
>> Yes, and here we are "after command", specifically after "{...} &" command.
>> Since we got a trapped signal, we must run its trap.
>
> Did you look at the scenario in my message?

What scenario?

As I said, there are just two possibilities:
signal is received before the point when shell checks for received
signals after "{...} &" command;
or signal is received after that point, and thus signal is
considered to be received "inside wait builtin".

In both cases, trap should be run.

> Keep in mind that you can't run the trap out of the signal handler.

Yes, running anything remotely complex out of signal handlers is
a bad idea: signals can arrive somewhere in the middle of stdio, or
memory allocation, or something similarly critical. Reentering one of
those can deadlock.

Properly-written programs are careful to record
signal reception in a flag variable, or a pipe, etc,
then return from signal handler, and act on it later, not in a signal handler.




Re: "wait" loses signals

2020-02-20 Thread Harald van Dijk

On 20/02/2020 15:55, Robert Elz wrote:
> Date: Thu, 20 Feb 2020 09:16:05 +
> From: Harald van Dijk
> Message-ID:
>
>   | In that case, I think we can interpret the "when" in the description
>   | of the trap command literally except when 2.11 overrides it.
>
> I think it should be interpreted just like its normal English usage,
> as in:
>
> when I win the lottery I am going to buy a Ferrari
> or
> I am going to buy a Ferrari when I win the lottery
>
> (which both say the same thing).

These are both ambiguous statements. The meaning of both depends on
context and emphasis, and because context and emphasis are missing in a
standalone written sentence, we are left to infer it. The word order may
lead to a different inference for the two sentences.

> It doesn't mean that the instant the lottery winnings arrive (tomorrow
> please!) I will be at the luxury imported car dealers, rather it states
> a pre-condition which will trigger an event which is to follow, sometime,
> thereafter.

I can see at least three different meanings.

A: Jake bought a Porsche when he won the lottery.
B: When I win the lottery, I am going to buy a Ferrari. [if/after]

A: What are you going to do when you win the lottery?
B: When I win the lottery, I am going to buy a Ferrari. [as soon as]

A: How come you have five Ferraris in your garage?
B: When I win the lottery, I am going to buy a Ferrari. [whenever;
   said by someone who has already won the lottery five times]

> Thus
> When one of the corresponding conditions arises (standards
> speak for "when a signal has been delivered") the argument
> action shall be read and executed...
>
> is "sometime after a signal has been delivered, run the trap action".

Based on how the word is used elsewhere in the standard, I think the "as
soon as" meaning is more likely here. Two random examples elsewhere from
the standard:

> File Read, Write, and Creation
>
> When a file that does not exist is created, [...]
>
> 1. The user ID of the file shall be set to the effective user ID of the
> calling process.

It would be absurd to claim that the user ID might be initially set to
some completely unrelated user ID, and then changed to the effective
user ID of the calling process some time later.

> 2.5.1 Positional Parameters
>
> [...] Positional parameters are initially assigned when the shell is
> invoked (see sh), [...]

It would be equally absurd to claim that this allows
  sh -c 'echo $1' - hello
to print a blank line because the initial assignment of the positional
parameters may happen after the first expansion of $1.

In the same way, I think that except when overridden by 2.11, the "when"
in "Otherwise, the argument action shall be read and executed by the
shell when one of the corresponding conditions arises." should be
interpreted as "as soon as".


Cheers,
Harald van Dijk



Re: "wait" loses signals

2020-02-20 Thread Robert Elz
Date: Thu, 20 Feb 2020 09:16:05 +
From: Harald van Dijk
Message-ID:  

  | In that case, I think we can interpret the "when" in the description
  | of the trap command literally except when 2.11 overrides it.

I think it should be interpreted just like its normal English usage,
as in:

when I win the lottery I am going to buy a Ferrari
or
I am going to buy a Ferrari when I win the lottery

(which both say the same thing).

It doesn't mean that the instant the lottery winnings arrive (tomorrow
please!) I will be at the luxury imported car dealers, rather it states
a pre-condition which will trigger an event which is to follow, sometime,
thereafter.

Thus
When one of the corresponding conditions arises (standards
speak for "when a signal has been delivered") the argument
action shall be read and executed...

is "sometime after a signal has been delivered, run the trap action".

kre




Re: "wait" loses signals

2020-02-20 Thread Chet Ramey
On 2/20/20 3:02 AM, Denys Vlasenko wrote:
> On 2/19/20 9:30 PM, Chet Ramey wrote:
>> On 2/19/20 5:29 AM, Denys Vlasenko wrote:
>>> A bug report from Harald van Dijk:
>>>
>>> test2.sh:
>>> trap 'kill $!; exit' TERM
>>> { kill $$; exec sleep 9; } &
>>> wait $!
>>>
>>> The above script ought to exit quickly, and not leave a stray
>>> "sleep" child:
>>> (1) if "kill $$" signal is delivered before "wait",
>>> then TERM trap will kill the child, and exit.
>>
>> This strikes me as a shaky assumption, dependent on when the shell receives
>> the SIGTERM and when it runs traps.
> 
> The undisputable fact is that after shell forks a child
> to run the "{...} &" subshell, it will receive the SIGTERM signal.
> 
> And since it has a trap for it, it should be run.
> 
>> (There's nothing in POSIX that says
>> when pending traps are processed. Bash runs them after commands.)
> 
> Yes, and here we are "after command", specifically after "{...} &" command.
> Since we got a trapped signal, we must run its trap.

Did you look at the scenario in my message?

Keep in mind that you can't run the trap out of the signal handler.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: "wait" loses signals

2020-02-20 Thread Harald van Dijk

On 20/02/2020 01:55, Robert Elz wrote:
> Date: Wed, 19 Feb 2020 23:53:56 +
> From: Harald van Dijk
> Message-ID:  <9b9d435b-3d2f-99bd-eb3d-4a676ce89...@gigawatt.nl>
>
>   | POSIX says in the description of the trap command "Otherwise, the
>   | argument action shall be read and executed by the shell when one of the
>   | corresponding conditions arises." Because it says "when", not "after",
>   | if interpreted literally, it does not even allow waiting until the
>   | current command finishes executing.
>
> You need to look at XCU 2.11 not just the description of the trap command
> itself.

Ah, thanks, that makes an exception for when the shell is waiting for a
command to complete. It's the same as what bash documents. In that case,
I think we can interpret the "when" in the description of the trap
command literally except when 2.11 overrides it.


Cheers,
Harald van Dijk



Re: "wait" loses signals

2020-02-20 Thread Denys Vlasenko

On 2/19/20 9:30 PM, Chet Ramey wrote:
> On 2/19/20 5:29 AM, Denys Vlasenko wrote:
>> A bug report from Harald van Dijk:
>>
>> test2.sh:
>> trap 'kill $!; exit' TERM
>> { kill $$; exec sleep 9; } &
>> wait $!
>>
>> The above script ought to exit quickly, and not leave a stray
>> "sleep" child:
>> (1) if "kill $$" signal is delivered before "wait",
>> then TERM trap will kill the child, and exit.
>
> This strikes me as a shaky assumption, dependent on when the shell receives
> the SIGTERM and when it runs traps.

The undisputable fact is that after shell forks a child
to run the "{...} &" subshell, it will receive the SIGTERM signal.

And since it has a trap for it, it should be run.

> (There's nothing in POSIX that says
> when pending traps are processed. Bash runs them after commands.)

Yes, and here we are "after command", specifically after "{...} &" command.
Since we got a trapped signal, we must run its trap.




Re: "wait" loses signals

2020-02-19 Thread Robert Elz
Date: Wed, 19 Feb 2020 23:53:56 +
From: Harald van Dijk
Message-ID:  <9b9d435b-3d2f-99bd-eb3d-4a676ce89...@gigawatt.nl>


  | POSIX says in the description of the trap command "Otherwise, the 
  | argument action shall be read and executed by the shell when one of the 
  | corresponding conditions arises." Because it says "when", not "after", 
  | if interpreted literally, it does not even allow waiting until the 
  | current command finishes executing.

You need to look at XCU 2.11 not just the description of the trap command
itself.

kre




Re: "wait" loses signals

2020-02-19 Thread Harald van Dijk

On 19/02/2020 20:30, Chet Ramey wrote:
> On 2/19/20 5:29 AM, Denys Vlasenko wrote:
>> A bug report from Harald van Dijk:
>>
>> test2.sh:
>> trap 'kill $!; exit' TERM
>> { kill $$; exec sleep 9; } &
>> wait $!
>>
>> The above script ought to exit quickly, and not leave a stray
>> "sleep" child:
>> (1) if "kill $$" signal is delivered before "wait",
>> then TERM trap will kill the child, and exit.
>
> This strikes me as a shaky assumption, dependent on when the shell receives
> the SIGTERM and when it runs traps. (There's nothing in POSIX that says
> when pending traps are processed. Bash runs them after commands.)


The bash documentation says traps will not be executed until the command 
completes if it receives a signal while waiting for the command to 
complete, but it does not say the same for when it receives a signal 
before waiting for a command to complete. This may be an oversight in 
the documentation.


POSIX says in the description of the trap command "Otherwise, the 
argument action shall be read and executed by the shell when one of the 
corresponding conditions arises." Because it says "when", not "after", 
if interpreted literally, it does not even allow waiting until the 
current command finishes executing. I realise that that is definitely 
not the way it is meant to be interpreted, but I am not sure what is. I 
consider the assumption that the test script is supposed to work a 
reasonable one, but it is possible that this is considered strictly a 
QoI issue.


But to be clear, regardless of what POSIX requires, I was less concerned 
with prodding other shell authors into changing their shells and more 
with seeing what I can do in my shell. I want to have a shell that is 
capable of handling scripts like this, but it is fine with me if other 
shells do not share that as a goal.


Thanks for looking into this despite your scepticism on the validity of 
the test. Your description of what happens in bash when this ends up 
sleeping probably applies to all shells that behave the same way.


Cheers,
Harald van Dijk



Re: "wait" loses signals

2020-02-19 Thread Chet Ramey
On 2/19/20 5:29 AM, Denys Vlasenko wrote:
> A bug report from Harald van Dijk:
> 
> test2.sh:
> trap 'kill $!; exit' TERM
> { kill $$; exec sleep 9; } &
> wait $!
> 
> The above script ought to exit quickly, and not leave a stray
> "sleep" child:
> (1) if "kill $$" signal is delivered before "wait",
> then TERM trap will kill the child, and exit.

This strikes me as a shaky assumption, dependent on when the shell receives
the SIGTERM and when it runs traps. (There's nothing in POSIX that says
when pending traps are processed. Bash runs them after commands.)

> (2) if "kill $$" signal is delivered to "wait",
> it must be interrupted by the signal,
> then TERM trap will kill the child, and exit.

This is well-defined by POSIX.

> 
> The helper to loop the above:
> 
> test1.sh:
> i=1
> while test "$i" -lt 10; do
>  echo "$i"
>  "$@" test2.sh
>  i=$((i + 1))
> done
> 
> To run: sh test1.sh 
> 
> bash 4.4.23 fails pretty quickly:
> 
> $ sh test1.sh bash
> 1
> ...
> 581
> _ 

It seems inherently racy. I ran this with a lightly-instrumented bash
and discovered that signals that arrived when `wait' was running were
always processed correctly and killed the process. There were a few
times when the signal arrived while `wait' was not running, and some
of these cases did not interrupt wait or cause trap execution.

Consider this scenario.

1. Bash forks and starts the background process
2. The parent fork returns
3. The parent bash checks for traps, and finds none
4. SIGTERM arrives, the trap signal handler sets a `pending trap' flag
   for SIGTERM
5. The parent shell runs the `wait' builtin.
6. `wait' is not interrupted by a signal, runs to completion, and the
   trap runs

The window for this is extremely small. I just ran the scripts on RHEL7
and had to go through the loop script multiple times before I saw the
9-second sleep. I saw it more often on Mac OS X, so the scheduler
probably plays a role.
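
For scripts that have to run on shells with this window, one script-level
way to bound the delay is to poll instead of blocking in `wait'. This is
only a sketch and not something proposed in this thread: the got_term
flag and the one-second poll interval are illustrative assumptions. It
relies on the behaviour described above, that pending traps run after
each command completes.

```shell
got_term=0
# Only record the signal; the main flow acts on the flag.
trap 'got_term=1' TERM

{ kill -s TERM $$; exec sleep 9; } &
child=$!

# Poll rather than block in wait: a trap that became pending at any point
# runs after the current command, so the flag is seen within roughly one
# poll interval. kill -0 only tests whether the PID can still be
# signalled; how soon an exited child stops being signallable depends on
# when the shell reaps it.
while kill -0 "$child" 2>/dev/null && [ "$got_term" -eq 0 ]; do
    sleep 1
done

if [ "$got_term" -eq 1 ]; then
    kill "$child" 2>/dev/null
fi

wait "$child"
child_status=$?
echo "got_term=$got_term child_status=$child_status"
```

The cost is latency of up to one poll interval and some wasted wakeups,
which is why it is a workaround rather than a fix.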

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



"wait" loses signals

2020-02-19 Thread Denys Vlasenko

A bug report from Harald van Dijk:

test2.sh:
trap 'kill $!; exit' TERM
{ kill $$; exec sleep 9; } &
wait $!

The above script ought to exit quickly, and not leave a stray
"sleep" child:
(1) if "kill $$" signal is delivered before "wait",
then TERM trap will kill the child, and exit.
(2) if "kill $$" signal is delivered to "wait",
it must be interrupted by the signal,
then TERM trap will kill the child, and exit.

The helper to loop the above:

test1.sh:
i=1
while test "$i" -lt 10; do
 echo "$i"
 "$@" test2.sh
 i=$((i + 1))
done

To run: sh test1.sh 

bash 4.4.23 fails pretty quickly:

$ sh test1.sh bash
1
...
581
_ 

Under strace, it seems that "wait" enters wait4() syscall
and waits for the child. (The fact that the pause is
9 seconds is another hint).