Simon Marlow wrote:

> > The fine points of Unix signal semantics have always been somewhat
> > mysterious to me.  However, after digging around in man pages for a
> > while, I have a theory as to what's going "wrong"...
>
> Yes, your diagnosis looks very plausible.
>
> The right way, I believe, to handle this in your signal handler is to
> call getAnyProcessStatus repeatedly until it doesn't return any more
> children (not forgetting to use the non-blocking version, i.e. the first
> arg should be False).  Does that help?

I had thought of having the signal handler reap as many terminated child
processes as possible, but had been concerned about a possible race
condition.  After you suggested that approach, I thought some more and
decided that no race problem should exist.  So I've implemented multiple
reaping and it does help.  I no longer have any tests hang as before.
(Note that I still do see the occasional "EVACUATED object entered!"
error.)  However, the implementation turned out to be surprisingly complex.

The first issue I confronted is that the get*ProcessStatus routines return
an error rather than "nothing" if there is no candidate child process.
(The GHC routines simply reflect the system call semantics.)  This required
me to maintain a count of child processes so I could avoid trying to reap
nonexistent children.  Fortunately, adding the counting to my monad was not
difficult.

But guarding only the later reap attempts in the signal handler was
insufficient.  I apparently had signal handler instances that ran with no
children left to reap at all.  I figure something like the following must
be happening.  A sigCHLD signal comes in and a signal handler instance is
created.  Before that instance runs, sigCHLD is unblocked, so a second
sigCHLD signal comes in and a second signal handler instance is created.
Whichever instance runs first reaps both terminated children; the other
then runs and finds nothing to reap.

I bullet-proofed my logic, so that the signal handler conditions even its
first getAnyProcessStatus call on a nonzero child count.  (By the way, I
had to be careful to lock access to the child count appropriately.)  The
result now seems to work properly.
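
For readers who want something concrete, here is a minimal sketch of the
scheme described above.  It is not my actual code.  It assumes the
present-day `unix` package, where getAnyProcessStatus returns a Maybe and
raises an exception when the process has no children at all.  The names
ChildCount, newChildCount, forkChild, reapChildren, liveChildren and
installReaper are hypothetical, invented for this illustration.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar, newMVar, readMVar)
import System.Posix.Process (ProcessStatus, forkProcess, getAnyProcessStatus)
import System.Posix.Signals (Handler (Catch), installHandler, sigCHLD)
import System.Posix.Types (ProcessID)

-- Hypothetical: number of children forked but not yet reaped.
-- The MVar doubles as the lock that guards the count.
type ChildCount = MVar Int

newChildCount :: IO ChildCount
newChildCount = newMVar 0

-- Current number of unreaped children.
liveChildren :: ChildCount -> IO Int
liveChildren = readMVar

-- Fork a child while holding the counter, so that a handler thread
-- cannot run between the fork and the increment and see a stale count.
forkChild :: ChildCount -> IO () -> IO ProcessID
forkChild count action =
  modifyMVar count $ \n -> do
    pid <- forkProcess action
    return (n + 1, pid)

-- Reap every child that has already terminated, without blocking.
-- Every call to getAnyProcessStatus, including the first, is guarded by
-- a nonzero count, so an "extra" handler instance simply does nothing.
reapChildren :: ChildCount -> IO [(ProcessID, ProcessStatus)]
reapChildren count = modifyMVar count (go [])
  where
    go acc n
      | n <= 0 = return (n, reverse acc)
      | otherwise = do
          -- First argument False: do not block.
          result <- getAnyProcessStatus False False
          case result of
            Nothing -> return (n, reverse acc) -- children exist, none exited yet
            Just st -> go (st : acc) (n - 1)

-- Install the reaper as the sigCHLD handler; returns the previous handler.
installReaper :: ChildCount -> IO Handler
installReaper count =
  installHandler sigCHLD (Catch (reapChildren count >> return ())) Nothing
```

A program would create the counter once with newChildCount, call
installReaper on it, and then start every child through forkChild.  Taking
the MVar inside the handler is only reasonable because GHC runs a Catch
handler in an ordinary Haskell thread; a C signal handler could not safely
block on a lock this way.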

Two questions:

1. Though I'm now immune to seeing "too many" sigCHLD signals, I still rely
on seeing "enough" of them.  Can you think of any way that a signal could
go unnoticed in my scheme described above?

2. Is my supposition true, that sigCHLD is unblocked *before* invoking the
signal handler?  If so, I think this subtle but important semantic
difference between GHC RTS and POSIX signal handling should be documented.
Are there other differences?

Dean

_______________________________________________
Glasgow-haskell-bugs mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs
