On 04/17/2014 09:07 AM, Martin Buchholz wrote:
Many possible solutions eventually fail because whatever we do cannot take ownership of any global resource. Calling waitid on all child processes, even with WNOWAIT and WNOHANG, changes global state (what if another subprocess library in the same process is trying to do the same thing?)

waitid(P_ALL, ..., WNOWAIT | WNOHANG) does not reap the child. It can be repeated multiple times. It can be used as a precursor to the real waitid/waitpid which reaps a child, but only if the child is "ours". The problem with this approach is what to do in the following scenario: the precursor waitid(P_ALL, ..., WNOWAIT | WNOHANG) returns a child that is not "ours", so we don't reap it. The "owner" of that child (a JNI library) does not reap its children promptly. We loop, repeatedly getting the same child as a result, not seeing any other children that have exited in the meantime...
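
A minimal C sketch of the non-reaping peek described above, assuming a POSIX system and a process with no other exited children. The helper names peek_exited_child and demo_peek_is_repeatable are hypothetical, chosen only for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose waitid() and WNOWAIT under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Peek at any exited child without reaping it.
 * Returns the child's PID, 0 if no child has exited yet, or -1 on error
 * (for example ECHILD when there are no children at all). */
static pid_t peek_exited_child(void)
{
    siginfo_t info;
    memset(&info, 0, sizeof info);  /* so si_pid stays 0 when nothing is waitable */
    if (waitid(P_ALL, 0, &info, WEXITED | WNOHANG | WNOWAIT) == -1)
        return -1;
    return info.si_pid;
}

/* Demo: one child exits, we peek three times, then reap it for real.
 * Returns 1 if every peek reported the same unreaped child and the final
 * waitpid() still collected its exit status. */
static int demo_peek_is_repeatable(void)
{
    pid_t child = fork();
    if (child == -1)
        return 0;
    if (child == 0)
        _exit(7);

    /* Block until the child has exited, but leave it waitable. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) == -1)
        return 0;
    assert(info.si_pid == child);

    /* Each peek returns the same child again because nothing reaps it. */
    for (int i = 0; i < 3; i++)
        if (peek_exited_child() != child)
            return 0;

    int status;
    if (waitpid(child, &status, 0) != child)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 7;
}
```

With a second, "foreign" exited child in the picture, the same repeatability is what causes the problem: POSIX does not say which waitable child a P_ALL peek reports, so the loop may keep seeing the one it must not reap.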

Regards, Peter



On Wed, Apr 16, 2014 at 3:34 PM, David M. Lloyd <david.ll...@redhat.com> wrote:

    On 04/16/2014 02:15 PM, Martin Buchholz wrote:

        On Mon, Apr 14, 2014 at 1:57 PM, Peter Levart
        <peter.lev...@gmail.com> wrote:


            There's already such a race in current implementation of
            Process.terminate(). It admittedly only concerns a small window
            between process exiting and the reaper thread managing to signal
            this state to the other threads wishing to terminate it at the
            same time, so it could happen that a KILL/TERM signal is sent to
            an already deceased PID which was re-used, but it doesn't happen
            in practice since PIDs are not re-used very soon typically.

            But I agree, waiting between listing children and sending them
            signals increases the chance of hitting a reused PID.


        We do rely on the OS not reusing a PID _immediately_.  We used to
        have bugs in this area where Process.destroy would send a signal
        to a pid that may have deceased arbitrarily long ago.


    It seems to me that the key to avoiding this is to ensure that
    waitpid() is not called until we know the PID is ready to be
    cleaned.  As long as waitpid() has not yet been called, we can be
    certain that the process still exists and is ours.  So the real
    question is, how can we know a process is dead without actually
    calling wait() (thereby making that knowledge useless)?

    The aforementioned /proc trick seems like one good way to do so
    without, say, spawning a plethora of threads (though at one
    additional FD per thread, it is not free either).  Unfortunately
    /proc is not ubiquitous, and even where it does exist, it's not
    standardized (thus its behavior probably cannot be relied upon
    absolutely).

    A simple solution may be to use a synchronized set of child PIDs,
    and set a SIGCHLD handler or waiter which, when triggered, locks
    the set and performs a series of waitid() operations with WNOHANG,
    processing all the process status updates.  The signalling APIs
    would be required to synchronize on the set to determine if the
    process in question is owned by the parent process.  Previously
    unknown processes can be "adopted" into this area by acquiring the
    synchronization and calling "waitpid()"+WNOHANG on the PID in
    question, and using the result to determine whether the PID should
    be added to the set (or whether we just reaped it - or whether it
    doesn't belong to us at all).

    As long as the process API is restricted to managing direct
    children, this should work and be safe across all POSIX-ish
    environments.  Note the potential downside that all children will
    be automatically reaped, which is possibly somewhat hostile to
    naïve JNI libraries or embedders. Selectively enabling the /proc
    trick can mitigate this downside on platforms which support it,
    however.
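
The synchronized-set scheme described above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names (spawn_tracked, reap_all_exited, kill_tracked, MAX_CHILDREN) and the fixed-size array are hypothetical, the child does nothing but exit, and the "adoption" of previously unknown PIDs is left out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64  /* hypothetical fixed capacity */

static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t live[MAX_CHILDREN];  /* PIDs of children not yet reaped */
static int live_count;

/* Caller must hold set_lock. Returns the index of pid, or -1. */
static int find_locked(pid_t pid)
{
    for (int i = 0; i < live_count; i++)
        if (live[i] == pid)
            return i;
    return -1;
}

/* Fork while holding the lock, so the reaper cannot observe the exit
 * before the PID is recorded. The child only calls _exit(). */
static pid_t spawn_tracked(int exit_code)
{
    pid_t pid = -1;
    pthread_mutex_lock(&set_lock);
    if (live_count < MAX_CHILDREN) {
        pid = fork();
        if (pid == 0)
            _exit(exit_code);
        if (pid > 0)
            live[live_count++] = pid;
    }
    pthread_mutex_unlock(&set_lock);
    return pid;
}

/* Reap every exited child and drop it from the set. Returns how many
 * tracked children were reaped. Note that untracked children are reaped
 * too, which is the downside mentioned above. */
static int reap_all_exited(void)
{
    int reaped = 0;
    pthread_mutex_lock(&set_lock);
    for (;;) {
        siginfo_t info;
        memset(&info, 0, sizeof info);
        if (waitid(P_ALL, 0, &info, WEXITED | WNOHANG) == -1 || info.si_pid == 0)
            break;  /* no children at all, or none has exited yet */
        int i = find_locked(info.si_pid);
        if (i >= 0) {
            assert(live_count > 0);
            live[i] = live[--live_count];  /* swap-remove */
            reaped++;
        }
    }
    pthread_mutex_unlock(&set_lock);
    return reaped;
}

/* Signal pid only while it is still in the set: an unreaped child's PID
 * cannot have been recycled. Fails with ESRCH otherwise. */
static int kill_tracked(pid_t pid, int sig)
{
    int rc = -1;
    pthread_mutex_lock(&set_lock);
    if (find_locked(pid) >= 0)
        rc = kill(pid, sig);
    else
        errno = ESRCH;
    pthread_mutex_unlock(&set_lock);
    return rc;
}
```

Since mutexes are not async-signal-safe, reap_all_exited() would have to run in a waiter thread woken by SIGCHLD (for example via sigwait()) rather than inside the signal handler itself.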

-- - DML


