On 8/4/16 8:28 PM, David Holmes wrote:
Hi Volker,

Thanks for looking at this.

On 5/08/2016 1:48 AM, Volker Simonis wrote:
Hi David,

thanks for doing this change on all platforms.
The fix looks good. Maybe you can just extend the following comment with
something like:

// Note that the SR_lock plays no role in this suspend/resume protocol.
 //  It is only used in SR_handler as a thread termination indicator if
NULL.

Darn this code is confusing - too many "SR"'s :( I have added

//  Note that the SR_lock plays no role in this suspend/resume protocol,
// but is checked for NULL in SR_handler as a thread termination indicator.

Updated webrev:

http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/

src/share/vm/runtime/thread.cpp
    L380:   _SR_lock = NULL;
        I was expecting the _SR_lock to be freed and NULL'ed earlier
        based on the discussion in the bug report. Since the crashing
        assert() happens in a race between the JavaThread destructor
        the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
        field to be dealt with as early as possible in the Thread
        destructor (or even earlier; see my last comment).

 src/os/linux/vm/os_linux.cpp
L4010: // mask is changed as part of thread termination. Check the current thread
        grammar?: "Check the current" -> "Check that the current"

    L4015:   if (thread->SR_lock() == NULL)
    L4016:     return;
        style nit: multi-line if-statements require '{' and '}'
        Please add the braces or make this a single line if-statement.
        I would prefer the braces. :-)

        Isn't there still a window between the completion of the
        JavaThread destructor and where the Thread destructor sets
        _SR_lock = NULL?

    L4020:   OSThread* osthread = thread->osthread();
        Not your bug. This code assumes that osthread != NULL.
        Maybe it needs to be more robust.

src/os/aix/vm/os_aix.cpp
    L2731:   if (thread->SR_lock() == NULL)
    L2732:     return;
        Same style nit.

        Same race.

    L2736:   OSThread* osthread = thread->osthread();
        Same robustness comment.

src/os/bsd/vm/os_bsd.cpp
    L2759:   if (thread->SR_lock() == NULL)
    L2760:     return;
        Same style nit.

        Same race.

    L2764:   OSThread* osthread = thread->osthread();
        Same robustness comment.

It has been a very long time since I've dealt with races in the
suspend/resume code so I'm probably very rusty with this code.
If the _SR_lock is only used by the JavaThread suspend/resume
protocol, then we could consider free'ing and NULL'ing the field
in the JavaThread destructor (as the last piece of work).

That should eliminate the race that was being observed by the
SR_handler() in this bug. It will open a very small race where
is_Java_thread() can return true, the _SR_lock field is !NULL,
but the _SR_lock has been deleted.

Dan



This also reminded me to follow up on why the Solaris SR_handler is different and I found it is not actually installed as a direct signal handler, but is called from the real signal handler if dealing with a JavaThread or the VMThread. Consequently the Solaris version of the SR_handler can not encounter this specific bug and so I have reverted the changes to os_solaris.cpp

Thanks,
David


Regards,
Volker

On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.hol...@oracle.com
<mailto:david.hol...@oracle.com>> wrote:

    webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
    <http://cr.openjdk.java.net/~dholmes/8159461/webrev/>

    bug: https://bugs.openjdk.java.net/browse/JDK-8159461
    <https://bugs.openjdk.java.net/browse/JDK-8159461>

    The suspend/resume signal (SR_signum) is never sent to a thread once
    it has started to terminate. On one platform (SuSE 12) we have seen
    what appears to be a "stuck" signal, which is only delivered when
    the terminating thread restores its original signal mask (as if
    pthread_sigmask makes the system realize there is a pending signal -
    we already check the signal was not blocked). At this point in the
    thread termination we have freed the osthread, so the the SR_handler
    would access deallocated memory. In debug builds we first hit an
    assertion that the current thread is a JavaThread or the VMThread -
    that assertion fails, even though it is a JavaThread, because we
    have already executed the ~JavaThread destructor and inside the
    ~Thread destructor we are a plain Thread not a JavaThread.

    The fix was to make a small adjustment to the thread termination
    process so that we delete the SR_lock before calling
    os::free_thread(). In the SR_handler() we can then use a NULL check
    of SR_lock() to indicate the thread has terminated and we return.

    While only seen on Linux I took the opportunity to apply the fix on
    all platforms and also cleaned up the code where we were using
    Thread::current() unsafely in a signal-handling context.

    Testing: regular tier 1 (JPRT)
             Kitchensink (in progress)

    As we can't readily reproduce the problem I tested this by having a
    terminating thread raise SR_signum directly from within the ~Thread
    destructor.

    Thanks,
    David



Reply via email to