stressExitCode hits assert: Must be VMThread or JavaThread

Daniel D. Daugherty Thu, 11 Aug 2016 18:25:33 -0700

David,

Sorry I forgot to respond before I left for Santa Fe, NM...
More below...



On 8/8/16 5:57 PM, David Holmes wrote:

Hi Dan,

Thanks for the review.

On 9/08/2016 2:07 AM, Daniel D. Daugherty wrote:

On 8/4/16 8:28 PM, David Holmes wrote:

Hi Volker,

Thanks for looking at this.

On 5/08/2016 1:48 AM, Volker Simonis wrote:
Hi David,

thanks for doing this change on all platforms.
The fix looks good. Maybe you can just extend the following comment with
something like:

// Note that the SR_lock plays no role in this suspend/resume protocol.
// It is only used in SR_handler as a thread termination indicator if NULL.

Darn this code is confusing - too many "SR"'s :( I have added:
// Note that the SR_lock plays no role in this suspend/resume protocol,
// but is checked for NULL in SR_handler as a thread termination indicator.

Updated webrev:

http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/


src/share/vm/runtime/thread.cpp
    L380:   _SR_lock = NULL;
        I was expecting the _SR_lock to be freed and NULL'ed earlier
        based on the discussion in the bug report. Since the crashing
        assert() happens in a race between the JavaThread destructor
        and the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
        field to be dealt with as early as possible in the Thread
        destructor (or even earlier; see my last comment).


I will respond after that comment.

 src/os/linux/vm/os_linux.cpp
    L4010:   // mask is changed as part of thread termination. Check the current thread
        grammar?: "Check the current" -> "Check that the current"


Will change.

    L4015:   if (thread->SR_lock() == NULL)
    L4016:     return;
        style nit: multi-line if-statements require '{' and '}'
        Please add the braces or make this a single line if-statement.
        I would prefer the braces. :-)


Will fix.

        Isn't there still a window between the completion of the
        JavaThread destructor and where the Thread destructor sets
        _SR_lock = NULL?


See below.

    L4020:   OSThread* osthread = thread->osthread();
        Not your bug. This code assumes that osthread != NULL.
        Maybe it needs to be more robust.

Depends what kind of impossibilities we want to guard against. :)
There should be no possible way a signal can be sent to a thread that
doesn't even have an osThread as it means we never successfully
started/attached the thread.


That's a really good point. I'm good with what's there
for osthread.

src/os/aix/vm/os_aix.cpp
    L2731:   if (thread->SR_lock() == NULL)
    L2732:     return;
        Same style nit.

        Same race.

    L2736:   OSThread* osthread = thread->osthread();
        Same robustness comment.

src/os/bsd/vm/os_bsd.cpp
    L2759:   if (thread->SR_lock() == NULL)
    L2760:     return;
        Same style nit.

        Same race.

    L2764:   OSThread* osthread = thread->osthread();
        Same robustness comment.

It has been a very long time since I've dealt with races in the
suspend/resume code so I'm probably very rusty with this code.
If the _SR_lock is only used by the JavaThread suspend/resume
protocol, then we could consider free'ing and NULL'ing the field
in the JavaThread destructor (as the last piece of work).

That should eliminate the race that was being observed by the
SR_handler() in this bug. It will open a very small race where
is_Java_thread() can return true, the _SR_lock field is !NULL,
but the _SR_lock has been deleted.

Given that it should have been impossible to get into the SR_handler
in the first place from this code I was trying to minimize the
disruption to the existing logic. Moving the delete/NULLing to just
before the call to os::free_thread() fixes the crashes that had been
observed. I was not trying to make the entire destruction sequence
safe wrt. the SR_handler.


I suspect it is the combination of 1) NULLing the _SR_lock as a sentinel and
2) doing that before the more expensive os::free_thread() call that results
in the change in behavior.

My major concern with deleting the SR_lock much earlier is the
potential race condition that I have previously outlined in:
https://bugs.openjdk.java.net/browse/JDK-8152849
where there is no protection against a target thread terminating. The
sooner it terminates and deletes the SR_lock the more likely we may
attempt to lock a deleted lock!


Ah yes... thanks for the reminder. We have seen a few of those in the
past where we're racing to grab the _SR_lock and Elvis is trying to
leave the building...

I'm good with just the minor changes you agreed to make above. I don't
think I need to see a new webrev for the above edits.

Thumbs up!

Dan


Thanks,
David

Dan


This also reminded me to follow up on why the Solaris SR_handler is
different and I found it is not actually installed as a direct signal
handler, but is called from the real signal handler if dealing with a
JavaThread or the VMThread. Consequently the Solaris version of the
SR_handler can not encounter this specific bug and so I have reverted
the changes to os_solaris.cpp

Thanks,
David

Regards,
Volker

On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.hol...@oracle.com> wrote:

    webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/

    bug: https://bugs.openjdk.java.net/browse/JDK-8159461

    The suspend/resume signal (SR_signum) is never sent to a thread once
    it has started to terminate. On one platform (SuSE 12) we have seen
    what appears to be a "stuck" signal, which is only delivered when
    the terminating thread restores its original signal mask (as if
    pthread_sigmask makes the system realize there is a pending signal -
    we already check the signal was not blocked). At this point in the
    thread termination we have freed the osthread, so the SR_handler
    would access deallocated memory. In debug builds we first hit an
    assertion that the current thread is a JavaThread or the VMThread -
    that assertion fails, even though it is a JavaThread, because we
    have already executed the ~JavaThread destructor and inside the
    ~Thread destructor we are a plain Thread not a JavaThread.
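
The "plain Thread" effect described above is ordinary C++ behaviour: once the derived destructor has finished, virtual calls made from the base destructor resolve to the base class. A minimal sketch follows, assuming hypothetical stand-in classes (DemoThread, DemoJavaThread) and a hypothetical helper type_seen_in_base_destructor(); none of this is the HotSpot code itself.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for HotSpot's Thread and JavaThread; only the
// virtual-dispatch behaviour during destruction is modelled here.
struct DemoThread {
  explicit DemoThread(std::string* seen) : seen_(seen) {}

  virtual bool is_Java_thread() const { return false; }

  virtual ~DemoThread() {
    // The derived part has already been destroyed, so this virtual call
    // resolves to DemoThread::is_Java_thread() and returns false.
    *seen_ = is_Java_thread() ? "JavaThread" : "plain Thread";
  }

  std::string* seen_;  // where the destructor records what it observed
};

struct DemoJavaThread : DemoThread {
  explicit DemoJavaThread(std::string* seen) : DemoThread(seen) {}
  bool is_Java_thread() const override { return true; }
};

// Returns what the base-class destructor believed the object to be.
std::string type_seen_in_base_destructor() {
  std::string seen;
  {
    DemoJavaThread t(&seen);
    assert(t.is_Java_thread());  // while fully constructed: a JavaThread
  }  // ~DemoJavaThread runs first, then ~DemoThread
  return seen;
}
```

Calling type_seen_in_base_destructor() yields "plain Thread", which is the same reason a type assertion evaluated during the base destructor would fail for an object that started life as a JavaThread.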

    The fix was to make a small adjustment to the thread termination
    process so that we delete the SR_lock before calling
    os::free_thread(). In the SR_handler() we can then use a NULL check
    of SR_lock() to indicate the thread has terminated and we return.
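
The ordering described above can be sketched in a single-threaded model. Everything here is an assumption made for illustration: MockThread, MockLock, MockOSThread, mock_SR_handler() and simulate_late_signal() are hypothetical names, a plain delete stands in for os::free_thread(), and the late signal is simulated by calling the handler from the destructor rather than by a real signal. It does not model the actual races or signal-safety constraints.

```cpp
#include <cassert>

// Hypothetical stand-ins; not the HotSpot types.
struct MockLock {};
struct MockOSThread { int sr_requests = 0; };

struct Counts { int handled = 0; int skipped = 0; };
static Counts g_counts;

struct MockThread;
void mock_SR_handler(MockThread* thread);

struct MockThread {
  MockLock* _SR_lock = new MockLock();
  MockOSThread* _osthread = new MockOSThread();

  MockLock* SR_lock() const { return _SR_lock; }
  MockOSThread* osthread() const { return _osthread; }

  ~MockThread() {
    // Step 1: delete the lock and NULL the field, so it acts as a
    // "thread is terminating" sentinel.
    delete _SR_lock;
    _SR_lock = nullptr;
    // Step 2: free the osthread (stands in for os::free_thread()).
    delete _osthread;
    _osthread = nullptr;
    // Simulate a late "stuck" signal arriving during termination.
    mock_SR_handler(this);
  }
};

void mock_SR_handler(MockThread* thread) {
  // Sentinel check: a NULL SR_lock means the thread is terminating,
  // so the handler must not touch the osthread.
  if (thread->SR_lock() == nullptr) {
    ++g_counts.skipped;
    return;
  }
  thread->osthread()->sr_requests++;
  ++g_counts.handled;
}

// Delivers one "signal" to a live thread and one during its destruction.
Counts simulate_late_signal() {
  g_counts = Counts();
  {
    MockThread t;
    mock_SR_handler(&t);  // live thread: handled normally
  }                       // destructor: late signal is skipped
  return g_counts;
}
```

In this model the handler processes the first request and returns early for the one raised during destruction; without the sentinel check the second call would dereference the freed osthread pointer.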

    While only seen on Linux I took the opportunity to apply the fix on
    all platforms and also cleaned up the code where we were using
    Thread::current() unsafely in a signal-handling context.

    Testing: regular tier 1 (JPRT)
             Kitchensink (in progress)

    As we can't readily reproduce the problem I tested this by having a
    terminating thread raise SR_signum directly from within the ~Thread
    destructor.

    Thanks,
    David

Re: (S) RFR: 8159461: bigapps/Kitchensink/stressExitCode hits assert: Must be VMThread or JavaThread
