David,
Sorry I forgot to respond before I left for Santa Fe, NM...
More below...
On 8/8/16 5:57 PM, David Holmes wrote:
Hi Dan,
Thanks for the review.
On 9/08/2016 2:07 AM, Daniel D. Daugherty wrote:
On 8/4/16 8:28 PM, David Holmes wrote:
Hi Volker,
Thanks for looking at this.
On 5/08/2016 1:48 AM, Volker Simonis wrote:
Hi David,
thanks for doing this change on all platforms.
The fix looks good. Maybe you can just extend the following comment
with
something like:
// Note that the SR_lock plays no role in this suspend/resume
protocol.
// It is only used in SR_handler as a thread termination
indicator if
NULL.
Darn this code is confusing - too many "SR"'s :( I have added
// Note that the SR_lock plays no role in this suspend/resume
protocol,
// but is checked for NULL in SR_handler as a thread termination
indicator.
Updated webrev:
http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/
src/share/vm/runtime/thread.cpp
L380: _SR_lock = NULL;
I was expecting the _SR_lock to be freed and NULL'ed earlier
based on the discussion in the bug report. Since the crashing
assert() happens in a race between the JavaThread destructor
the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
field to be dealt with as early as possible in the Thread
destructor (or even earlier; see my last comment).
I will respond after that comment.
src/os/linux/vm/os_linux.cpp
L4010: // mask is changed as part of thread termination. Check the
current thread
grammar?: "Check the current" -> "Check that the current"
Will change.
L4015: if (thread->SR_lock() == NULL)
L4016: return;
style nit: multi-line if-statements require '{' and '}'
Please add the braces or make this a single line if-statement.
I would prefer the braces. :-)
Will fix.
Isn't there still a window between the completion of the
JavaThread destructor and where the Thread destructor sets
_SR_lock = NULL?
See below.
L4020: OSThread* osthread = thread->osthread();
Not your bug. This code assumes that osthread != NULL.
Maybe it needs to be more robust.
Depends what kind of impossibilities we want to guard against. :)
There should be no possible way a signal can be sent to a thread that
doesn't even have a osThread as it means we never successfully
started/attached the thread.
That's a really good point. I'm good with what's there
for osthread.
src/os/aix/vm/os_aix.cpp
L2731: if (thread->SR_lock() == NULL)
L2732: return;
Same style nit.
Same race.
L2736: OSThread* osthread = thread->osthread();
Same robustness comment.
src/os/bsd/vm/os_bsd.cpp
L2759: if (thread->SR_lock() == NULL)
L2760: return;
Same style nit.
Same race.
L2764: OSThread* osthread = thread->osthread();
Same robustness comment.
It has been a very long time since I've dealt with races in the
suspend/resume code so I'm probably very rusty with this code.
If the _SR_lock is only used by the JavaThread suspend/resume
protocol, then we could consider free'ing and NULL'ing the field
in the JavaThread destructor (as the last piece of work).
That should eliminate the race that was being observed by the
SR_handler() in this bug. It will open a very small race where
is_Java_thread() can return true, the _SR_lock field is !NULL,
but the _SR_lock has been deleted.
Given that it should have been impossible to get into the SR_handler
in the first place from this code I was trying to minimize the
disruption to the existing logic. Moving the delete/NULLing to just
before the call to os::free_thread() fixes the crashes that had been
observed. I was not trying to make the entire destruction sequence
safe wrt. the SR_handler.
I suspect it is the combination of 1) NULLing the _SR_lock as a sentinel and
2) doing that before the more expensive os::free_thread() call that results
in the change in behavior.
My major concern with deleting the SR_lock much earlier is the
potential race condition that I have previously outlined in:
https://bugs.openjdk.java.net/browse/JDK-8152849
where there is no protection against a target thread terminating. The
sooner it terminates and deletes the SR_lock the more likely we may
attempt to lock a deleted lock!
Ah yes... thanks for the reminder. We have seen a few of those in the
past where we're racing to grab the _SR_lock and Elvis is trying to
leave the building...
I'm good with just the minor changes you agreed to make above. I don't
think I need to see a new webrev for the above edits.
Thumbs up!
Dan
Thanks,
David
Dan
This also reminded me to follow up on why the Solaris SR_handler is
different and I found it is not actually installed as a direct signal
handler, but is called from the real signal handler if dealing with a
JavaThread or the VMThread. Consequently the Solaris version of the
SR_handler can not encounter this specific bug and so I have reverted
the changes to os_solaris.cpp
Thanks,
David
Regards,
Volker
On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.hol...@oracle.com
<mailto:david.hol...@oracle.com>> wrote:
webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
<http://cr.openjdk.java.net/~dholmes/8159461/webrev/>
bug: https://bugs.openjdk.java.net/browse/JDK-8159461
<https://bugs.openjdk.java.net/browse/JDK-8159461>
The suspend/resume signal (SR_signum) is never sent to a thread
once
it has started to terminate. On one platform (SuSE 12) we have
seen
what appears to be a "stuck" signal, which is only delivered when
the terminating thread restores its original signal mask (as if
pthread_sigmask makes the system realize there is a pending
signal -
we already check the signal was not blocked). At this point in the
thread termination we have freed the osthread, so the the
SR_handler
would access deallocated memory. In debug builds we first hit an
assertion that the current thread is a JavaThread or the
VMThread -
that assertion fails, even though it is a JavaThread, because we
have already executed the ~JavaThread destructor and inside the
~Thread destructor we are a plain Thread not a JavaThread.
The fix was to make a small adjustment to the thread termination
process so that we delete the SR_lock before calling
os::free_thread(). In the SR_handler() we can then use a NULL
check
of SR_lock() to indicate the thread has terminated and we return.
While only seen on Linux I took the opportunity to apply the
fix on
all platforms and also cleaned up the code where we were using
Thread::current() unsafely in a signal-handling context.
Testing: regular tier 1 (JPRT)
Kitchensink (in progress)
As we can't readily reproduce the problem I tested this by
having a
terminating thread raise SR_signum directly from within the
~Thread
destructor.
Thanks,
David