David,
Thanks for the review. Obviously I can't list you as a reviewer
since I already pushed the changeset...
On 6/29/14 11:54 PM, David Holmes wrote:
Correction ...
On 30/06/2014 3:33 PM, David Holmes wrote:
Hi Dan,
I see this has already gone in but I think it is worth looking closer
at this.
On 28/06/2014 2:18 AM, Daniel D. Daugherty wrote:
Greetings,
I have a fix ready for the following bug:
8047720 Xprof hangs on Solaris
https://bugs.openjdk.java.net/browse/JDK-8047720
Here is the webrev URL:
http://cr.openjdk.java.net/~dcubed/8047720-webrev/0-jdk9-hs-rt/
This deadlock occurred between the following threads:
Main thread - Trying to stop the WatcherThread as part of
shutting down the VM; this thread is blocked
on the PeriodicTask_lock which keeps it from
reaching a safepoint.
WatcherThread - Requested a VM_ForceSafepoint to complete
a JavaThread::java_suspend() call as part
of a FlatProfiler record_thread_ticks()
call; this thread owns the PeriodicTask_lock
since it is processing a periodic task.
VMThread - Trying to start a safepoint; this thread is
blocked waiting for the Main thread to reach
a safepoint.
The PeriodicTask_lock is one of the VM internal locks and is
typically managed using Mutex::_no_safepoint_check_flag to
avoid deadlocks. Yes, the irony is dripping on the floor... :-)
What was overlooked here is that the holder of a lock that is acquired
without safepoint checks must never block at a safepoint whilst holding
that lock.
I'm not sure about the above rule when you're talking about
internal VM threads that are really JavaThreads. JavaThreads
can run into all kinds of our special logic when doing thread
state transitions and the like. JavaThreads can also post JVM/TI
events which can lead to blocking at safepoints or blocking on
JVM/TI raw monitors in event handlers, etc...
In this case the blocking is indirect, caused by the
synchronous nature of the VM_Operation, rather than a direct result of
"blocking for the safepoint" (which the WatcherThread does not
participate in).
Right. The WatcherThread intentionally does not participate in
safepoints when using the PeriodicTask_lock. To me, that means
that other threads using that lock should not participate in
the safepoint protocol.
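For reference, the usual no-safepoint-check acquisition pattern looks
roughly like this (an illustrative sketch of the common HotSpot idiom,
not code quoted from the webrev):
    {
      // Acquired without a safepoint check: the acquiring thread does not
      // block for a safepoint while contending for the lock...
      MutexLockerEx ml(PeriodicTask_lock, Mutex::_no_safepoint_check_flag);

      // ...but in exchange the holder must not block at a safepoint while
      // the lock is held, e.g. by executing a synchronous VM operation.
    }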
I believe we're trying to be consistent about how we use locks
with respect to honoring the safepoint protocol or not. Markus G
has a bug fix in process where he's trying to add sanity checks
that detect inconsistent use of locks. In this particular case,
I believe that the Main thread's use of the PeriodicTask_lock
with the safepoint protocol enabled would be seen as inconsistent.
However, I may not have all the details for what Markus is doing.
I wonder if the WatcherThread should really be using
the async variant of VM_ForceSafepoint here?
The WatcherThread is trying to do a JavaThread::java_suspend()
call. JavaThread::java_suspend() cannot return to the caller
until it is sure that the target thread will not execute any
more bytecodes. Switching to an async VM_ForceSafepoint would
break that invariant and cause subtle deadlocks because a target
JavaThread could acquire a Java monitor when the suspend
requesting thread thinks it is suspended. Been down that road
before and it is not pretty.
The interesting part of this deadlock is that I think that it
is possible for other periodic tasks to hit it. Anything that
causes the WatcherThread to start a safepoint while processing
a periodic task should be susceptible to this race. Think about
the -XX:+DeoptimizeALot option and how it causes VM_Deopt
requests on thread state transitions... Interesting...
I don't think so. You need three threads involved to get the deadlock.
But that isn't the point. As you state, this deadlock at VM shutdown
could impact any synchronous safepoint operations executed by the
WatcherThread.
Right. The problem is that the WatcherThread holds the
PeriodicTask_lock across the potential for VM operations
so we have to be very, very careful with non-WatcherThread
uses of that lock to avoid deadlocks.
That aside ...
In the current case the main thread's locking of the PeriodicTask_lock
without a safepoint check is what causes the problem - that violates the
rules surrounding use of "no safepoint checks". The other methods that a
JavaThread might call that acquire the PeriodicTask_lock do perform the
safepoint checks, so they wouldn't deadlock. Hence it seems to me that
only WatcherThread::stop can lead to this problem. And as
WatcherThread::stop is only called from before_exit, and that can only
be called once, it seems to me that we could/should actually acquire the
lock with a safepoint check.
Agreed that only the WatcherThread::stop() call can lead to this deadlock
because it is that call that catches the PeriodicTask_lock being held
across a VM operation. That's the heart of this deadlock and we're only
differing on how to solve it.
The primary purpose of WatcherThread::stop() is setting the
_should_terminate flag so that the WatcherThread knows that it should
stop dispatching periodic tasks. _should_terminate is already volatile
and the existing code does an OrderAccess::fence() just to be sure the
update is seen. The secondary purpose of WatcherThread::stop() is waking
up the WatcherThread so that it stops in a more timely fashion. The
WatcherThread waits for a fixed amount of time so it will eventually
wake up and see the new _should_terminate value.
If the caller of WatcherThread::stop() cannot get the PeriodicTask_lock,
then that means the WatcherThread is active so WatcherThread::stop() only
has to set the flag. It is possible for the WatcherThread::stop() call to
race with the WatcherThread dropping the PeriodicTask_lock and waiting.
In that case, the WatcherThread waits for a fixed amount of time so it
will eventually wake up and see the new _should_terminate value.
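In other words, the shape I'm describing is roughly the following (an
illustrative sketch only; see the webrev for the actual code):
    void WatcherThread::stop() {
      // Never block on PeriodicTask_lock; if we can't get it, the
      // WatcherThread is busy and will see the flag on its next wakeup.
      bool have_lock = PeriodicTask_lock->try_lock();

      _should_terminate = true;   // volatile
      OrderAccess::fence();       // just to be sure the update is seen

      if (have_lock) {
        // The WatcherThread is (or soon will be) waiting on the lock,
        // so wake it up for a more timely shutdown.
        PeriodicTask_lock->notify();
        PeriodicTask_lock->unlock();
      }

      // ... followed by the Terminator_lock wait loop quoted below ...
    }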
I intentionally don't want WatcherThread::stop() to honor the safepoint
protocol when accessing the PeriodicTask_lock so that we remain consistent
with our use of that lock. See my comments above about Markus' work in
this area.
WatcherThread::stop() will honor the safepoint protocol a couple of lines
down from my changes:
1380 // it is ok to take late safepoints here, if needed
1381 MutexLocker mu(Terminator_lock);
The subsequent wait() loop also honors the safepoint protocol.
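For context, that part of WatcherThread::stop() has roughly this shape
(reconstructed for illustration; the webrev has the exact code):
    // it is ok to take late safepoints here, if needed
    MutexLocker mu(Terminator_lock);
    while (watcher_thread() != NULL) {
      // wait() with its default arguments performs safepoint checks
      Terminator_lock->wait();
    }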
Dan
David
-----
Cheers,
David
Testing:
- I found a way to add delays to the right spots in the
VM to make the deadlock reproduce in just about every
run of the test associated with the bug. The new
os::naked_short_sleep() function is your friend. Thanks
to Fred for adding that! See the bug report for the
debugging diffs.
- 72 hours of running the test in the bug report with
delays enabled for product, fastdebug and jvmg bits
in parallel on my Solaris X86 server.
- JPRT test run
- Aurora Adhoc results are in process; we're having issues
with both a broken testbase build and infra problems
with results not being uploaded.