Re: code review round 0 for ObjectMonitor-JVM/TI hang fix (8028073)

David Holmes Mon, 10 Feb 2014 22:38:50 -0800

Dan - thanks for an awesome analysis once again!

Yes mea culpa the code I was looking at was sans Mr Simms mostfortuitous change. (I'm suffering from repo overload these days.)

As per my previous email I could see that a timed-wait, or spuriouswakeup, could take us to the potentially problematic SpinAcquire,however I could not see how/if that thread could become _succ such thatan unpark directed to _succ could be consumed in the SpinAcquire park().


Thanks again,
David

On 11/02/2014 4:55 AM, Daniel D. Daugherty wrote:

On 2/9/14 8:37 PM, David Holmes wrote:

trimming content ...

On 8/02/2014 9:45 AM, Daniel D. Daugherty wrote:

On 2/7/14 2:56 PM, Karen Kinnear wrote:

3. Did I read the code correctly that the Thread::SpinAcquire can make
a timed park
call on the same thread's _ParkEvent? And that this is used to get on
and off the wait queue,
i.e. to acquire the WaitSetLock?
    Is there the same risk that a notify might be eaten here also?


As far as I can see, Thread::SpinAcquire() does not use a ParkEvent


It sure does:

void Thread::SpinAcquire (volatile int * adr, const char * LockName) {
  if (Atomic::cmpxchg (1, adr, 0) == 0) {
     return ;   // normal fast-path return
  }

  // Slow-path : We've encountered contention -- Spin/Yield/Block
strategy.
  TEVENT (SpinAcquire - ctx) ;
  int ctr = 0 ;
  int Yields = 0 ;
  for (;;) {
     while (*adr != 0) {
        ++ctr ;
        if ((ctr & 0xFFF) == 0 || !os::is_MP()) {
           if (Yields > 5) {
             // Consider using a simple NakedSleep() instead.
             // Then SpinAcquire could be called by non-JVM threads
             Thread::current()->_ParkEvent->park(1) ;


Ummmm... that's not the code I'm seeing...

src/share/vm/runtime/thread.cpp:

   4417  void Thread::SpinAcquire (volatile int * adr, const char *
LockName) {
   4418    if (Atomic::cmpxchg (1, adr, 0) == 0) {
   4419       return ;   // normal fast-path return
   4420    }
   4421
   4422    // Slow-path : We've encountered contention --
Spin/Yield/Block strategy.
   4423    TEVENT (SpinAcquire - ctx) ;
   4424    int ctr = 0 ;
   4425    int Yields = 0 ;
   4426    for (;;) {
   4427       while (*adr != 0) {
   4428          ++ctr ;
   4429          if ((ctr & 0xFFF) == 0 || !os::is_MP()) {
   4430             if (Yields > 5) {
   4431               os::naked_short_sleep(1);
   4432             } else {
   4433               os::NakedYield() ;
   4434               ++Yields ;
   4435             }
   4436          } else {
   4437             SpinPause() ;
   4438          }
   4439       }
   4440       if (Atomic::cmpxchg (1, adr, 0) == 0) return ;
   4441    }
   4442  }

Mr Simms recently changed the above code via:

changeset:   5832:5944dba4badc
user:        dsimms
date:        Fri Jan 24 09:28:47 2014 +0100
summary:     8028280: ParkEvent leak when running modified runThese
which only loads classes

os::naked_short_sleep() is new:

- BSD/MacOS X, Linux - uses nanosleep()
- Solaris - uses usleep()
- Windows - uses Sleep()

The fix for 8028280 was pushed to JDK9/hs-rt on 2014.01.24 and to JDK9/hs
on 2014.01.29. I don't see any signs that Mr Simm's fix will be backported
to JDK8u/HSX-25u (yet) so this part of the review thread might impact the
backport of my fix to earlier releases.

So considering Karen's question ... I can't tell for certain. :(

I do not think the SpinAcquire on grabbing the wait-set lock to add to
the wait-set can be an issue because we will only park in response to
the actual wait, and hence only get unparked due to a
notify/notifyAll, but at this point we still own the monitor so no
notify/notifyAll is possible.

However, for the removal from the wait-set a more complex analysis is
needed. To do the SpinAcquire we must still be flagged as TS_WAIT -
which means we have not been notified, but must be returning due to a
timeout (or spurious wakeup?). In such circumstances could we be
_succ? I don't think so but I'll leave it to Dan to confirm that part :)


So for HSX-25 and probably older...

There are four Thread::SpinAcquire() calls in the objectMonitor code:

     Thread::SpinAcquire (&_WaitSetLock, "WaitSet - add") ;
     Thread::SpinAcquire (&_WaitSetLock, "WaitSet - unlink") ;
     Thread::SpinAcquire (&_WaitSetLock, "WaitSet - notify") ;
     Thread::SpinAcquire (&_WaitSetLock, "WaitSet - notifyall") ;

We can easily rule out the "notify" and "notifyAll" uses since the
current thread owns the Java-level monitor and there are no events
to post in this part of the notify() or notifyAll() protocols.

For the "WaitSet - add" use, the current thread owns the Java-level
monitor and the thread has not been added as a waiter yet so another
thread cannot do the notify-exit-make-successor part of the protocol
yet.

For the "WaitSet - unlink" use:

src/share/vm/runtime/objectMonitor.cpp:

   1569       if (node.TState == ObjectWaiter::TS_WAIT) {
   1570           Thread::SpinAcquire (&_WaitSetLock, "WaitSet - unlink") ;
   1571           if (node.TState == ObjectWaiter::TS_WAIT) {
   1572              DequeueSpecificWaiter (&node) ;       // unlink
from WaitSet
   1573              assert(node._notified == 0, "invariant");
   1574              node.TState = ObjectWaiter::TS_RUN ;
   1575           }
   1576           Thread::SpinRelease (&_WaitSetLock) ;
   1577       }

It is the call on line 1570 above that gets us into this code:

src/share/vm/runtime/thread.cpp:

   4435  void Thread::SpinAcquire (volatile int * adr, const char *
LockName) {
   4436    if (Atomic::cmpxchg (1, adr, 0) == 0) {
   4437       return ;   // normal fast-path return
   4438    }
   4439
   4440    // Slow-path : We've encountered contention --
Spin/Yield/Block strategy.
   4441    TEVENT (SpinAcquire - ctx) ;
   4442    int ctr = 0 ;
   4443    int Yields = 0 ;
   4444    for (;;) {
   4445       while (*adr != 0) {
   4446          ++ctr ;
   4447          if ((ctr & 0xFFF) == 0 || !os::is_MP()) {
   4448             if (Yields > 5) {
   4449               // Consider using a simple NakedSleep() instead.
   4450               // Then SpinAcquire could be called by non-JVM
threads
   4451               Thread::current()->_ParkEvent->park(1) ;
   4452             } else {
   4453               os::NakedYield() ;
   4454               ++Yields ;
   4455             }
   4456          } else {
   4457             SpinPause() ;
   4458          }
   4459       }
   4460       if (Atomic::cmpxchg (1, adr, 0) == 0) return ;
   4461    }
   4462  }

And the above code can consume the unpark() on line 4451.

So how the heck do we get to line 1570???

Well, the target thread would have to be both notified and unparked
to be executing this code path. When the notify() code runs, the
target of the notify() is changed from ObjectWaiter::TS_WAIT to
ObjectWaiter::TS_ENTER unless Knob_MoveNotifyee == 4. The default
for Knob_MoveNotifyee == 2 so we're in non default mode here...

Here are the Knob_MoveNotifyee policy values:

    1717      if (Policy == 0) {       // prepend to EntryList
    1728      if (Policy == 1) {      // append to EntryList
    1744      if (Policy == 2) {      // prepend to cxq
    1760      if (Policy == 3) {      // append to cxq

For Knob_MoveNotifyee == 4 (or higher), we use the old mechanism
where we just unpark the target thread and let it run. Part of
that code changes from ObjectWaiter::TS_WAIT to ObjectWaiter::TS_RUN.

The code works the same for notifyAll() for the thread picked
to be notified. For the Knob_MoveNotifyee == 4 (or higher) case,
we just unpark all the waiters and we a free-for-all.

So it looks like the code block from lines 1569-1577 is never
used... or is it? Well... you have to remember two things:

1) spurious unpark()
2) timed wait()

The caller might have called wait(0), but that doesn't mean that
the underlying park() mechanism won't have a spurious unpark().
Or better, the caller might have called wait(1) and be running
again after a millisecond.

So in the HSX25 and older system (i.e., without Mr Simms fix for
8028280), it is possible for this call:

   1570           Thread::SpinAcquire (&_WaitSetLock, "WaitSet - unlink") ;

to consume the unpark(). The gauntlet that has to be traversed
to get to this call:

   4451               Thread::current()->_ParkEvent->park(1) ;

is impressive:

- fast-path acquisition of the _WaitSetLock has to fail:

   4436    if (Atomic::cmpxchg (1, adr, 0) == 0) {
   4437       return ;   // normal fast-path return
   4438    }

- if the machine is a uniprocessor, then 6 os::NakedYield()
   call-loop-recheck attempts have to fail:

   4447          if ((ctr & 0xFFF) == 0 || !os::is_MP()) {
   4448             if (Yields > 5) {
   4449               // Consider using a simple NakedSleep() instead.
   4450               // Then SpinAcquire could be called by non-JVM
threads
   4451               Thread::current()->_ParkEvent->park(1) ;
   4452             } else {
   4453               os::NakedYield() ;
   4454               ++Yields ;
   4455             }

- if the machine is a multi-processor, then 6 rounds of { 4095 SpinPause()
   attempts, 1 os::NakedYield() attempt}  have to fail:

   4446          ++ctr ;
   4447          if ((ctr & 0xFFF) == 0 || !os::is_MP()) {
   4448             if (Yields > 5) {
   4449               // Consider using a simple NakedSleep() instead.
   4450               // Then SpinAcquire could be called by non-JVM
threads
   4451               Thread::current()->_ParkEvent->park(1) ;
   4452             } else {
   4453               os::NakedYield() ;
   4454               ++Yields ;
   4455             }
   4456          } else {
   4457             SpinPause() ;
   4458          }

But it is possible. It is one of those once-in-a-blue moon type
windows where everything has to line up just so.

So how do we address this issue in HSX-25 and possibly older?

If Mr Simms fix for 8028280 is also backported, then there is no
issue. If it is not backported, then applying the fix for this
bug like so:

src/share/vm/runtime/objectMonitor.cpp:

   1596       if (JvmtiExport::should_post_monitor_waited()) {
   1597         JvmtiExport::post_monitor_waited(jt, this, ret ==
OS_TIMEOUT);
   1598       }

   1604       if (node._notified != 0 && _succ == Self) {
   1605         // In this part of the monitor wait-notify-reenter
protocol it
   1606         // is possible (and normal) for another thread to do a
fastpath
   1607         // monitor enter-exit while this thread is still trying
to get
   1608         // to the reenter portion of the protocol.
   1609         //
   1610         // The ObjectMonitor was notified and the current thread is
   1611         // the successor which also means that an unpark() has
already
   1612         // been done. The JVMTI_EVENT_MONITOR_WAITED event
handler can
   1613         // consume the unpark() that was done when the successor
was
   1614         // set because the same ParkEvent is shared between Java
   1615         // monitors and JVM/TI RawMonitors (for now).
   1616         //
   1617         // We redo the unpark() to ensure forward progress,
i.e., we
   1618         // don't want all pending threads hanging (parked) with
none
   1619         // entering the unlocked monitor.
   1620         node._event->unpark();
   1621       }

Of course the line numbers for the "fix" would be different and the comment
would need to be updated to reflect that the:

   1570           Thread::SpinAcquire (&_WaitSetLock, "WaitSet - unlink") ;

call above could also consume an unpark(), but it should work.

If you've read this far, then I'm impressed. If you've read this far
and only fallen asleep a couple of times, then I'm still impressed.

Summary: I don't think we have an issue in JDK9, but we'll have to do
          the fix in JDK8/HSX25 and older a little differently.

Dan


David
-----

at all. However, Thread::muxAcquire() does use a ParkEvent, but it
is a different ParkEvent. From src/share/vm/runtime/thread.hpp:

   ParkEvent * _ParkEvent ;               // for synchronized()
   ParkEvent * _SleepEvent ;              // for Thread.sleep
   ParkEvent * _MutexEvent ;              // for native internal
Mutex/Monitor
   ParkEvent * _MuxEvent ;                // for low-level
muxAcquire-muxRelease

So ObjectMonitor uses the _ParkEvent field and Thread::muxAcquire()
uses the _MuxEvent. There are some comments in thread.cpp about
how _MuxEvent could be eliminated and _ParkEvent shared, but I don't
think we ever want to go there.

I also filed this RFE:

     8033399 add a separate ParkEvent for JVM/TI RawMonitor use
https://bugs.openjdk.java.net/browse/JDK-8033399

just in case the Serviceability team wants to migrate JVM/TI RawMonitors
to a separate ParkEvent.

Please let me know if you concur that I've resolved issue #3.

If so, I wonder if we want this added unpark to not just be called if
JVMTI_EVENT_MONITOR_WAITED
is enabled?


I don't think we need it, but I've noted its removal as a risk.

Again, thanks for the review!

Dan


thanks,
Karen

On Feb 1, 2014, at 1:38 PM, Daniel D. Daugherty wrote:

Greetings,

I have a fix ready for the following bug:

    8028073 race condition in ObjectMonitor implementation causing
deadlocks
https://bugs.openjdk.java.net/browse/JDK-8028073

On the surface, this is a very simple fix that relocates a few
lines of
code, relocates and rewrites the comments associated with that code
and
adds several new comments.

Of course, in reality, the issue is much more complicated, but I'm
hoping to make it easy for anyone not acquainted with this issue to
understand what's going on.

Here are the JDK9 webrev URLs:

OpenJDK:
http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/

Oracle internal:
http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/


The simple summary:

- since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
  it is possible for a JVM/TI monitor event handler to accidentally
  consume a ParkEvent.unpark() call meant for Java Monitor layer
- the original code fix was made on 2005.07.04 using this bug ID:
https://bugs.openjdk.java.net/browse/JDK-5030359
- it's the right fix, but it's in the wrong place
- the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
  event handler is called because it is that event handler
  that can cause the hang


Testing
-------

- a new StessMonitorWait test has been created that reliably
  reproduces the hang in JDK[6789]; see the bug's gory details
  for the specific versions where the hang has been reproduced
  - the test reliably reproduces the hang in 5 seconds on my
    T7600 running Solaris 10u11 X86; 1 minute runs reproduce
    the hang reliably on other machines
  - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
    Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
    bits did not reproduce the hang
- JPRT test job
- VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
  Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
  and Windows-X64:
  - vm.quick
  - Kitchensink (bigapps)
  - Weblogic+medrec (bigapps)
  - runThese (bigapps)


The Gory Details Start Here
---------------------------

This is the old location of block of code that's being moved:

src/share/vm/runtime/objectMonitor.cpp:

1440 void ObjectMonitor::wait(jlong millis, bool interruptible,
TRAPS) {
<snip>
1499    exit (true, Self) ;                    // exit the monitor
<snip>
1513    if (node._notified != 0 && _succ == Self) {
1514       node._event->unpark();
1515    }


This is the new location of block of code that's being moved:

src/share/vm/runtime/objectMonitor.cpp:

1452 void ObjectMonitor::wait(jlong millis, bool interruptible,
TRAPS) {
<snip>
1601      if (JvmtiExport::should_post_monitor_waited()) {
1602        JvmtiExport::post_monitor_waited(jt, this, ret ==
OS_TIMEOUT);
<snip>
1604        if (node._notified != 0 && _succ == Self) {
<snip>
1620          node._event->unpark();
1621        }


The Risks
---------

- The code now executes only when the JVMTI_EVENT_MONITOR_WAITED event
  is enabled:
  - previously it was always executed
  - while the old code was not effective for the hang that is being
    fixed with this bug, it is possible that the old code prevented
    a different bug in the successor protocol from manifesting
  - thorough analysis of the successor protocol did not reveal a
    case where the old code was needed in the old location
- Thorough analysis indicates that the other JVM/TI monitor events
  do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
  - the successor protocol is complicated and the analysis could
    be wrong when certain options are used
  - comments were added to each location where a JVM/TI monitor
    event handler is called documenting why a fix like this one
    is not needed there
  - if the analysis is wrong, the new comments show where a new
    code change would be needed


The Scenario
------------

I've created a scenario that reproduces this hang:

T1 - enters monitor and calls monitor.wait()
T2 - enters the monitor, calls monitor.notify() and exits the monitor
T3 - enters and exits the monitor
T4 - enters the monitor, delays for 5 seconds, exits the monitor

A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
handler that: enters a raw monitor, waits for 1ms, exits a raw
monitor.

Here are the six events necessary to make this hang happen:

// KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ
field, but
// KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor
again.

// KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor
// KEY-EVENT-2b: is consumed by the JVM/TI event handler.

// KEY-EVENT-3a: T3 made T1 the successor
// KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed
it.

// KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
// KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.

// KEY-EVENT-5a: T4 sees that T1 is still the successor so
// KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)

// KEY-EVENT-6a: T1 is about to park and it is the successor, but
// KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event
handler
// KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be
stuck.


This bug is intertwined with:

- The ObjectMonitor successor protocol
- the sharing of a ParkEvent between Java Monitors and JVM/TI
RawMonitors

There is a very long successor.notes attachment to JDK-8028073 that
attempts to describe the ObjectMonitor successor protocol. It's good
for putting pretty much anyone to sleep.

Since this hang reproduces back to JDK6, this bug is taking the easily
backported solution of moving the original fix to the right location.
The following new bug has been filed for possible future work in this
area by the Serviceability Team:

    8033399 add a separate ParkEvent for JVM/TI RawMonitor use
https://bugs.openjdk.java.net/browse/JDK-8033399


The Symptoms
------------

With intermittent hangs like this, it is useful to know what to look
for in order to determine if you are running into this issue:

- if you aren't using a debugger or a profiler or some other
  JVM/TI agent, then this hang is not the same as yours
- if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
  event handler, then this hang is not the same as yours
- if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
  JVM/TI RawMonitors, then this hang is not the same as yours
- if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
  back into Java code, then you might just be insane and this
  hang might be similar to yours. However, using a Java callback
  in an event handler is an even bigger problem/risk so fix that
  first.
- if you one or more threads blocked like this and making no
  progress, then this hang might be the same as yours:

"T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting
for monitor e
ntry [0xfffffd7fc0231000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   JavaThread state: _thread_blocked
Thread: 0x00000000009ca800  [0x2f] State: _at_safepoint
_has_called_back 0 _at_p
oll_safepoint 0
   JavaThread state: _thread_blocked
        at java.lang.Object.wait(Native Method)
        - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
        at java.lang.Object.wait(Object.java:502)
        at SMW_WorkerThread.run(StressMonitorWait.java:103)
        - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)

"T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting
for monitor e
ntry [0xfffffd7fc0130000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   JavaThread state: _thread_blocked
Thread: 0x00000000009cc000  [0x30] State: _at_safepoint
_has_called_back 0 _at_p
oll_safepoint 0
   JavaThread state: _thread_blocked
        at SMW_WorkerThread.run(StressMonitorWait.java:120)
        - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)

"T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting
for monitor e
ntry [0xfffffd7fc002f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   JavaThread state: _thread_blocked
Thread: 0x00000000009ce000  [0x31] State: _at_safepoint
_has_called_back 0 _at_p
oll_safepoint 0
   JavaThread state: _thread_blocked
        at SMW_WorkerThread.run(StressMonitorWait.java:139)
        - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)

Key symptoms in thread T1:

- had the object locked:

  locked <0xfffffd7e6a2b6ff0> (a java.lang.String)

- did an Object.wait():

  waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)

- is blocked on reentry:

  waiting for monitor entry [0xfffffd7fc0231000]

Key symtoms in thread T2:

- is blocked waiting to lock the object:

  waiting for monitor entry [0xfffffd7fc0130000]
  waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)

Key symtoms in thread T3:

- is blocked waiting to lock the object:

  waiting for monitor entry [0xfffffd7fc002f000]
  waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)

Re: code review round 0 for ObjectMonitor-JVM/TI hang fix (8028073)

Reply via email to