Dan,

Greatly impressed. Very subtle. Thanks  to you and David and Dice and Serguei
for a lot of blood, sweat and tears.

Apologies that I haven't been following all the details, so a couple of 
questions.

1. I like the new fix. I totally appreciate the massive documentation in the 
bug, it really helps.
And the comments in the code.

2. In removing the earlier fix, does that work because
   - there is no posting of monitor_contended_exit (or any posting) during
    the exit logic? Or did I just miss finding it?

3. Did I read the code correctly that the Thread::SpinAcquire can make a timed 
park
call on the same thread's _ParkEvent? And that this is used to get on and off 
the wait queue,
i.e. to acquire the WaitSetLock?
   Is there the same risk that a notify might be eaten here also?

If so, I wonder if we want this added unpark to not just be called if 
JVMTI_EVENT_MONITOR_WAITED
is enabled?

thanks,
Karen

On Feb 1, 2014, at 1:38 PM, Daniel D. Daugherty wrote:

> Greetings,
> 
> I have a fix ready for the following bug:
> 
>    8028073 race condition in ObjectMonitor implementation causing deadlocks
>    https://bugs.openjdk.java.net/browse/JDK-8028073
> 
> On the surface, this is a very simple fix that relocates a few lines of
> code, relocates and rewrites the comments associated with that code and
> adds several new comments.
> 
> Of course, in reality, the issue is much more complicated, but I'm
> hoping to make it easy for anyone not acquainted with this issue to
> understand what's going on.
> 
> Here are the JDK9 webrev URLs:
> 
> OpenJDK:
> http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/
> 
> Oracle internal:
> http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/
> 
> The simple summary:
> 
> - since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
>  it is possible for a JVM/TI monitor event handler to accidentally
>  consume a ParkEvent.unpark() call meant for Java Monitor layer
> - the original code fix was made on 2005.07.04 using this bug ID:
>  https://bugs.openjdk.java.net/browse/JDK-5030359
> - it's the right fix, but it's in the wrong place
> - the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
>  event handler is called because it is that event handler
>  that can cause the hang
> 
> 
> Testing
> -------
> 
> - a new StessMonitorWait test has been created that reliably
>  reproduces the hang in JDK[6789]; see the bug's gory details
>  for the specific versions where the hang has been reproduced
>  - the test reliably reproduces the hang in 5 seconds on my
>    T7600 running Solaris 10u11 X86; 1 minute runs reproduce
>    the hang reliably on other machines
>  - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
>    Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
>    bits did not reproduce the hang
> - JPRT test job
> - VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
>  Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
>  and Windows-X64:
>  - vm.quick
>  - Kitchensink (bigapps)
>  - Weblogic+medrec (bigapps)
>  - runThese (bigapps)
> 
> 
> The Gory Details Start Here
> ---------------------------
> 
> This is the old location of block of code that's being moved:
> 
> src/share/vm/runtime/objectMonitor.cpp:
> 
> 1440 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
> <snip>
> 1499    exit (true, Self) ;                    // exit the monitor
> <snip>
> 1513    if (node._notified != 0 && _succ == Self) {
> 1514       node._event->unpark();
> 1515    }
> 
> 
> This is the new location of block of code that's being moved:
> 
> src/share/vm/runtime/objectMonitor.cpp:
> 
> 1452 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
> <snip>
> 1601      if (JvmtiExport::should_post_monitor_waited()) {
> 1602        JvmtiExport::post_monitor_waited(jt, this, ret == OS_TIMEOUT);
> <snip>
> 1604        if (node._notified != 0 && _succ == Self) {
> <snip>
> 1620          node._event->unpark();
> 1621        }
> 
> 
> The Risks
> ---------
> 
> - The code now executes only when the JVMTI_EVENT_MONITOR_WAITED event
>  is enabled:
>  - previously it was always executed
>  - while the old code was not effective for the hang that is being
>    fixed with this bug, it is possible that the old code prevented
>    a different bug in the successor protocol from manifesting
>  - thorough analysis of the successor protocol did not reveal a
>    case where the old code was needed in the old location
> - Thorough analysis indicates that the other JVM/TI monitor events
>  do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
>  - the successor protocol is complicated and the analysis could
>    be wrong when certain options are used
>  - comments were added to each location where a JVM/TI monitor
>    event handler is called documenting why a fix like this one
>    is not needed there
>  - if the analysis is wrong, the new comments show where a new
>    code change would be needed
> 
> 
> The Scenario
> ------------
> 
> I've created a scenario that reproduces this hang:
> 
> T1 - enters monitor and calls monitor.wait()
> T2 - enters the monitor, calls monitor.notify() and exits the monitor
> T3 - enters and exits the monitor
> T4 - enters the monitor, delays for 5 seconds, exits the monitor
> 
> A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
> handler that: enters a raw monitor, waits for 1ms, exits a raw monitor.
> 
> Here are the six events necessary to make this hang happen:
> 
> // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ field, but
> // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor again.
> 
> // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor
> // KEY-EVENT-2b: is consumed by the JVM/TI event handler.
> 
> // KEY-EVENT-3a: T3 made T1 the successor
> // KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed it.
> 
> // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
> // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.
> 
> // KEY-EVENT-5a: T4 sees that T1 is still the successor so
> // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)
> 
> // KEY-EVENT-6a: T1 is about to park and it is the successor, but
> // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event handler
> // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be stuck.
> 
> 
> This bug is intertwined with:
> 
> - The ObjectMonitor successor protocol
> - the sharing of a ParkEvent between Java Monitors and JVM/TI RawMonitors
> 
> There is a very long successor.notes attachment to JDK-8028073 that
> attempts to describe the ObjectMonitor successor protocol. It's good
> for putting pretty much anyone to sleep.
> 
> Since this hang reproduces back to JDK6, this bug is taking the easily
> backported solution of moving the original fix to the right location.
> The following new bug has been filed for possible future work in this
> area by the Serviceability Team:
> 
>    8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>    https://bugs.openjdk.java.net/browse/JDK-8033399
> 
> 
> The Symptoms
> ------------
> 
> With intermittent hangs like this, it is useful to know what to look
> for in order to determine if you are running into this issue:
> 
> - if you aren't using a debugger or a profiler or some other
>  JVM/TI agent, then this hang is not the same as yours
> - if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
>  event handler, then this hang is not the same as yours
> - if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
>  JVM/TI RawMonitors, then this hang is not the same as yours
> - if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
>  back into Java code, then you might just be insane and this
>  hang might be similar to yours. However, using a Java callback
>  in an event handler is an even bigger problem/risk so fix that
>  first.
> - if you one or more threads blocked like this and making no
>  progress, then this hang might be the same as yours:
> 
> "T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting for 
> monitor e
> ntry [0xfffffd7fc0231000]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>   JavaThread state: _thread_blocked
> Thread: 0x00000000009ca800  [0x2f] State: _at_safepoint _has_called_back 0 
> _at_p
> oll_safepoint 0
>   JavaThread state: _thread_blocked
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>        at java.lang.Object.wait(Object.java:502)
>        at SMW_WorkerThread.run(StressMonitorWait.java:103)
>        - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 
> "T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting for 
> monitor e
> ntry [0xfffffd7fc0130000]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>   JavaThread state: _thread_blocked
> Thread: 0x00000000009cc000  [0x30] State: _at_safepoint _has_called_back 0 
> _at_p
> oll_safepoint 0
>   JavaThread state: _thread_blocked
>        at SMW_WorkerThread.run(StressMonitorWait.java:120)
>        - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 
> "T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting for 
> monitor e
> ntry [0xfffffd7fc002f000]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>   JavaThread state: _thread_blocked
> Thread: 0x00000000009ce000  [0x31] State: _at_safepoint _has_called_back 0 
> _at_p
> oll_safepoint 0
>   JavaThread state: _thread_blocked
>        at SMW_WorkerThread.run(StressMonitorWait.java:139)
>        - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 
> Key symptoms in thread T1:
> 
> - had the object locked:
> 
>  locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 
> - did an Object.wait():
> 
>  waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 
> - is blocked on reentry:
> 
>  waiting for monitor entry [0xfffffd7fc0231000]
> 
> Key symtoms in thread T2:
> 
> - is blocked waiting to lock the object:
> 
>  waiting for monitor entry [0xfffffd7fc0130000]
>  waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 
> Key symtoms in thread T3:
> 
> - is blocked waiting to lock the object:
> 
>  waiting for monitor entry [0xfffffd7fc002f000]
>  waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
> 

Reply via email to