Re: RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty Thu, 27 Aug 2015 14:42:13 -0700

Sorry for starting another e-mail thread fork in an already complicated
review...


I was mulling on this piece of the thread over lunch (my TZ):

> Right. To sum up: the basic problem is that the PerfData objects
> are deallocated at the safepoint established for VM termination,
> but those objects can actually be used by threads that are in a
> safepoint-safe state: in particular within the low-level
> synchronization code.

and that got me thinking about safepoints and VM exit/shutdown/termination.

The problem that I'm trying to solve is the one where we have a test
that has passed for all intents and purposes. It has logged success
in a message to stdout, stderr, or a log file. The test is exiting
with a successful exit code (exit_code == 0 for some, exit_code == 95
for others)... On the way out the door, the test crashes with a SIGSEGV.

In these failures, the VM was doing a normal shutdown. In the case
where main() has fallen off the end, jni_DestroyJavaVM() is called.
In the case where System.exit() is called from Java then vm_exit()
is called.

src/share/vm/runtime/java.cpp:

503  void vm_exit(int code) {
<snip>
511    if (VMThread::vm_thread() != NULL) {
512      // Fire off a VM_Exit operation to bring VM to a safepoint and exit
513      VM_Exit op(code);
514      if (thread->is_Java_thread())
515 ((JavaThread*)thread)->set_thread_state(_thread_in_vm);
516      VMThread::execute(&op);

src/share/vm/runtime/vm_operations.cpp:

443  void VM_Exit::doit() {
<snip>
457    exit_globals();

exit_globals() calls perfMemory_exit() which calls
PerfDataManager::destroy() which sets the flag that
says there is no more PerfData...

So a System.exit() call results in setting the new
flag at a safepoint.



The jni_DestroyJavaVM() is more complicated:

src/share/vm/prims/jni.cpp:

4081  jint JNICALL jni_DestroyJavaVM(JavaVM *vm) {
<snip>
4105    if (Threads::destroy_vm()) {

src/share/vm/runtime/thread.cpp:

3890  bool Threads::destroy_vm() {
<snip>

3896    // Wait until we are the last non-daemon thread to execute
3897    { MutexLocker nu(Threads_lock);
3898      while (Threads::number_of_non_daemon_threads() > 1)

After this while-loop there are no non-daemon JavaThreads.


3926    // Stop VM thread.
3927    {
3928      // 4945125 The vm thread comes to a safepoint during exit.
3929      // GC vm_operations can get caught at the safepoint, and the
3930      // heap is unparseable if they are caught. Grab the Heap_lock
3931      // to prevent this. The GC vm_operations will not be able to
3932      // queue until after the vm thread is dead. After this point,
3933      // we'll never emerge out of the safepoint before the VM exits.
3934
3935      MutexLocker ml(Heap_lock);
3936
3937      VMThread::wait_for_vm_thread_exit();


During VMThread exit, the system was brought to a safepoint. All
daemon JavaThreads are at a safepoint.

  3966    // exit_globals() will delete tty
  3967    exit_globals();

exit_globals() calls perfMemory_exit() which calls
PerfDataManager::destroy() which sets the flag that
says there is no more PerfData...

So a jni_DestroyJavaVM() call results in setting the
flag at a safepoint.


Whew!!! So the two "normal" exit paths both set the new flag
while the system is at a safepoint (as David H said). The
remaining question is whether the Java monitor PerfData
usage can happen in parallel while the system is at a
safepoint.

Fortunately, I converted all of the Java monitor subsystem's
usage of PerfData to use a new macro called OM_PERFDATA_OP
so I can easily visit them all:

src/share/vm/runtime/objectMonitor.cpp: ObjectMonitor::enter()
   OM_PERFDATA_OP(ContendedLockAttempts, inc()) at the end to
   record a contended monitor enter. The thread is not seen as
   safepoint-safe at this point so no conflict.

src/share/vm/runtime/objectMonitor.cpp: ObjectMonitor::EnterI()
   OM_PERFDATA_OP(FutileWakeups, inc()) after returning from
   a park() call and not being able to acquire the lock. The
   thread is in state _thread_blocked so the thread is seen as
   safepoint-safe so we may have a conflict.

src/share/vm/runtime/objectMonitor.cpp: ObjectMonitor::ReenterI()
   OM_PERFDATA_OP(FutileWakeups, inc()) after returning from
   a park() call and not being able to acquire the lock. The
   thread is in state _thread_blocked so the thread is seen as
   safepoint-safe so we may have a conflict.

src/share/vm/runtime/objectMonitor.cpp: ObjectMonitor::ExitEpilog()
   OM_PERFDATA_OP(Parks, inc()) at the end of the function after
   releasing ownership of the monitor. The thread is not seen as
   safepoint-safe at this point so no conflict.

src/share/vm/runtime/objectMonitor.cpp: ObjectMonitor::notify()
   OM_PERFDATA_OP(Notifications, inc(1)) at the end of the function
   after internal INotify() has returned. The thread is not seen as
   safepoint-safe at this point so no conflict.

src/share/vm/runtime/objectMonitor.cpp: ObjectMonitor::notifyAll()
   OM_PERFDATA_OP(Notifications, inc(tally)) at the end of the function
   after internal INotify() while-loop has finished. The thread is not
   seen as safepoint-safe at this point so no conflict.

src/share/vm/runtime/synchronizer.cpp: ObjectSynchronizer::quick_notify()
   OM_PERFDATA_OP(Notifications, inc(tally)) after internal INotify()
   has been called on one or more monitors. The thread is not seen as
   safepoint-safe at this point so no conflict.

src/share/vm/runtime/synchronizer.cpp: ObjectSynchronizer::inflate()
   OM_PERFDATA_OP(Inflations, inc()) in two places after we've
   successfully inflated a stack lock into an ObjectMonitor; The
   thread is not seen as safepoint-safe at this point so no conflict.

src/share/vm/runtime/synchronizer.cpp:ObjectSynchronizer::deflate_idle_monitors()

   OM_PERFDATA_OP(Deflations, inc(nScavenged)) and
   OM_PERFDATA_OP(MonExtant, set_value(nInCirculation)) after we've
   scavenged the global list. This function is only called at a
   safepoint as a periodic cleanup task. No conflict because the
   periodic task stuff is already disabled.


So the only two conflicts that I see are the futile wakeup in EnterI()
and ReenterI(). In both of those cases, the thread was in park() and
had to be spuriously unpark()'ed or intentionally unpark()'ed after
being chosen as the successor for a contended monitor. Additionally,
the associated monitor has be held by another thread which is why
this is called a futile wakeup.

This brings us full circle... The futile wakeup case happens to be
the original sighting that motivated me to file this bug back on
2014.07.03. The EnterI() futile wakeup is also the case that Kim's
debug repro code tickles for this bug. Kim's repro case for this
bug requires a CTRL-C, but I think that results in a mostly orderly
shutdown of the VM. Since the test case requires an interactive
java session, I tested the fix with delays in place for 50 runs
without a failure two different times. So the EnterI() futile
wakeup case is tested with delays in place and I can't get the
bug to reproduce anymore. Without the fix and with the delays,
the crash happens everytime.

For the other bug that Kim did repro code for: JDK-8129978, that
crash happened in the ObjectMonitor inflation code path and that
case is covered by the OM_PERFDATA_OP(Inflations, inc()) notes
above. The flag is set at a safepoint, the safepoint ends, the
stack lock is inflated into an ObjectMonitor while the system
is not at a safepoint, the flag is seen and the PerfData is not
used so no crash. I did a 1000 runs with the delays in place to
verify that the repro for JDK-8129978 no longer crashes.


Short version (if you read this far, you should be laughing at this :-))

I still think this is a good "fix" for this problem. I'm not yet
convinced that we need to take the path where we "leak" PerfData
objects.

Thanks for reading this far... :-)

Dan



On 8/26/15 10:26 PM, David Holmes wrote:

Hi Dan,

On 26/08/2015 7:08 AM, Daniel D. Daugherty wrote:
Greetings,

I have a "fix" for a long standing race between JVM shutdown and the
JVM statistics subsystem:

JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
https://bugs.openjdk.java.net/browse/JDK-8049304
Webrev URL:http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
Testing: Aurora Adhoc RT-SVC nightly batch
          Aurora Adhoc vm.tmtools batch
          Kim's repro sequence for JDK-8049304
          Kim's repro sequence for JDK-8129978
          JPRT -testset hotspot

This "fix":

- adds a volatile flag to record whether PerfDataManager is holding
   data (PerfData objects)
- adds PerfDataManager::has_PerfData() to return the flag
- changes the Java monitor subsystem's use of PerfData to
   check both allocation of the monitor subsystem specific
   PerfData object and the new PerfDataManager::has_PerfData()
   return value

If the global 'UsePerfData' option is false, the system works as
it did before. If 'UsePerfData' is true (the default on non-embedded
systems), the Java monitor subsystem will allocate a number of
PerfData objects to record information. The objects will record
information about Java monitor subsystem until the JVM shuts down.

When the JVM starts to shutdown, the new PerfDataManager flag will
change to false and the Java monitor subsystem will stop using the
PerfData objects. This is the new behavior. As noted in the comments
I added to the code, the race is still present; I'm just changing
the order and the timing to reduce the likelihood of the crash.
Right. To sum up: the basic problem is that the PerfData objects aredeallocated at the safepoint established for VM termination, but thoseobjects can actually be used by threads that are in a safepoint-safestate: in particular within the low-level synchronization code.
As you say this fix narrows the window where a crash can occur, butcan not close it. If a thread is descheduled after the check ofhasPerfData it can still access the PerfData object when it resumes,which may be after the object was deallocated. There's no true fixhere without introducing synchronization (which would have to be evenlower-level to avoid reentrant use of the same code we're fixing!) andthe overhead of that would be prohibitive for these perf counters.
In response to Kim's concern about other code that uses PerfDataobjects I think you would have to examine those uses to see which, ifany, can occur from either a non-JavaThread, or from within the codewhere a thread is considered safepoint-safe. I'm inclined to agreethat given we have not seen issues with such code, either it does notexist or is extremely unlikely to hit this issue. Given the "fix" isitself only narrowing the window it doesn't seem necessary to addresscode that already has a narrower window.
That all said "leaking" the PerfData objects seems no less unpleasanta "fix". There are so many obstacles in the way of being able tounload and re-load the JVM that I do not think this makes the positionmeasurably worse. In fact I can imagine that if we were to allow forsuch behaviour we would need to be able to terminate threads andreclaim all their resources (like Monitor instances), at which pointit would also become easy to deallocate shared memory like PerfDataobjects.
I'll leave it up to you which way to go. As it stands this is Reviewed.

Thanks,
David
Thanks, in advance, for any comments, questions or suggestions.

Dan

Re: RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Reply via email to