Re: RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty Thu, 27 Aug 2015 08:52:16 -0700

Hi David!

Thanks for chiming in on this thread!


Replies embedded below as usual...


On 8/26/15 10:26 PM, David Holmes wrote:

Hi Dan,

On 26/08/2015 7:08 AM, Daniel D. Daugherty wrote:
Greetings,

I have a "fix" for a long standing race between JVM shutdown and the
JVM statistics subsystem:

JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
https://bugs.openjdk.java.net/browse/JDK-8049304
Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
Testing: Aurora Adhoc RT-SVC nightly batch
          Aurora Adhoc vm.tmtools batch
          Kim's repro sequence for JDK-8049304
          Kim's repro sequence for JDK-8129978
          JPRT -testset hotspot

This "fix":

- adds a volatile flag to record whether PerfDataManager is holding
   data (PerfData objects)
- adds PerfDataManager::has_PerfData() to return the flag
- changes the Java monitor subsystem's use of PerfData to
   check both allocation of the monitor subsystem specific
   PerfData object and the new PerfDataManager::has_PerfData()
   return value
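To make the shape of the check concrete, here is a minimal standalone sketch of the guard described in the list above. It is not the HotSpot code: SketchCounter, SketchPerfDataManager, set_has_PerfData() and guarded_inc() are hypothetical names, and std::atomic stands in for the volatile flag and the VM's own ordering primitives.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a PerfData counter; not the HotSpot class.
struct SketchCounter {
    std::atomic<std::int64_t> value{0};
    void inc() { value.fetch_add(1, std::memory_order_relaxed); }
};

// Hypothetical stand-in for PerfDataManager: a single flag that says
// whether the counters are currently allocated and usable.
class SketchPerfDataManager {
    static std::atomic<bool> _has_PerfData;
public:
    static bool has_PerfData() {
        return _has_PerfData.load(std::memory_order_acquire);
    }
    // Assumed setter for the sketch; called at creation and at shutdown.
    static void set_has_PerfData(bool v) {
        _has_PerfData.store(v, std::memory_order_release);
    }
};
std::atomic<bool> SketchPerfDataManager::_has_PerfData{false};

// The two-part check: the subsystem's own counter must have been
// allocated AND the manager must still be holding PerfData.
// Returns true if the counter was updated.
inline bool guarded_inc(SketchCounter* c) {
    if (c != nullptr && SketchPerfDataManager::has_PerfData()) {
        // A thread descheduled right here can still run inc() after
        // shutdown frees the counter; the guard narrows the window only.
        c->inc();
        return true;
    }
    return false;
}
```

The guard costs one extra load per update, which is why it is cheap enough for these counters, and also why it cannot close the race.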

If the global 'UsePerfData' option is false, the system works as
it did before. If 'UsePerfData' is true (the default on non-embedded
systems), the Java monitor subsystem will allocate a number of
PerfData objects to record information. The objects will record
information about the Java monitor subsystem until the JVM shuts down.

When the JVM starts to shut down, the new PerfDataManager flag will
change to false and the Java monitor subsystem will stop using the
PerfData objects. This is the new behavior. As noted in the comments
I added to the code, the race is still present; I'm just changing
the order and the timing to reduce the likelihood of the crash.
Right. To sum up: the basic problem is that the PerfData objects are deallocated at the safepoint established for VM termination, but those objects can actually be used by threads that are in a safepoint-safe state: in particular within the low-level synchronization code.
As you say this fix narrows the window where a crash can occur, but cannot close it. If a thread is descheduled after the check of hasPerfData it can still access the PerfData object when it resumes, which may be after the object was deallocated. There's no true fix here without introducing synchronization (which would have to be even lower-level to avoid reentrant use of the same code we're fixing!) and the overhead of that would be prohibitive for these perf counters.
In response to Kim's concern about other code that uses PerfData objects I think you would have to examine those uses to see which, if any, can occur from either a non-JavaThread, or from within the code where a thread is considered safepoint-safe. I'm inclined to agree that given we have not seen issues with such code, either it does not exist or is extremely unlikely to hit this issue. Given the "fix" is itself only narrowing the window it doesn't seem necessary to address code that already has a narrower window.
That all said "leaking" the PerfData objects seems no less unpleasant a "fix". There are so many obstacles in the way of being able to unload and re-load the JVM that I do not think this makes the position measurably worse. In fact I can imagine that if we were to allow for such behaviour we would need to be able to terminate threads and reclaim all their resources (like Monitor instances), at which point it would also become easy to deallocate shared memory like PerfData objects.


Here's what I wrote in the bug report before I started this review cycle:

Daniel Daugherty added a comment - 2015-08-21 20:40
Continued investigating VM shutdown race:

JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
JDK-8129978 SIGSEGV when parsing command line options

   - Thanks to Kim for providing easy reproduction instructions for
     both bugs; I've tweaked the repro code a bit
   - The "correct" solution is to add a locking/memory ordering
     mechanism to ensure that PerfData is only used when valid.
     The locking/memory ordering would slow down the PerfData
     mechanism for every update. Ouch!
   - The "fast but safe" solution is to leak the PerfData memory
     and not clean them up at VM shutdown. We're trying to clean
     up the code base so the idea of intentionally leaking memory
     makes me cringe.
   - The solution I'm investigating is between "fast but safe"
     and "correct". I'm adding a PerfDataManager.has_PerfData()
     function that returns true when PerfDataManager is holding
     PerfData objects and false when none have been allocated
     or when they have been freed at VM shutdown. The flag
     holding the state is volatile and I use release_store()
     to change it so that publication is visible more quickly.
     On the VM shutdown path, I also do a 1ms sleep after setting
     the flag and before freeing the memory.
   - The idea is that the 1ms sleep will give any threads that
     saw PerfDataManager.has_PerfData() == true a chance to do
     their operation on the PerfData object before the VM shutdown
     thread frees it.
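A rough standalone sketch of that shutdown ordering follows. It is an illustration only, not the webrev: the namespace, create(), destroy() and futile_wakeups are hypothetical names, and a std::atomic store with release ordering stands in for release_store().

```cpp
#include <atomic>
#include <chrono>
#include <thread>

namespace shutdown_sketch {

// Flag published to counter users; hypothetical stand-in for the
// state behind has_PerfData().
std::atomic<bool> has_perf_data{false};

// Hypothetical counter standing in for a PerfData object.
std::atomic<long>* futile_wakeups = nullptr;

// Allocate the counter first, then publish the flag.
void create() {
    futile_wakeups = new std::atomic<long>(0);
    has_perf_data.store(true, std::memory_order_release);
}

// Shutdown path: clear the flag, pause briefly, then free.
void destroy() {
    if (!has_perf_data.load(std::memory_order_acquire)) {
        return;  // nothing allocated, or already destroyed
    }
    // 1. Tell users to stop touching the counters.
    has_perf_data.store(false, std::memory_order_release);
    // 2. Grace period for threads that already saw 'true'.
    //    This is best effort: a thread stalled for longer than
    //    1ms can still touch freed memory afterwards.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    // 3. Free the memory.
    delete futile_wakeups;
    futile_wakeups = nullptr;
}

}  // namespace shutdown_sketch
```

The sleep is a timing heuristic, not synchronization, so the sketch has the same residual race as the approach it illustrates.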


So I think we're all roughly on the same page here:

1) We don't like the current system because we keep getting
   these shutdown race crashes. Of course, a new one came
   in early this AM:

   JDK-8134566 java/lang/invoke/LFCaching/LFMultiThreadCachingTest.java
               crashes in monitor synchronization code
   https://bugs.openjdk.java.net/browse/JDK-8134566

2) We don't like the "correct" solution because it would slow down
   the performance counters and possibly skew the very data we are
   trying to gather. Kim has also pointed out that adding more
   locking in a subsystem used by higher level locking is risky.

3) We don't like the "fast but safe" solution of leaking the PerfData
   memory. We try to make ourselves feel better about this by saying
   there are plenty of other leaks in the VM... slippery slope?

4) We don't like the proposed solution because the race still exists
   and we could continue to see failures like these. Only they would
   be more rare and possibly harder to spot.

5) Off-thread Kim and I have been talking about adding logic to the
   signal handler filters to detect a SIGSEGV that comes from use of
   a now freed PerfData object. We're mulling over the idea, but have
   not determined if it is even possible or an acceptable idea...

Hopefully, the above accurately sums up our options...

I'll leave it up to you which way to go. As it stands this is Reviewed.


Thanks!

Here's my proposed plan:

1) I'd like to move forward with this change in order to reduce the
   occurrences of this crasher. Yes, I'm getting tired of seeing and
   analyzing them.

2) If we see PerfData crashes in the future with non-monitor subsystem
   PerfData usage, then we look at adding has_PerfData() calls to that
   subsystem.

3) If we see PerfData crashes in the future in the monitor subsystem,
   then that indicates that the theoretical race is real or I missed
   protecting a PerfData usage with has_PerfData(). If the race is
   real, then we examine these alternatives:

   - leak the PerfData objects on the JVM shutdown path, i.e.,
     switch to the "fast but safe" solution
   - add signal handler support to make PerfData SIGSEGVs benign

What do folks think?

Dan


Thanks,
David

Thanks, in advance, for any comments, questions or suggestions.

Dan
