Re: RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty Fri, 28 Aug 2015 11:11:38 -0700

On 8/28/15 11:57 AM, Tom Benson wrote:

Hi Dan,


On 8/28/2015 12:27 PM, Daniel D. Daugherty wrote:

On 8/28/15 8:45 AM, Tom Benson wrote:

Hi,
One more pair of eyes on this.  8^)


Hi Tom!

Thanks for reviewing and welcome to the party...


On 8/27/2015 8:16 PM, Kim Barrett wrote:

On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty<daniel.daughe...@oracle.com> wrote:

Sorry for starting another e-mail thread fork in an alreadycomplicated
review...

OK, that was fascinating.  No, really, I mean it.

It made me realize that we've been arguing and talking past each other
in part because we're really dealing with two distinct though closely
related bugs here.

I've been primarily thinking about the case where we're calling
vm_abort / os::abort, where the we presently delete the PerfData
memory even though there can be arbitrary other threads running.  This
was the case in JDK-8129978, which is how I got involved here in the
first place.  In that bug we were in vm_exit_during_initialization and
had called perfMemory_exit when some thread attempted to inflate a
monitor (which is not one of the conflicting cases discussed by Dan).

The problem Dan has been looking at, JDK-8049304, is about a "normal"
VM shutdown.  In this case, the problem is that we believe it is safe
to delete the PerfData, because we've safepointed, and yet some thread
unexpectedly runs and attempts to touch the deleted data anyway.

I think Dan's proposed fix (mostly) avoids the specific instance of
JDK-8129978, but doesn't solve the more general problem of abnormal
exit deleting the PerfData while some running thread is touching some
non-monitor-related part of that data.  My proposal to leave it to the
OS to deal with memory cleanup on process exit would deal with this
case.

I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
And the approach I've been talking about doesn't help at all for this
case.  But I wonder if Dan's proposed fix can be improved. A "futile
wakeup" case doesn't seem to me like one which requires super-high
performance.  Would it be ok, in the two problematic cases that Dan
identified, to use some kind of atomic / locking protocol with the
cleanup?  Or is the comment for the counter increment in EnterI (and
only there) correct that it's important to avoid a lock or atomics
here (and presumably in ReenterI too).

I notice that EnteriI/ReenterI both end with OrderAccess::fence().Can the potential update of _sync_FutileWakeups be delayed untilthat point, to take advantage of the fence to make the sync holeeven smaller?


Not easily with EnterI() since there is one optional optimization
between the OM_PERFDATA_OP(FutileWakeups, inc()) call and the
OrderAccess::fence() call and that would result in lost FutileWakeups
increments.

Not easily in ReenterI(), the OM_PERFDATA_OP(FutileWakeups, inc()) call
is at the bottom of the for-loop and the OrderAccess::fence() call at
the end of the function is outside the loop. This would result in lost
FutileWakeups increments.

Yes, you'd have to keep a local count, and then add the total outsidethe loop for both Enter/Reenter, after the fence. But I see what youmean about the other exit paths in Enter. (The more I look at thiscode, the more I remember it...


I'm sure that is a wonderfully loving memory too!

BTW, are those knob_ setting defaults ever going to be moved to aplatform specific-module? That was my beef (well, one of them) in 2different ports. Or is the goal to make monitors so well self-tuningthat they can go away? Sorry for the digression... 8^))


Dice is working on another idea to move tuning to a separate loadable
module which is why we deferred the "adaptive spin" and "SpinPause on
SPARC" buckets for the Contended Locking project.

At any rate, as you say, perhaps it's not worth it to leverage thefences, though it could be done.


OK so we're agreed on no change here.

So in ReenterI() the OM_PERFDATA_OP(FutileWakeups, inc()) callimmediatelyfollows an OrderAccess::fence() call. Doesn't that make thatincrement as
"safe" as it can be without having a real lock?
You've got a release() (and and short nap!) with the store inPerfDataManager::destroy() to try to close the window somewhat.
Yes, I modeled that after:

src/share/vm/runtime/perfMemory.cpp:

    83  void PerfMemory::initialize() {
<snip>
   156    OrderAccess::release_store(&_initialized, 1);
   157  }
But I think rather than the release_store() you used, you want astore, followed by a release(). release_store() puts a fencebefore the store to ensure earlier updates are seen before thecurrent one, no?
Yup, and I see I got my reasoning wrong. The code I modeled
is right because you want to flush all the inits and it's OK
if the _initialized transition from '0' -> '1' is lazily seen.

For my shutdown use, we are transitioning from '1' -> '0' and
we need that to be seen proactively so:

    OrderAccess::release_store(&_has_PerfData, 0);
    OrderAccess::storeload();

which is modeled after _owner field transitions from non-zero
-> NULL in ObjectMonitor.cpp
It's not clear to me why the store needs to be a release_store in thiscase, as long as the storeload() follows it. You're not protectingany earlier writes. ?


I'm following the model that Dice uses for ObjectMonitors when we
change the _owner field from non-NULL -> NULL. There are some long
comments in the ObjectMonitor.cpp stuff that talk about why it is
done this way. I'm planning to file a cleanup bug for the non-NULL
-> NULL transition stuff because the comments and code are not
consistent.

Also, I think the comment above that release_store() could beclarified. It is fine as is if you're familiar with this bug reportand discussion, but... I think it should explicitly say there isstill a very small window for the lack of true synchronization tocause a failure. And perhaps that the release_store() (orstore/release()) is not half of an acquire/release pair.
Here's the existing comment:
286 // Clear the flag before we free the PerfData counters.Thus begins287 // the race between this thread and another thread that hasjust288 // queried PerfDataManager::has_PerfData() and gotten back'true'.
   289    // The hope is that the other thread will finish its PerfData
290 // manipulation before we free the memory. The twoalternatives291 // are 1) leak the PerfData memory or 2) do some form ofordered
   292    // access before every PerfData operation.

I think it pretty clearly states that there is still a race here.
And I think that option 2 covers that we're not doing completely
safe ordered access. I'm not sure how to make this comment more
clear, but if you have specific suggestions...
OK, I guess it's subjective.


I'll always take wording tweaks so if something occurs to you
later on... :-)

Dan

Tom

Dan

Tom

Re: RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Reply via email to