Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Robbin Ehn Tue, 31 Mar 2020 07:23:18 -0700

Thanks for cleaning up thread.hpp!

/Robbin


On 2020-03-30 10:31, Reingruber, Richard wrote:

Hi,

this is webrev.5 based on Robbin's feedback and Martin's review - thanks! :)

The change affects jvmti, hotspot and c2. Partial reviews are very welcome too.

Full:  http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/
Delta: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5.inc/

Robbin, Martin, please let me know, if anything shouldn't be quite as you 
wanted it. Also find my
comments on your feedback below.

Robbin, can I count you as Reviewer for the runtime part?

Thanks, Richard.

--

DeoptimizeObjectsALotThread is only used in compileBroker.cpp.
You can move both declaration and definition to that file, no need to clobber
thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry)


Done.

Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in it's own
hpp file? It doesn't seem right to add JVM TI classes into thread.hpp.


I moved JvmtiDeferredUpdates to vframe_hp.hpp where preexisting 
jvmtiDeferredLocalVariableSet is
declared.

src/hotspot/share/code/compiledMethod.cpp
Nice cleanup!


Thanks :)

src/hotspot/share/code/debugInfoRec.cpp
src/hotspot/share/code/debugInfoRec.hpp
Additional parmeters. (Remark: I think "non_global_escape_in_scope" would read better 
than "not_global_escape_in_scope", but your version is consistent with existing code, so 
no change request from my side.) Ok.


I've been thinking about this too and finally stayed with 
not_global_escape_in_scope. It's supposed
to mean an object whose escape state is not GlobalEscape is in scope.

src/hotspot/share/compiler/compileBroker.cpp
src/hotspot/share/compiler/compileBroker.hpp
Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a 
follow up change together with the test in order to make this webrev smaller, 
but since it is included, I'm reviewing everything at once. Not a big deal.) Ok.


Yes the change would be a little smaller. And if it helps I'll split it off. In 
general I prefer
patches that bring along a suitable amount of tests.

src/hotspot/share/opto/c2compiler.cpp
Make do_escape_analysis independent of JVMCI capabilities. Nice!


It is the main goal of the enhancement. It is done for C2, but could be done 
for JVMCI compilers
with just a small effort as well.

src/hotspot/share/opto/escape.cpp
Annotation for MachSafePointNodes. Your added functionality looks correct.
But I'd prefer to move the bulky code out of the large function.
I suggest to factor out something like has_not_global_escape and 
has_arg_escape. So the code could look like this:
       SafePointNode* sfn = sfn_worklist.at(next);
       sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn));
       if (sfn->is_CallJava()) {
         CallJavaNode* call = sfn->as_CallJava();
         call->set_arg_escape(has_arg_escape(call));
       }
This would also allow us to get rid of the found_..._escape_in_args variables 
making the loops better readable.


Done.

It's kind of ugly to use strcmp to recognize uncommon trap, but that seems to 
be the way to do it (there are more such places). So it's ok.


Yeah. I copied the snippet.

src/hotspot/share/prims/jvmtiImpl.cpp
src/hotspot/share/prims/jvmtiImpl.hpp
The sequence is pretty complex:
VM_GetOrSetLocal element initialization executes EscapeBarrier code which 
suspends the target thread (extra VM Operation).


Note that the target threads have to be suspended already for 
VM_GetOrSetLocal*. So it's mainly the
synchronization effect of EscapeBarrier::sync_and_suspend_one() that is 
required here. Also no extra
_handshake_ is executed, since sync_and_suspend_one() will find the target 
threads already
suspended.

VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM Thread to 
prepare VM Operation with frame deoptimization).
VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which 
resumes the target thread.
But I don't have any improvement proposal. Performance is probably not a 
concern, here. So it's ok.

VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has 
non-globally escaping objects and other frames if they have arg escaping ones. 
Good.


It's not specifically the top frame, but the frame that is accessed.

src/hotspot/share/runtime/deoptimization.cpp
Object deoptimization. I have more comments and proposals, here.
First of all, handling recursive and waiting locks in relock_objects is tricky, 
but looks correct.
Comments are sufficient to understand why things are done as they are 
implemented.

BiasedLocking related parts are complex, but we may get rid of them in the 
future (with BiasedLocking removal).
Anyway, looks correct, too.

Typo in comment: "regularily" => "regularly"

Deoptimization::fetch_unroll_info_helper is the only place where 
_jvmti_deferred_updates get deallocated (except JavaThread destructor). But I 
think we always go through it, so I can't see a memory leak or such kind of 
issues.


That's correct. The compiled frame for which deferred updates are allocated is 
always deoptimized
before (see EscapeBarrier::deoptimize_objects()). This is also asserted in
compiledVFrame::update_deferred_value(). I've added the same assertion to
Deoptimization::relock_objects(). So we can be sure that 
_jvmti_deferred_updates are deallocated
again in fetch_unroll_info_helper().

EscapeBarrier::deoptimize_objects: ResourceMark should use calling_thread().


Sure, well spotted!

You can use MutexLocker and MonitorLocker with Thread* to save the 
Thread::current() call.


Right, good hint. This was recently introduced with 8235678. I even had to 
resolve conflicts. Should
have done this then.

I'd make set_objs_are_deoptimized static and remove it from the EscapeBarrier 
interface because I think it shouldn't be used outside of 
EscapeBarrier::deoptimize_objects.


Done.

Typo in comment: "we must only deoptimize" => "we only have to deoptimize"


Replaced with "[...] we deoptimize iff local objects are passed as args"

"bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and 
barrier_active() is redundant. Implementation can get moved to hpp file.


Ok. Done.

I'll get back to suspend flags, later.

There are weird cases regarding _self_deoptimization_in_progress.
Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C can 
set _self_deoptimization_in_progress while A performs the handshake for 
suspending C. I think this doesn't lead to errors, but it's probably not 
desired.
I think it would be better to use only one "wait" call in sync_and_suspend_one 
and sync_and_suspend_all.


You're right. We've discussed that face-to-face, but couldn't find a real 
issue. But now, thinking again, a reckon I found one:

2808   // Sync with other threads that might be doing deoptimizations
2809   {
2810     // Need to switch to _thread_blocked for the wait() call
2811     ThreadBlockInVM tbivm(_calling_thread);
2812     MonitorLocker ml(EscapeBarrier_lock, Mutex::_no_safepoint_check_flag);
2813     while (_self_deoptimization_in_progress) {
2814       ml.wait();
2815     }
2816
2817     if (self_deopt()) {
2818       _self_deoptimization_in_progress = true;
2819     }
2820
2821     while (_deoptee_thread->is_ea_obj_deopt_suspend()) {
2822       ml.wait();
2823     }
2824
2825     if (self_deopt()) {
2826       return;
2827     }
2828
2829     // set suspend flag for target thread
2830     _deoptee_thread->set_ea_obj_deopt_flag();
2831   }

- A waits in 2822
- C is suspended
- B notifies all in resume_one()
- A and C wake up
- C wins over A and sets _self_deoptimization_in_progress = true in 2818
- C does the self deoptimization
- A executes 2830 _deoptee_thread->set_ea_obj_deopt_flag()

C will self suspend at some undefined point. The resulting state is illegal.

I first thought it'd be better to move ThreadBlockInVM before wait() to reduce 
thread state transitions, but that seems to be problematic because 
ThreadBlockInVM destructor contains a safepoint check which we shouldn't do 
while holding EscapeBarrier_lock. So no change request.


Yes, would be nice to have the state change only if needed, but for the reason 
you mentioned it is
not quite as easy as it seems to be. I experimented as well with a second lock, 
but did not succeed.

Change in thred_added:
I think the sequence would be more comprehensive if we waited for 
deopt_all_threads in Thread::start and all other places where a new thread can 
run into Java code (e.g. JVMTI attach).
Your version makes new threads come up with suspend flag set. That looks 
correct, too. Advantage is that you only have to change one place 
(thread_added). It'll be interesting to see how it will look like when we use 
async handshakes instead of suspend flags.
For now, I'm ok with your version.


I had a version that did what you are suggesting. The current version also has 
the advantage, that
there are fewer places where a thread has to wait for ongoing object 
deoptimization. This means
viewer places where you have to worry about correct thread state transitions, 
possible deadlocks,
and if all oops are properly Handle'ed.

I'd only move MutexLocker ml(EscapeBarrier_lock...) after if 
(!jt->is_hidden_from_external_view()).


Done.

Having 4 different deoptimize_objects functions makes it a little hard to keep 
an overview of which one is used for what.
Maybe adding suffixes would help a little bit, but I can also live with what 
you have.
Implementation looks correct to me.


2 are internal. I added the suffix _internal to them. This leaves 2 to choose 
from.

src/hotspot/share/runtime/deoptimization.hpp
Escape barriers and object deoptimization functions.
Typo in comment: "helt" => "held"


Done in place already.

src/hotspot/share/runtime/interfaceSupport.cpp
InterfaceSupport::deoptimizeAllObjects() is only used for DeoptimizeObjectsALot 
= 1.
I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad 
to have DeoptimizeObjectsALot = 1 in addition. Ok.


I never used DeoptimizeObjectsALot = 1 that much. It could be more 
deterministic in single threaded
scenarios. I wouldn't object to get rid of it though.

src/hotspot/share/runtime/stackValue.hpp
Better reinitilization in StackValue. Good.


StackValue::obj_is_scalar_replaced() should not return true after calling 
set_obj().

src/hotspot/share/runtime/thread.cpp
src/hotspot/share/runtime/thread.hpp
src/hotspot/share/runtime/thread.inline.hpp
wait_for_object_deoptimization, suspend flag, deferred updates and test feature 
to deoptimize objects.

In the long term, we want to get rid of suspend flags, so it's not so nice to 
introduce a new one. But I agree with Götz that it should be acceptable as 
temporary solution until async handshakes are available (which takes more 
time). So I'm ok with your change.


I'm keen to build the feature on async handshakes when the arive.

You can use MutexLocker with Thread*.


Done.

JVMTIDeferredUpdates: I agree with Robin. It'd be nice to move the class out of 
thread.hpp.


Done.

src/hotspot/share/runtime/vframe.cpp
Added support for entry frame to new_vframe. Ok.

src/hotspot/share/runtime/vframe_hp.cpp
src/hotspot/share/runtime/vframe_hp.hpp

I think code()->as_nmethod() in not_global_escape_in_scope() and arg_escape() 
should better be under #ifdef ASSERT or inside the assert statement (no need for 
code cache walking in product build).


Done.

jvmtiDeferredLocalVariableSet::update_monitors:
Please add a comment explaining that owner referenced by original info may be 
scalar replaced, but it is deoptimized in the vframe.

Done.

-----Original Message-----
From: Doerr, Martin <[email protected]>
Sent: Donnerstag, 12. März 2020 17:28
To: Reingruber, Richard <[email protected]>; 'Robbin Ehn' <[email protected]>; 
Lindenmaier, Goetz <[email protected]>; David Holmes <[email protected]>; Vladimir 
Kozlov ([email protected]) <[email protected]>; [email protected]; 
[email protected]; [email protected]
Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in 
the Presence of JVMTI Agents

Hi Richard,

I managed to find time for a (almost) complete review of webrev.4. (I'll review 
the tests separately.)

First of all, the change seems to be in pretty good quality for its significant 
complexity. I couldn't find any real bugs. But I'd like to propose minor 
improvements.
I'm convinced that it's mature because we did substantial testing.

I like the new functionality for object deoptimization. It can possibly be 
reused for future escape analysis based optimizations. So I appreciate having 
it available in the code base.
In addition to that, your change makes the JVMTI implementation better 
integrated into the VM.

Now to the details:

src/hotspot/share/c1/c1_IR.hpp
describe_scope parameters. Ok.

src/hotspot/share/ci/ciEnv.cpp
src/hotspot/share/ci/ciEnv.hpp
Fix for JvmtiExport::can_walk_any_space() capability. Ok.

src/hotspot/share/code/compiledMethod.cpp
Nice cleanup!

src/hotspot/share/code/debugInfoRec.cpp
src/hotspot/share/code/debugInfoRec.hpp
Additional parmeters. (Remark: I think "non_global_escape_in_scope" would read better 
than "not_global_escape_in_scope", but your version is consistent with existing code, so 
no change request from my side.) Ok.

src/hotspot/share/code/nmethod.cpp
Nice cleanup!

src/hotspot/share/code/pcDesc.hpp
Additional parameters. Ok.

src/hotspot/share/code/scopeDesc.cpp
src/hotspot/share/code/scopeDesc.hpp
Improved implementation + additional parameters. Ok.

src/hotspot/share/compiler/compileBroker.cpp
src/hotspot/share/compiler/compileBroker.hpp
Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a 
follow up change together with the test in order to make this webrev smaller, 
but since it is included, I'm reviewing everything at once. Not a big deal.) Ok.

src/hotspot/share/jvmci/jvmciCodeInstaller.cpp
Additional parameters. Ok.

src/hotspot/share/opto/c2compiler.cpp
Make do_escape_analysis independent of JVMCI capabilities. Nice!

src/hotspot/share/opto/callnode.hpp
Additional fields for MachSafePointNodes. Ok.

src/hotspot/share/opto/escape.cpp
Annotation for MachSafePointNodes. Your added functionality looks correct.
But I'd prefer to move the bulky code out of the large function.
I suggest to factor out something like has_not_global_escape and 
has_arg_escape. So the code could look like this:
       SafePointNode* sfn = sfn_worklist.at(next);
       sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn));
       if (sfn->is_CallJava()) {
         CallJavaNode* call = sfn->as_CallJava();
         call->set_arg_escape(has_arg_escape(call));
       }
This would also allow us to get rid of the found_..._escape_in_args variables 
making the loops better readable.

It's kind of ugly to use strcmp to recognize uncommon trap, but that seems to 
be the way to do it (there are more such places). So it's ok.

src/hotspot/share/opto/machnode.hpp
Additional fields for MachSafePointNodes. Ok.

src/hotspot/share/opto/macro.cpp
Allow elimination of non-escaping allocations. Ok.

src/hotspot/share/opto/matcher.cpp
src/hotspot/share/opto/output.cpp
Copy attribute / pass parameters. Ok.

src/hotspot/share/prims/jvmtiCodeBlobEvents.cpp
Nice cleanup!

src/hotspot/share/prims/jvmtiEnv.cpp
src/hotspot/share/prims/jvmtiEnvBase.cpp
Escape barriers + deoptimize objects for target thread. Good.

src/hotspot/share/prims/jvmtiImpl.cpp
src/hotspot/share/prims/jvmtiImpl.hpp
The sequence is pretty complex:
VM_GetOrSetLocal element initialization executes EscapeBarrier code which 
suspends the target thread (extra VM Operation).
VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM Thread to 
prepare VM Operation with frame deoptimization).
VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which 
resumes the target thread.
But I don't have any improvement proposal. Performance is probably not a 
concern, here. So it's ok.

VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has 
non-globally escaping objects and other frames if they have arg escaping ones. 
Good.

src/hotspot/share/prims/jvmtiTagMap.cpp
Escape barriers + deoptimize objects for all threads. Ok.

src/hotspot/share/prims/whitebox.cpp
Added WB_IsFrameDeoptimized to API. Ok.

src/hotspot/share/runtime/deoptimization.cpp
Object deoptimization. I have more comments and proposals, here.
First of all, handling recursive and waiting locks in relock_objects is tricky, 
but looks correct.
Comments are sufficient to understand why things are done as they are 
implemented.

BiasedLocking related parts are complex, but we may get rid of them in the 
future (with BiasedLocking removal).
Anyway, looks correct, too.

Typo in comment: "regularily" => "regularly"

Deoptimization::fetch_unroll_info_helper is the only place where 
_jvmti_deferred_updates get deallocated (except JavaThread destructor). But I 
think we always go through it, so I can't see a memory leak or such kind of 
issues.

EscapeBarrier::deoptimize_objects: ResourceMark should use calling_thread().

You can use MutexLocker and MonitorLocker with Thread* to save the 
Thread::current() call.

I'd make set_objs_are_deoptimized static and remove it from the EscapeBarrier 
interface because I think it shouldn't be used outside of 
EscapeBarrier::deoptimize_objects.

Typo in comment: "we must only deoptimize" => "we only have to deoptimize"

"bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and 
barrier_active() is redundant. Implementation can get moved to hpp file.

I'll get back to suspend flags, later.

There are weird cases regarding _self_deoptimization_in_progress.
Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C can 
set _self_deoptimization_in_progress while A performs the handshake for 
suspending C. I think this doesn't lead to errors, but it's probably not 
desired.
I think it would be better to use only one "wait" call in sync_and_suspend_one 
and sync_and_suspend_all.

I first thought it'd be better to move ThreadBlockInVM before wait() to reduce 
thread state transitions, but that seems to be problematic because 
ThreadBlockInVM destructor contains a safepoint check which we shouldn't do 
while holding EscapeBarrier_lock. So no change request.

Change in thred_added:
I think the sequence would be more comprehensive if we waited for 
deopt_all_threads in Thread::start and all other places where a new thread can 
run into Java code (e.g. JVMTI attach).
Your version makes new threads come up with suspend flag set. That looks 
correct, too. Advantage is that you only have to change one place 
(thread_added). It'll be interesting to see how it will look like when we use 
async handshakes instead of suspend flags.
For now, I'm ok with your version.

I'd only move MutexLocker ml(EscapeBarrier_lock...) after if 
(!jt->is_hidden_from_external_view()).

Having 4 different deoptimize_objects functions makes it a little hard to keep 
an overview of which one is used for what.
Maybe adding suffixes would help a little bit, but I can also live with what 
you have.
Implementation looks correct to me.

src/hotspot/share/runtime/deoptimization.hpp
Escape barriers and object deoptimization functions.
Typo in comment: "helt" => "held"

src/hotspot/share/runtime/globals.hpp
Addition of develop flag DeoptimizeObjectsALotInterval. Ok.

src/hotspot/share/runtime/interfaceSupport.cpp
InterfaceSupport::deoptimizeAllObjects() is only used for DeoptimizeObjectsALot 
= 1.
I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad 
to have DeoptimizeObjectsALot = 1 in addition. Ok.

src/hotspot/share/runtime/interfaceSupport.inline.hpp
Addition of deoptimizeAllObjects. Ok.

src/hotspot/share/runtime/mutexLocker.cpp
src/hotspot/share/runtime/mutexLocker.hpp
Addition of EscapeBarrier_lock. Ok.

src/hotspot/share/runtime/objectMonitor.cpp
Make recursion count relock aware. Ok.

src/hotspot/share/runtime/stackValue.hpp
Better reinitilization in StackValue. Good.

src/hotspot/share/runtime/thread.cpp
src/hotspot/share/runtime/thread.hpp
src/hotspot/share/runtime/thread.inline.hpp
wait_for_object_deoptimization, suspend flag, deferred updates and test feature 
to deoptimize objects.

In the long term, we want to get rid of suspend flags, so it's not so nice to 
introduce a new one. But I agree with Götz that it should be acceptable as 
temporary solution until async handshakes are available (which takes more 
time). So I'm ok with your change.

You can use MutexLocker with Thread*.

JVMTIDeferredUpdates: I agree with Robin. It'd be nice to move the class out of 
thread.hpp.

src/hotspot/share/runtime/vframe.cpp
Added support for entry frame to new_vframe. Ok.

src/hotspot/share/runtime/vframe_hp.cpp
src/hotspot/share/runtime/vframe_hp.hpp

I think code()->as_nmethod() in not_global_escape_in_scope() and arg_escape() 
should better be under #ifdef ASSERT or inside the assert statement (no need for 
code cache walking in product build).

jvmtiDeferredLocalVariableSet::update_monitors:
Please add a comment explaining that owner referenced by original info may be 
scalar replaced, but it is deoptimized in the vframe.

src/hotspot/share/utilities/macros.hpp
Addition of NOT_COMPILER2_OR_JVMCI_RETURN macros. Ok.

test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysisEnabled.java
test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnalysisEnabled.c
New test. Will review separately.

test/jdk/TEST.ROOT
Addition of vm.jvmci as required property. Ok.

test/jdk/com/sun/jdi/EATests.java
test/jdk/com/sun/jdi/EATestsJVMCI.java
New test. Will review separately.

test/lib/sun/hotspot/WhiteBox.java
Added isFrameDeoptimized to API. Ok.

That was it. Best regards,
Martin

-----Original Message-----
From: hotspot-compiler-dev <hotspot-compiler-dev-
[email protected]> On Behalf Of Reingruber, Richard
Sent: Dienstag, 3. März 2020 21:23
To: 'Robbin Ehn' <[email protected]>; Lindenmaier, Goetz
<[email protected]>; David Holmes <[email protected]>;
Vladimir Kozlov ([email protected])
<[email protected]>; [email protected];
[email protected]; hotspot-runtime-
[email protected]
Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better
Performance in the Presence of JVMTI Agents

Hi Robbin,

I understand that Robbin proposed to replace the usage of
_suspend_flag with handshakes. Apparently, async handshakes
are needed to do so. We have been waiting a while for removal
of the _suspend_flag / introduction of async handshakes [2].
What is the status here?

I have an old prototype which I would like to continue to work on.
So do not assume asynch handshakes will make 15.
Even if it would, I think there are a lot more investigate work to remove
_suspend_flag.


Let us know, if we can be of any help to you and be it only testing.

Full:

http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/

DeoptimizeObjectsALotThread is only used in compileBroker.cpp.
You can move both declaration and definition to that file, no need to

clobber

thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry)


Will do.

Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in it's

own

hpp file? It doesn't seem right to add JVM TI classes into thread.hpp.


You are right. It shouldn't be declared in thread.hpp. I will look into that.

Note that we also think we may have a bug in deopt:
https://bugs.openjdk.java.net/browse/JDK-8238237

I think it would be best, if possible, to push after that is resolved.


Sure.

Not even nearly a full review :)


I know :)

Anyways, thanks a lot,
Richard.


-----Original Message-----
From: Robbin Ehn <[email protected]>
Sent: Monday, March 2, 2020 11:17 AM
To: Lindenmaier, Goetz <[email protected]>; Reingruber, Richard
<[email protected]>; David Holmes <[email protected]>;
Vladimir Kozlov ([email protected])
<[email protected]>; [email protected];
[email protected]; hotspot-runtime-
[email protected]
Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance
in the Presence of JVMTI Agents

Hi,

On 2/24/20 5:39 PM, Lindenmaier, Goetz wrote:

Hi,

I had a look at the progress of this change. Nothing
happened since Richard posted his update using more
handshakes [1].
But we (SAP) would appreciate a lot if this change could
be successfully reviewed and pushed.

I think there is basic understanding that this
change is helpful. It fixes a number of issues with JVMTI,
and will deliver the same performance benefits as EA
does in current production mode for debugging scenarios.

This is important for us as we run our VMs prepared
for debugging in production mode.

I understand that Robbin proposed to replace the usage of
_suspend_flag with handshakes. Apparently, async handshakes
are needed to do so. We have been waiting a while for removal
of the _suspend_flag / introduction of async handshakes [2].
What is the status here?


I have an old prototype which I would like to continue to work on.
So do not assume asynch handshakes will make 15.
Even if it would, I think there are a lot more investigate work to remove
_suspend_flag.


I think we should no longer wait, but proceed with
this change. We will look into removing the usage of
suspend_flag introduced here once it is possible to implement
it with handshakes.


Yes, sure.

Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/


DeoptimizeObjectsALotThread is only used in compileBroker.cpp.
You can move both declaration and definition to that file, no need to clobber
thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry)

Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in it's
own
hpp file? It doesn't seem right to add JVM TI classes into thread.hpp.

Note that we also think we may have a bug in deopt:
https://bugs.openjdk.java.net/browse/JDK-8238237

I think it would be best, if possible, to push after that is resolved.

Not even nearly a full review :)

Thanks, Robbin

Incremental:
http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4.inc/

I was not able to eliminate the additional suspend flag now. I'll take care

of this

as soon as the
existing suspend-resume-mechanism is reworked.

Testing:

Nightly tests @SAP:

    JCK and JTREG, also in Xcomp mode, SPECjvm2008, SPECjbb2015,

Renaissance

Suite, SAP specific tests
    with fastdebug and release builds on all platforms

    Stress testing with DeoptimizeObjectsALot running SPECjvm2008 40x

parallel

for 24h

Thanks, Richard.


More details on the changes:

* Hide DeoptimizeObjectsALotThread from external view.

* Changed EscapeBarrier_lock to be a _safepoint_check_never lock.
    It used to be _safepoint_check_sometimes, which will be eliminated

sooner or

later.
    I added explicit thread state changes with ThreadBlockInVM to code

paths

where we can wait()
    on EscapeBarrier_lock to become safepoint safe.

* Use handshake EscapeBarrierSuspendHandshake to suspend target

threads

instead of vm operation
    VM_ThreadSuspendAllForObjDeopt.

* Removed uses of Threads_lock. When adding a new thread we suspend

it iff

EA optimizations are
    being reverted. In the previous version we were waiting on

Threads_lock

while EA optimizations
    were reverted. See EscapeBarrier::thread_added().

* Made tests require Xmixed compilation mode.

* Made tests agnostic regarding tiered compilation.
    I.e. tc isn't disabled anymore, and the tests can be run with tc enabled or
disabled.

* Exercising EATests.java as well with stress test options
DeoptimizeObjectsALot*
    Due to the non-deterministic deoptimizations some tests need to be

skipped.

    We do this to prevent bit-rot of the stress test code.

* Executing EATests.java as well with graal if available. Driver for this is
    EATestsJVMCI.java. Graal cannot pass all tests, because it does not

provide all

the new debug info
    (namely not_global_escape_in_scope and arg_escape in

scopeDesc.hpp).

    And graal does not yet support the JVMTI operations force early return

and

pop frame.

* Removed tracing from new jdi tests in EATests.java. Too much trace

output

before the debugging
    connection is established can cause deadlock because output buffers fill

up.

    (See https://bugs.openjdk.java.net/browse/JDK-8173304)

* Many copyright year changes and smaller clean-up changes of testing

code

(trailing white-space and
    the like).


-----Original Message-----
From: David Holmes <[email protected]>
Sent: Donnerstag, 19. Dezember 2019 03:12
To: Reingruber, Richard <[email protected]>; serviceability-
[email protected]; [email protected];

hotspot-

[email protected]; Vladimir Kozlov

([email protected])

<[email protected]>
Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better

Performance in

the Presence of JVMTI Agents

Hi Richard,

I think my issue is with the way EliminateNestedLocks works so I'm going
to look into that more deeply.

Thanks for the explanations.

David

On 18/12/2019 12:47 am, Reingruber, Richard wrote:

Hi David,

     > >    > Some further queries/concerns:
     > >    >
     > >    > src/hotspot/share/runtime/objectMonitor.cpp
     > >    >
     > >    > Can you please explain the changes to ObjectMonitor::wait:
     > >    >
     > >    > !   _recursions = save      // restore the old recursion count
     > >    > !                 + jt->get_and_reset_relock_count_after_wait(); 
//
     > >    > increased by the deferred relock count
     > >    >
     > >    > what is the "deferred relock count"? I gather it relates to
     > >    >
     > >    > "The code was extended to be able to deoptimize objects of a
     > > frame that
     > >    > is not the top frame and to let another thread than the owning
     > > thread do
     > >    > it."
     > >
     > > Yes, these relate. Currently EA based optimizations are reverted,

when a

compiled frame is

     > > replaced with corresponding interpreter frames. Part of this is

relocking

objects with eliminated

     > > locking. New with the enhancement is that we do this also just

before

object references are

     > > acquired through JVMTI. In this case we deoptimize also the

owning

compiled frame C and we

     > > register deoptimized objects as deferred updates. When control

returns

to C it gets deoptimized,

     > > we notice that objects are already deoptimized (reallocated and

relocked), so we don't do it again

     > > (relocking twice would be incorrect of course). Deferred updates

are

copied into the new

     > > interpreter frames.
     > >
     > > Problem: relocking is not possible if the target thread T is waiting

on the

monitor that needs to

     > > be relocked. This happens only with non-local objects with

EliminateNestedLocks. Instead relocking

     > > is deferred until T owns the monitor again. This is what the piece of

code above does.

     >
     >  Sorry I need some more detail here. How can you wait() on an

object

     >  monitor if the object allocation and/or locking was optimised away?

And

     >  what is a "non-local object" in this context? Isn't EA restricted to
     >  thread-confined objects?

"Non-local object" is an object that escapes its thread. The issue I'm

addressing with the changes

in ObjectMonitor::wait are almost unrelated to EA. They are caused by

EliminateNestedLocks, where C2

eliminates recursive locking of an already owned lock. The lock owning

object

exists on the heap, it

is locked and you can call wait() on it.

EliminateLocks is the C2 option that controls lock elimination based on

EA.

Both optimizations have

in common that objects with eliminated locking need to be relocked

when

deoptimizing a frame,

i.e. when replacing a compiled frame with equivalent interpreter
frames. Deoptimization::relock_objects does that job for /all/ eliminated

locks in scope. /All/ can

be a mix of eliminated nested locks and locks of not-escaping objects.

New with the enhancement: I call relock_objects earlier, just before

objects

pontentially

escape. But then later when the owning compiled frame gets

deoptimized, I

must not do it again:


See call to EscapeBarrier::objs_are_deoptimized in deoptimization.cpp:

    373   if ((jvmci_enabled || ((DoEscapeAnalysis ||

EliminateNestedLocks) &&

EliminateLocks))

    374       && !EscapeBarrier::objs_are_deoptimized(thread,

deoptee.id())) {

    375     bool unused;
    376     eliminate_locks(thread, chunk, realloc_failures, deoptee,

exec_mode,

unused);

    377   }

Now when calling relock_objects early it is quiet possible that I have to

relock

an object the

target thread currently waits for. Obviously I cannot relock in this case,

instead I chose to

introduce relock_count_after_wait to JavaThread.

     >  Is it just that some of the locking gets optimized away e.g.
     >
     >  synchronised(obj) {
     >     synchronised(obj) {
     >       synchronised(obj) {
     >         obj.wait();
     >       }
     >     }
     >  }
     >
     >  If this is reduced to a form as-if it were a single lock of the monitor
     >  (due to EA) and the wait() triggers a JVM TI event which leads to the
     >  escape of "obj" then we need to reconstruct the true lock state, and

so

     >  when the wait() internally unblocks and reacquires the monitor it

has to

     >  set the true recursion count to 3, not the 1 that it appeared to be

when

     >  wait() was initially called. Is that the scenario?

Kind of... except that the locking is not eliminated due to EA and there is

no

JVM TI event

triggered by wait.

Add

LocalObject l1 = new LocalObject();

in front of the synchrnized blocks and assume a JVM TI agent acquires l1.

This

triggers the code in

question.

See that relocking/reallocating is transactional. If it is done then for /all/

objects in scope and it is

done at most once. It wouldn't be quite so easy to split this in relocking

of

nested/EA-based

eliminated locks.

     >  If so I find this truly awful. Anyone using wait() in a realistic form
     >  requires a notification and so the object cannot be thread confined.

In


It is not thread confined.

     >  which case I would strongly argue that upon hitting the wait() the

deopt

     >  should occur unconditionally and so the lock state is correct before

we

     >  wait and so we don't need to mess with the recursion count

internally

     >  when we reacquire the monitor.
     >
     > >
     > >    > which I don't like the sound of at all when it comes to

ObjectMonitor

     > >    > state. So I'd like to understand in detail exactly what is going 
on

here

     > >    > and why.  This is a very intrusive change that seems to badly

break

     > >    > encapsulation and impacts future changes to ObjectMonitor

that are

under

     > >    > investigation.
     > >
     > > I would not regard this as breaking encapsulation. Certainly not

badly.

     > >
     > > I've added a property relock_count_after_wait to JavaThread. The

property is well

     > > encapsulated. Future ObjectMonitor implementations have to deal

with

recursion too. They are free

     > > in choosing a way to do that as long as that property is taken into

account. This is hardly a

     > > limitation.
     >
     >  I do think this badly breaks encapsulation as you have to add a

callout

     >  from the guts of the ObjectMonitor code to reach into the thread to

get

     >  this lock count adjustment. I understand why you have had to do

this but

     >  I would much rather see a change to the EA optimisation strategy so

that

     >  this is not needed.
     >
     > > Note also that the property is a straight forward extension of the

existing concept of deferred

     > > local updates. It is embedded into the structure holding them. So

not

even the footprint of a

     > > JavaThread is enlarged if no deferred updates are generated.
     >
     > [...]
     >
     > >
     > > I'm actually duplicating the existing external suspend mechanism,

because a thread can be

     > > suspended at most once. And hey, and don't like that either! But it

seems not unlikely that the

     > > duplicate can be removed together with the original and the new

type

of handshakes that will be

     > > used for thread suspend can be used for object deoptimization

too. See

today's discussion in

     > > JDK-8227745 [2].
     >
     >  I hope that discussion bears some fruit, at the moment it seems not

to

     >  be possible to use handshakes here. :(
     >
     >  The external suspend mechanism is a royal pain in the proverbial

that we

     >  have to carefully live with. The idea that we're duplicating that for
     >  use in another fringe area of functionality does not thrill me at all.
     >
     >  To be clear, I understand the problem that exists and that you wish

to

     >  solve, but for the runtime parts I balk at the complexity cost of
     >  solving it.

I know it's complex, but by far no rocket science.

Also I find it hard to imagine another fix for JDK-8233915 besides

changing

the JVM TI specification.


Thanks, Richard.

-----Original Message-----
From: David Holmes <[email protected]>
Sent: Dienstag, 17. Dezember 2019 08:03
To: Reingruber, Richard <[email protected]>; serviceability-

[email protected]; [email protected];

hotspot-

[email protected]; Vladimir Kozlov

([email protected])

<[email protected]>

Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better

Performance

in the Presence of JVMTI Agents


<resend as my mailer crashed during last send>

David

On 17/12/2019 4:57 pm, David Holmes wrote:

Hi Richard,

On 14/12/2019 5:01 am, Reingruber, Richard wrote:

Hi David,

      > Some further queries/concerns:
      >
      > src/hotspot/share/runtime/objectMonitor.cpp
      >
      > Can you please explain the changes to ObjectMonitor::wait:
      >
      > !   _recursions = save      // restore the old recursion count
      > !                 + jt->get_and_reset_relock_count_after_wait(); //
      > increased by the deferred relock count
      >
      > what is the "deferred relock count"? I gather it relates to
      >
      > "The code was extended to be able to deoptimize objects of a
frame that
      > is not the top frame and to let another thread than the owning
thread do
      > it."

Yes, these relate. Currently EA based optimizations are reverted,

when

a compiled frame is replaced
with corresponding interpreter frames. Part of this is relocking
objects with eliminated
locking. New with the enhancement is that we do this also just before
object references are acquired
through JVMTI. In this case we deoptimize also the owning compiled
frame C and we register
deoptimized objects as deferred updates. When control returns to C

it

gets deoptimized, we notice
that objects are already deoptimized (reallocated and relocked), so

we

don't do it again (relocking
twice would be incorrect of course). Deferred updates are copied into
the new interpreter frames.

Problem: relocking is not possible if the target thread T is waiting
on the monitor that needs to be
relocked. This happens only with non-local objects with
EliminateNestedLocks. Instead relocking is
deferred until T owns the monitor again. This is what the piece of
code above does.


Sorry I need some more detail here. How can you wait() on an object
monitor if the object allocation and/or locking was optimised away?

And

what is a "non-local object" in this context? Isn't EA restricted to
thread-confined objects?

Is it just that some of the locking gets optimized away e.g.

synchronised(obj) {
      synchronised(obj) {
        synchronised(obj) {
          obj.wait();
        }
      }
}

If this is reduced to a form as-if it were a single lock of the monitor
(due to EA) and the wait() triggers a JVM TI event which leads to the
escape of "obj" then we need to reconstruct the true lock state, and so
when the wait() internally unblocks and reacquires the monitor it has to
set the true recursion count to 3, not the 1 that it appeared to be when
wait() was initially called. Is that the scenario?

If so I find this truly awful. Anyone using wait() in a realistic form
requires a notification and so the object cannot be thread confined. In
which case I would strongly argue that upon hitting the wait() the

deopt

should occur unconditionally and so the lock state is correct before we
wait and so we don't need to mess with the recursion count internally
when we reacquire the monitor.


      > which I don't like the sound of at all when it comes to
ObjectMonitor
      > state. So I'd like to understand in detail exactly what is going
on here
      > and why.  This is a very intrusive change that seems to badly

break

      > encapsulation and impacts future changes to ObjectMonitor that
are under
      > investigation.

I would not regard this as breaking encapsulation. Certainly not badly.

I've added a property relock_count_after_wait to JavaThread. The
property is well
encapsulated. Future ObjectMonitor implementations have to deal

with

recursion too. They are free in
choosing a way to do that as long as that property is taken into
account. This is hardly a
limitation.


I do think this badly breaks encapsulation as you have to add a callout
from the guts of the ObjectMonitor code to reach into the thread to

get

this lock count adjustment. I understand why you have had to do this

but

I would much rather see a change to the EA optimisation strategy so

that

this is not needed.

Note also that the property is a straight forward extension of the
existing concept of deferred
local updates. It is embedded into the structure holding them. So not
even the footprint of a
JavaThread is enlarged if no deferred updates are generated.

      > ---
      >
      > src/hotspot/share/runtime/thread.cpp
      >
      > Can you please explain why
JavaThread::wait_for_object_deoptimization
      > has to be handcrafted in this way rather than using proper
transitions.
      >

I wrote wait_for_object_deoptimization taking
JavaThread::java_suspend_self_with_safepoint_check
as template. So in short: for the same reasons :)

Threads reach both methods as part of thread state transitions,
therefore special handling is
required to change thread state on top of ongoing transitions.

      > We got rid of "deopt suspend" some time ago and it is disturbing
to see
      > it being added back (effectively). This seems like it may be
something
      > that handshakes could be used for.

Deopt suspend used to be something rather different with a similar
name[1]. It is not being added back.


I stand corrected. Despite comments in the code to the contrary
deopt_suspend didn't actually cause a self-suspend. I was doing a lot of
cleanup in this area 13 years ago :)


I'm actually duplicating the existing external suspend mechanism,
because a thread can be suspended
at most once. And hey, and don't like that either! But it seems not
unlikely that the duplicate can
be removed together with the original and the new type of

handshakes

that will be used for
thread suspend can be used for object deoptimization too. See

today's

discussion in JDK-8227745 [2].


I hope that discussion bears some fruit, at the moment it seems not to
be possible to use handshakes here. :(

The external suspend mechanism is a royal pain in the proverbial that

we

have to carefully live with. The idea that we're duplicating that for
use in another fringe area of functionality does not thrill me at all.

To be clear, I understand the problem that exists and that you wish to
solve, but for the runtime parts I balk at the complexity cost of
solving it.

Thanks,
David
-----

Thanks, Richard.

[1] Deopt suspend was something like an async. handshake for
architectures with register windows,
        where patching the return pc for deoptimization of a compiled
frame was racy if the owner thread
        was in native code. Instead a "deopt" suspend flag was set on
which the thread patched its own
        frame upon return from native. So no thread was suspended. It

got

its name only from the name of
        the flags.

[2] Discussion about using handshakes to sync. with the target thread:

https://bugs.openjdk.java.net/browse/JDK-

8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.syst
e

m.issuetabpanels:comment-tabpanel#comment-14306727



-----Original Message-----
From: David Holmes <[email protected]>
Sent: Freitag, 13. Dezember 2019 00:56
To: Reingruber, Richard <[email protected]>;
[email protected];
[email protected];
[email protected]
Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
Performance in the Presence of JVMTI Agents

Hi Richard,

Some further queries/concerns:

src/hotspot/share/runtime/objectMonitor.cpp

Can you please explain the changes to ObjectMonitor::wait:

!   _recursions = save      // restore the old recursion count
!                 + jt->get_and_reset_relock_count_after_wait(); //
increased by the deferred relock count

what is the "deferred relock count"? I gather it relates to

"The code was extended to be able to deoptimize objects of a frame

that

is not the top frame and to let another thread than the owning thread

do

it."

which I don't like the sound of at all when it comes to ObjectMonitor
state. So I'd like to understand in detail exactly what is going on here
and why.  This is a very intrusive change that seems to badly break
encapsulation and impacts future changes to ObjectMonitor that are

under

investigation.

---

src/hotspot/share/runtime/thread.cpp

Can you please explain why

JavaThread::wait_for_object_deoptimization

has to be handcrafted in this way rather than using proper transitions.

We got rid of "deopt suspend" some time ago and it is disturbing to

see

it being added back (effectively). This seems like it may be something
that handshakes could be used for.

Thanks,
David
-----

On 12/12/2019 7:02 am, David Holmes wrote:

On 12/12/2019 1:07 am, Reingruber, Richard wrote:

Hi David,

       > Most of the details here are in areas I can comment on in

detail,

but I
       > did take an initial general look at things.

Thanks for taking the time!


Apologies the above should read:

"Most of the details here are in areas I *can't* comment on in detail
..."

David

       > The only thing that jumped out at me is that I think the
       > DeoptimizeObjectsALotThread should be a hidden thread.
       >
       > +  bool is_hidden_from_external_view() const { return true; }

Yes, it should. Will add the method like above.

       > Also I don't see any testing of the

DeoptimizeObjectsALotThread.

Without
       > active testing this will just bit-rot.

DeoptimizeObjectsALot is meant for stress testing with a larger
workload. I will add a minimal test
to keep it fresh.

       > Also on the tests I don't understand your @requires clause:
       >
       >   @requires ((vm.compMode != "Xcomp") &

vm.compiler2.enabled

       > (vm.opt.TieredCompilation != true))
       >
       > This seems to require that TieredCompilation is disabled, but
tiered is
       > our normal mode of operation. ??
       >

I removed the clause. I guess I wanted to target the tests towards

the

code they are supposed to
test, and it's easier to analyze failures w/o tiered compilation and
with just one compiler thread.

Additionally I will make use of
compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the tests.

Thanks,
Richard.

-----Original Message-----
From: David Holmes <[email protected]>
Sent: Mittwoch, 11. Dezember 2019 08:03
To: Reingruber, Richard <[email protected]>;
[email protected];
[email protected];
[email protected]
Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
Performance in the Presence of JVMTI Agents

Hi Richard,

On 11/12/2019 7:45 am, Reingruber, Richard wrote:

Hi,

I would like to get reviews please for

http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/


Corresponding RFE:
https://bugs.openjdk.java.net/browse/JDK-8227745

Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915
And potentially https://bugs.openjdk.java.net/browse/JDK-

8214584 [1]


Vladimir Kozlov kindly put webrev.3 through tier1-8 testing

without

issues (thanks!). In addition the
change is being tested at SAP since I posted the first RFR some
months ago.

The intention of this enhancement is to benefit performance wise

from

escape analysis even if JVMTI
agents request capabilities that allow them to access local variable
values. E.g. if you start-up
with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,

then

escape analysis is disabled right
from the beginning, well before a debugger attaches -- if ever one
should do so. With the
enhancement, escape analysis will remain enabled until and after

debugger attaches. EA based
optimizations are reverted just before an agent acquires the
reference to an object. In the JBS item
you'll find more details.


Most of the details here are in areas I can comment on in detail, but

did take an initial general look at things.

The only thing that jumped out at me is that I think the
DeoptimizeObjectsALotThread should be a hidden thread.

+  bool is_hidden_from_external_view() const { return true; }

Also I don't see any testing of the DeoptimizeObjectsALotThread.
Without
active testing this will just bit-rot.

Also on the tests I don't understand your @requires clause:

       @requires ((vm.compMode != "Xcomp") &

vm.compiler2.enabled &

(vm.opt.TieredCompilation != true))

This seems to require that TieredCompilation is disabled, but tiered

is

our normal mode of operation. ??

Thanks,
David

Thanks,
Richard.

[1] Experimental fix for JDK-8214584 based on JDK-8227745

http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.pa
tc

Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Reply via email to