On Wed, 22 Apr 2026 12:58:44 GMT, Andrew Haley <[email protected]> wrote:

>> Please use [this 
>> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the 
>> files changed.
>> 
>> Profile counters scale very badly.
>> 
>> The overhead for profiled code isn't too bad with one thread, but as the 
>> thread count increases, things go wrong very quickly.
>> 
>> For example, here's a benchmark from the OpenJDK test suite, run at 
>> TieredLevel 3 with one thread, then three threads:
>> 
>> 
>> Benchmark (randomized) Mode Cnt Score Error Units
>> InterfaceCalls.test2ndInt5Types false avgt 4 27.468 ± 2.631 ns/op
>> InterfaceCalls.test2ndInt5Types false avgt 4 240.010 ± 6.329 ns/op
>> 
>> 
>> This slowdown is caused by high memory contention on the profile counters. 
>> Not only is this slow, but it can also lose profile counts.
>> 
>> This patch is for C1 only. It'd be easy to randomize C1 counters as well in 
>> another PR, if anyone thinks it's worth doing.
>> 
>> One other thing to note is that randomized profile counters degrade very 
>> badly with small decimation ratios. For example, using a ratio of 2 with 
>> `-XX:ProfileCaptureRatio=2` with a single thread results in
>> 
>> 
>> Benchmark                        (randomized)  Mode  Cnt   Score   Error  
>> Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  
>> ns/op
>> 
>> 
>> The problem is that the branch prediction rate drops away very badly, 
>> leading to many mispredictions. It only really makes sense to use higher 
>> decimation ratios, e.g. 64.
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Andrew Haley has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 111 commits:
> 
>  - Merge branch 'master' into JDK-8134940
>  - Review comments
>  - Review comments
>  - Add FrameMap::last_fpu_reg()
>  - Review comments
>  - Review comments
>  - Review comments
>  - Merge branch 'JDK-8134940' of https://github.com/theRealAph/jdk into 
> JDK-8134940
>  - More
>  - Merge branch 'JDK-8134940' of https://github.com/theRealAph/jdk into 
> JDK-8134940
>  - ... and 101 more: https://git.openjdk.org/jdk/compare/3e1d0b8e...4610777b

Made a pass with our code analysers, there are quite a few hits since the code 
is hairy :) I believe the following deserve followups:

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1312:

> 1310: 
> 1311:   int profile_capture_ratio = ProfileCaptureRatio;
> 1312:   int ratio_shift = exact_log2(profile_capture_ratio);

These look unused, intentional?

src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.hpp line 121:

> 119:   void restore_profile_rng();
> 120: 
> 121: void adjust_mdo_address(Address *a, BasicType type);

Indenting.

src/hotspot/cpu/arm/c1_FrameMap_arm.hpp line 99:

> 97:   static int adjust_reg_range(int range) {
> 98:     int result = range - (ProfileCaptureRatio > 1);
> 99:     return align_down(result, 2);  // ouch

"Ouch" deserves a bit more explanation in the comment :)

src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 1910:

> 1908:   assert(ret_addr_offset == __ offset(), "embedded return address not 
> allowed");
> 1909:   add_call_info_here(op->info());
> 1910:   __ restore_profile_rng();

All right, this handles Java calls. What about `rt_call`-s? They clobber 
`xmm14/xmm15` that we are using for `UseVregsForProfileCapture` too, since they 
are caller-save.

src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2515:

> 2513:         __ mov(inc, AsmOperand(inc, lsl, ratio_shift));
> 2514:       }
> 2515:       __ increment_mdp_data_at(dest_adr, dest, 1);

So we compute `inc`, but never pass it to `increment_mdp_data_at`?

src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2525:

> 2523:           Address dest_adr = counter_address;
> 2524:           inc *= ProfileCaptureRatio;
> 2525:           __ increment_mdp_data_at(dest_adr, dest, 1);

Same here: `inc` is computed, but not passed.

src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 232:

> 230:                                               Register bumped_count,
> 231:                                               int increment) {
> 232:   assert(ProfileInterpreter, "must be profiling interpreter");

Sounds misleading. Surely this is now not only interpreter, but C1 code too.

src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 259:

> 257:      
> https://www.researchgate.net/publication/2683298_A_Collection_of_Selected_Pseudorandom_Number_Generators_With_Linear_Structures
>  */
> 258:   mov_slow(temp, 69069);
> 259:   mul(state, state, temp);

`+1` is missing in this LGC? AArch64 has it.

src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 263:

> 261: 
> 262: void C1_MacroAssembler::save_profile_rng() {
> 263:   if (ProfileCaptureRatio != 1) {

Here and later, looks like a common way to test if `PCR` is enabled is `PCR > 
0`, use it everywhere?

src/hotspot/cpu/arm/c1_Runtime1_arm.cpp line 780:

> 778: #undef FUNCTION_CASE
> 779:   return "";
> 780: }

With `SOFT_FP`, this will produce an extra `}`, breaking the build. Since you 
have moved the `}` outside the `#ifdef SOFT_FP`, this one should be removed as 
well.

src/hotspot/cpu/arm/c1_Runtime1_arm.cpp line 782:

> 780: }
> 781: #else  // __SOFTFP__
> 782: if (entry == StubRoutines::Arm::atomic_compareAndSet_long_entry()) {

While you are here, indenting.

src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2214:

> 2212: void LIR_Assembler::align_call(LIR_Code code) {
> 2213:   // We do this here in order not to affect call site alignment.
> 2214:   __ save_profile_rng();

So this does saving twice on x86, compared to other platforms? Once in 
`LIR_Assembler::emit_call`, and then here, called from that `LIR_A::emit_call`.

src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2877:

> 2875:     if (step_opr->is_register()) {
> 2876:       Register inc = step_opr->as_register();
> 2877:       __ lea(dest, counter_address);

This looks wrong: it loads the _address_ of the counter, not the counter value 
itself. So the subsequent ops update the counter to `&counter + inc`, rather 
than to `counter + inc`, corrupting the counter? The branch below does it 
right: ` __ movl(dest, counter_address);`

src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2918:

> 2916:         } else {
> 2917:           __ jmp(*overflow_stub->entry());
> 2918:           goto exit;

Do you care about `goto` here? Sounds like `return` would suffice, unless I am 
missing something?

src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 3009:

> 3007:         if (known_klass->equals(receiver)) {
> 3008:           Address data_addr(mdo, md->byte_offset_of_slot(data, 
> VirtualCallData::receiver_count_offset(i)));
> 3009:           increment_mdo(masm(), counter_addr, 
> DataLayout::counter_increment,

This one should be `data_addr`, not `counter_addr`, right? Looks like 
copy-paste error.

src/hotspot/cpu/x86/c1_LinearScan_x86.hpp line 65:

> 63: inline bool LinearScanWalker::pd_init_regs_for_alloc(Interval* cur) {
> 64:   int last_xmm_reg = pd_first_xmm_reg + 
> XMMRegister::available_xmm_registers()
> 65:     - (UseVregsForProfileCapture ? 2 : 0) - 1;

With AVX-512, subtracting two last registers reserve `xmm30/xmm31`, not 
`xmm14/xmm15` that we are using for counters. So profiling code would stomp the 
real registers on AVX-512?

src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 44:

> 42: #include "utilities/globalDefinitions.hpp"
> 43: 
> 44: Register r_profile_rng;

So in AArch64 this one is `extern`. Should neither/both be `extern`?

src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 275:

> 273: 
> 274:   if (!UseVregsForProfileCapture && VM_Version::supports_sse4_2()) {
> 275:     /* CRC used as a psuedo-random-number generator */

`pseudo`

src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 282:

> 280:   } else {
> 281:     if (UseVregsForProfileCapture) {
> 282:       vpmulld(xmm15, xmm15, xmm14, Assembler::AVX_128bit);

Does this require `vpmulld` `AVX > 0`? Does current code assert/fail with 
`-XX:UseAVX=0` or some such?

src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 287:

> 285:        
> https://www.researchgate.net/publication/2683298_A_Collection_of_Selected_Pseudorandom_Number_Generators_With_Linear_Structures
>  */
> 286:       movl(temp, 69069);
> 287:       imull(state, temp);

`+1` is missing in this LGC? AArch64 has it.

src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 295:

> 293:   if (ProfileCaptureRatio != 1) {
> 294:     if (UseVregsForProfileCapture) {
> 295:       movsd(Address(r15_thread, JavaThread::profile_rng_offset()), 
> xmm15);

`movsd` does a 8-byte load, while `JavaThread::_profile_rng` is `uint32_t`. So 
this reads adjacent memory. I think it is also a problem in 
`restore_profile_rng`.

src/hotspot/share/c1/c1_CodeStubs.hpp line 151:

> 149: class ProfileStub: public CodeStub {
> 150: private:
> 151:   AbstractLambdaWrapper *_action;

Initialize this to `nullptr`.

src/hotspot/share/c1/c1_CodeStubs.hpp line 167:

> 165:   virtual void print_name(outputStream* out) const { out->print("%s", 
> _name); }
> 166: #endif // PRODUCT
> 167:   virtual void visit(LIR_OpVisitState* visitor) { }

Is there a problem in having empty `ProfileStub::visit` and deferring emit to 
lambda? AFAIU, C1 linear scan is using this visitor to figure out register 
lifetimes. But since this is a stub, we probably do not care? Anyway, from the 
symmetry with other `CodeStub` subclasses, I think it misses  
`visitor->do_slow_case(_info);` or some such.

src/hotspot/share/c1/c1_LIR.cpp line 897:

> 895:     }
> 896: 
> 897:     case lir_increment_counter: {

Just to make sure I understand: `freq_op` operand is missing here 
intentionally, or?

src/hotspot/share/c1/c1_LIR.cpp line 904:

> 902:       do_input(opr->_step);                do_temp(opr->_step);
> 903:       if (opr->_result->is_valid()) {
> 904:         do_temp(opr->_result);             do_output(opr->_result);

Analysis flags that it is unusual to have an operand declared both `temp` and 
`output`. Not sure if C1 treats this well.

src/hotspot/share/c1/c1_LIR.cpp line 2108:

> 2106:   step()->print(out);          out->print(" ");
> 2107:   dest()->print(out);          out->print(" ");
> 2108:   // temp_op()->print(out);       out->print(" ");

Dead code.

src/hotspot/share/c1/c1_LIR.hpp line 1949:

> 1947: 
> 1948:  public:
> 1949:   // Destroys recv

There is no `recv`, so comment is misleading.

src/hotspot/share/c1/c1_LIRGenerator.cpp line 968:

> 966: 
> 967:     // MDO cells are intptr_t, so the data_reg width is arch-dependent.
> 968:     LIR_Opr data_reg = new_pointer_register();

Looks unused. The comment is probably stale too.

src/hotspot/share/c1/c1_LIRGenerator.cpp line 3215:

> 3213:   }
> 3214:   LIR_Opr result = notify ? new_register(T_INT) : 
> LIR_OprFact::intConst(0);
> 3215:   LIR_Opr tmp = new_register(T_INT);

Looks unused.

src/hotspot/share/runtime/flags/jvmFlagConstraintsCompiler.cpp line 408:

> 406:   {
> 407:     JVMFlag::printError(verbose,
> 408:                         "ProfileCaptureRatio is not supported on this 
> target\n");

Sounds like `1` should be accepted for all platforms as "randomized profiles 
are effectively disabled"? Otherwise putting `-XX:ProfileCaptureRatio=1` would 
fail on unsupported platforms. Actually, I am not 100% sure it would not fail 
with just the default value?

src/hotspot/share/runtime/sharedRuntime.cpp line 2585:

> 2583:     return;
> 2584:   }
> 2585: 

This looks unrelated to PR, and at least indenting is broken. Maybe you want to 
pull it out as separate change?

-------------

PR Review: https://git.openjdk.org/jdk/pull/28541#pullrequestreview-4234047534
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193890041
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193899973
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193913255
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193857551
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193721818
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193723158
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193931283
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193956004
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3194005612
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193711776
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193712468
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193949223
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193770581
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193898518
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193697691
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193864084
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193909447
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193920208
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193797351
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193956665
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193784385
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193885084
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193836206
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193993003
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3194000598
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193917168
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193918800
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193893896
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193887623
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193881568
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193871573

Reply via email to