On Wed, 22 Apr 2026 12:58:44 GMT, Andrew Haley <[email protected]> wrote:
>> Please use [this >> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the >> files changed. >> >> Profile counters scale very badly. >> >> The overhead for profiled code isn't too bad with one thread, but as the >> thread count increases, things go wrong very quickly. >> >> For example, here's a benchmark from the OpenJDK test suite, run at >> TieredLevel 3 with one thread, then three threads: >> >> >> Benchmark (randomized) Mode Cnt Score Error Units >> InterfaceCalls.test2ndInt5Types false avgt 4 27.468 ± 2.631 ns/op >> InterfaceCalls.test2ndInt5Types false avgt 4 240.010 ± 6.329 ns/op >> >> >> This slowdown is caused by high memory contention on the profile counters. >> Not only is this slow, but it can also lose profile counts. >> >> This patch is for C1 only. It'd be easy to randomize C1 counters as well in >> another PR, if anyone thinks it's worth doing. >> >> One other thing to note is that randomized profile counters degrade very >> badly with small decimation ratios. For example, using a ratio of 2 with >> `-XX:ProfileCaptureRatio=2` with a single thread results in >> >> >> Benchmark (randomized) Mode Cnt Score Error >> Units >> InterfaceCalls.test2ndInt5Types false avgt 4 80.147 ± 9.991 >> ns/op >> >> >> The problem is that the branch prediction rate drops away very badly, >> leading to many mispredictions. It only really makes sense to use higher >> decimation ratios, e.g. 64. >> >> --------- >> - [x] I confirm that I make this contribution in accordance with the >> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai). > > Andrew Haley has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 111 commits: > > - Merge branch 'master' into JDK-8134940 > - Review comments > - Review comments > - Add FrameMap::last_fpu_reg() > - Review comments > - Review comments > - Review comments > - Merge branch 'JDK-8134940' of https://github.com/theRealAph/jdk into > JDK-8134940 > - More > - Merge branch 'JDK-8134940' of https://github.com/theRealAph/jdk into > JDK-8134940 > - ... and 101 more: https://git.openjdk.org/jdk/compare/3e1d0b8e...4610777b Made a pass with our code analysers, there are quite a few hits since the code is hairy :) I believe the following deserve followups: src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1312: > 1310: > 1311: int profile_capture_ratio = ProfileCaptureRatio; > 1312: int ratio_shift = exact_log2(profile_capture_ratio); These look unused, intentional? src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.hpp line 121: > 119: void restore_profile_rng(); > 120: > 121: void adjust_mdo_address(Address *a, BasicType type); Indenting. src/hotspot/cpu/arm/c1_FrameMap_arm.hpp line 99: > 97: static int adjust_reg_range(int range) { > 98: int result = range - (ProfileCaptureRatio > 1); > 99: return align_down(result, 2); // ouch "Ouch" deserves a bit more explanation in the comment :) src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 1910: > 1908: assert(ret_addr_offset == __ offset(), "embedded return address not > allowed"); > 1909: add_call_info_here(op->info()); > 1910: __ restore_profile_rng(); All right, this handles Java calls. What about `rt_call`-s? They clobber `xmm14/xmm15` that we are using for `UseVregsForProfileCapture` too, since they are caller-save. src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2515: > 2513: __ mov(inc, AsmOperand(inc, lsl, ratio_shift)); > 2514: } > 2515: __ increment_mdp_data_at(dest_adr, dest, 1); So we compute `inc`, but never pass it to `increment_mdp_data_at`? src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2525: > 2523: Address dest_adr = counter_address; > 2524: inc *= ProfileCaptureRatio; > 2525: __ increment_mdp_data_at(dest_adr, dest, 1); Same here: `inc` is computed, but not passed. src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 232: > 230: Register bumped_count, > 231: int increment) { > 232: assert(ProfileInterpreter, "must be profiling interpreter"); Sounds misleading. Surely this is now not only interpreter, but C1 code too. src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 259: > 257: > https://www.researchgate.net/publication/2683298_A_Collection_of_Selected_Pseudorandom_Number_Generators_With_Linear_Structures > */ > 258: mov_slow(temp, 69069); > 259: mul(state, state, temp); `+1` is missing in this LGC? AArch64 has it. src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 263: > 261: > 262: void C1_MacroAssembler::save_profile_rng() { > 263: if (ProfileCaptureRatio != 1) { Here and later, looks like a common way to test if `PCR` is enabled is `PCR > 0`, use it everywhere? src/hotspot/cpu/arm/c1_Runtime1_arm.cpp line 780: > 778: #undef FUNCTION_CASE > 779: return ""; > 780: } With `SOFT_FP`, this will produce an extra `}`, breaking the build. Since you have moved the `}` outside the `#ifdef SOFT_FP`, this one should be removed as well. src/hotspot/cpu/arm/c1_Runtime1_arm.cpp line 782: > 780: } > 781: #else // __SOFTFP__ > 782: if (entry == StubRoutines::Arm::atomic_compareAndSet_long_entry()) { While you are here, indenting. src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2214: > 2212: void LIR_Assembler::align_call(LIR_Code code) { > 2213: // We do this here in order not to affect call site alignment. > 2214: __ save_profile_rng(); So this does saving twice on x86, compared to other platforms? Once in `LIR_Assembler::emit_call`, and then here, called from that `LIR_A::emit_call`. src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2877: > 2875: if (step_opr->is_register()) { > 2876: Register inc = step_opr->as_register(); > 2877: __ lea(dest, counter_address); This looks wrong: it loads the _address_ of the counter, not the counter value itself. So the subsequent ops update the counter to `&counter + inc`, rather than to `counter + inc`, corrupting the counter? The branch below does it right: ` __ movl(dest, counter_address);` src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 2918: > 2916: } else { > 2917: __ jmp(*overflow_stub->entry()); > 2918: goto exit; Do you care about `goto` here? Sounds like `return` would suffice, unless I am missing something? src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 3009: > 3007: if (known_klass->equals(receiver)) { > 3008: Address data_addr(mdo, md->byte_offset_of_slot(data, > VirtualCallData::receiver_count_offset(i))); > 3009: increment_mdo(masm(), counter_addr, > DataLayout::counter_increment, This one should be `data_addr`, not `counter_addr`, right? Looks like copy-paste error. src/hotspot/cpu/x86/c1_LinearScan_x86.hpp line 65: > 63: inline bool LinearScanWalker::pd_init_regs_for_alloc(Interval* cur) { > 64: int last_xmm_reg = pd_first_xmm_reg + > XMMRegister::available_xmm_registers() > 65: - (UseVregsForProfileCapture ? 2 : 0) - 1; With AVX-512, subtracting two last registers reserve `xmm30/xmm31`, not `xmm14/xmm15` that we are using for counters. So profiling code would stomp the real registers on AVX-512? src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 44: > 42: #include "utilities/globalDefinitions.hpp" > 43: > 44: Register r_profile_rng; So in AArch64 this one is `extern`. Should neither/both be `extern`? src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 275: > 273: > 274: if (!UseVregsForProfileCapture && VM_Version::supports_sse4_2()) { > 275: /* CRC used as a psuedo-random-number generator */ `pseudo` src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 282: > 280: } else { > 281: if (UseVregsForProfileCapture) { > 282: vpmulld(xmm15, xmm15, xmm14, Assembler::AVX_128bit); Does this require `vpmulld` `AVX > 0`? Does current code assert/fail with `-XX:UseAVX=0` or some such? src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 287: > 285: > https://www.researchgate.net/publication/2683298_A_Collection_of_Selected_Pseudorandom_Number_Generators_With_Linear_Structures > */ > 286: movl(temp, 69069); > 287: imull(state, temp); `+1` is missing in this LGC? AArch64 has it. src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 295: > 293: if (ProfileCaptureRatio != 1) { > 294: if (UseVregsForProfileCapture) { > 295: movsd(Address(r15_thread, JavaThread::profile_rng_offset()), > xmm15); `movsd` does a 8-byte load, while `JavaThread::_profile_rng` is `uint32_t`. So this reads adjacent memory. I think it is also a problem in `restore_profile_rng`. src/hotspot/share/c1/c1_CodeStubs.hpp line 151: > 149: class ProfileStub: public CodeStub { > 150: private: > 151: AbstractLambdaWrapper *_action; Initialize this to `nullptr`. src/hotspot/share/c1/c1_CodeStubs.hpp line 167: > 165: virtual void print_name(outputStream* out) const { out->print("%s", > _name); } > 166: #endif // PRODUCT > 167: virtual void visit(LIR_OpVisitState* visitor) { } Is there a problem in having empty `ProfileStub::visit` and deferring emit to lambda? AFAIU, C1 linear scan is using this visitor to figure out register lifetimes. But since this is a stub, we probably do not care? Anyway, from the symmetry with other `CodeStub` subclasses, I think it misses `visitor->do_slow_case(_info);` or some such. src/hotspot/share/c1/c1_LIR.cpp line 897: > 895: } > 896: > 897: case lir_increment_counter: { Just to make sure I understand: `freq_op` operand is missing here intentionally, or? src/hotspot/share/c1/c1_LIR.cpp line 904: > 902: do_input(opr->_step); do_temp(opr->_step); > 903: if (opr->_result->is_valid()) { > 904: do_temp(opr->_result); do_output(opr->_result); Analysis flags that it is unusual to have an operand declared both `temp` and `output`. Not sure if C1 treats this well. src/hotspot/share/c1/c1_LIR.cpp line 2108: > 2106: step()->print(out); out->print(" "); > 2107: dest()->print(out); out->print(" "); > 2108: // temp_op()->print(out); out->print(" "); Dead code. src/hotspot/share/c1/c1_LIR.hpp line 1949: > 1947: > 1948: public: > 1949: // Destroys recv There is no `recv`, so comment is misleading. src/hotspot/share/c1/c1_LIRGenerator.cpp line 968: > 966: > 967: // MDO cells are intptr_t, so the data_reg width is arch-dependent. > 968: LIR_Opr data_reg = new_pointer_register(); Looks unused. The comment is probably stale too. src/hotspot/share/c1/c1_LIRGenerator.cpp line 3215: > 3213: } > 3214: LIR_Opr result = notify ? new_register(T_INT) : > LIR_OprFact::intConst(0); > 3215: LIR_Opr tmp = new_register(T_INT); Looks unused. src/hotspot/share/runtime/flags/jvmFlagConstraintsCompiler.cpp line 408: > 406: { > 407: JVMFlag::printError(verbose, > 408: "ProfileCaptureRatio is not supported on this > target\n"); Sounds like `1` should be accepted for all platforms as "randomized profiles are effectively disabled"? Otherwise putting `-XX:ProfileCaptureRatio=1` would fail on unsupported platforms. Actually, I am not 100% sure it would not fail with just the default value? src/hotspot/share/runtime/sharedRuntime.cpp line 2585: > 2583: return; > 2584: } > 2585: This looks unrelated to PR, and at least indenting is broken. Maybe you want to pull it out as separate change? ------------- PR Review: https://git.openjdk.org/jdk/pull/28541#pullrequestreview-4234047534 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193890041 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193899973 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193913255 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193857551 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193721818 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193723158 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193931283 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193956004 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3194005612 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193711776 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193712468 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193949223 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193770581 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193898518 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193697691 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193864084 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193909447 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193920208 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193797351 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193956665 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193784385 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193885084 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193836206 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193993003 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3194000598 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193917168 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193918800 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193893896 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193887623 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193881568 PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3193871573
