On Mon, 23 Mar 2026 18:43:41 GMT, Andrew Haley <[email protected]> wrote:
>> Please use [this link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the files changed.
>>
>> Profile counters scale very badly.
>>
>> The overhead for profiled code isn't too bad with one thread, but as the thread count increases, things go wrong very quickly.
>>
>> For example, here's a benchmark from the OpenJDK test suite, run at TieredLevel 3 with one thread, then three threads:
>>
>>     Benchmark                        (randomized)  Mode  Cnt    Score   Error  Units
>>     InterfaceCalls.test2ndInt5Types         false  avgt    4   27.468 ± 2.631  ns/op
>>     InterfaceCalls.test2ndInt5Types         false  avgt    4  240.010 ± 6.329  ns/op
>>
>> This slowdown is caused by high memory contention on the profile counters. Not only is this slow, but it can also lose profile counts.
>>
>> This patch is for C1 only. It'd be easy to randomize the interpreter's counters as well in another PR, if anyone thinks it's worth doing.
>>
>> One other thing to note is that randomized profile counters degrade very badly with small decimation ratios. For example, using a ratio of 2 with `-XX:ProfileCaptureRatio=2` with a single thread results in
>>
>>     Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
>>     InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  ns/op
>>
>> The problem is that the branch prediction rate drops away very badly, leading to many mispredictions. It only really makes sense to use higher decimation ratios, e.g. 64.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix up any out-of-range offsets

**Summary**

This EXPERIMENTAL PR is now ready for review. It dramatically reduces memory traffic during the profile-capture phase of HotSpot execution (i.e. in C1-compiled code). It does so at a modest cost: increased C1-generated code size. The resulting C2-generated code might be slightly worse in quality, but it would take deep analysis to be sure.
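For concreteness, the randomized decimation described above can be sketched as follows. This is a minimal C++ sketch under stated assumptions, not HotSpot's actual generated code: the names (`ThreadLocalSampler`, `profile_event`) are hypothetical, and the 69069 LCG from the discussion stands in for whichever generator the patch emits. The idea is that each thread consults a cheap private RNG and touches the shared counter only about once per `ratio` events, adding `ratio` so the expected count is unchanged.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of randomized profile-counter decimation. Instead of incrementing
// a shared counter on every event (heavy memory traffic under contention),
// each thread keeps a tiny private RNG and only touches the shared counter
// about once per `ratio` events, scaled so the expected count is preserved.
struct ThreadLocalSampler {
    uint32_t state;
    explicit ThreadLocalSampler(uint32_t seed) : state(seed) {}

    // The "smallest decent" LCG named in the text: x' = 69069*x + 1 (mod 2^32).
    uint32_t next() { return state = 69069u * state + 1u; }

    // True roughly once per `ratio` calls; `ratio` must be a power of two.
    // The high bits are used because the low bits of an LCG are weak.
    bool should_sample(uint32_t ratio) {
        return ((next() >> 16) & (ratio - 1)) == 0;
    }
};

// One profiling event: update the shared counter only when sampled.
void profile_event(ThreadLocalSampler& rng, uint64_t& shared_counter,
                   uint32_t ratio) {
    if (rng.should_sample(ratio)) {
        shared_counter += ratio;  // scale up to keep the expected value
    }
}
```

With a ratio of 64, the shared counter is written roughly once per 64 events, so cross-thread cache-line contention drops by about the same factor, at the cost of noisier counts and a data-dependent branch, which is exactly the misprediction cost the quoted text describes for small ratios.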
**Performance**

I've been testing three configurations: baseline, this branch with `-XX:ProfileCaptureRatio=1`, and this branch with `-XX:ProfileCaptureRatio=64`.

For high-level tests I've used DaCapo: `dacapo-23.11-MR2-chopin.jar --iterations 15 jython -s large`

Platform: Apple Mac Mini M1 running Fedora

Mainline: 100%
With this PR, ProfileCaptureRatio=1: 2%-3% faster
With this PR, ProfileCaptureRatio=64: 0.7% slower

I do not understand why this PR improves performance on the jython benchmark with `ProfileCaptureRatio=1`. The profile-capture code is almost identical to that generated by mainline, and in any case after warmup I believe that almost all of the code being executed is C2-compiled. A slight slowdown with `ProfileCaptureRatio=64` is plausible because the captured profile counts are noisier, so C2 has somewhat less reliable data to work with. Given the usual noisiness of profile counts I wouldn't expect a huge difference, and 0.7% is within the margin of error.

**Code Size**

C1-compiled code is slightly larger because a random number is generated at every sample point. This is somewhat mitigated by slightly better code quality, because the sampling code is hand-written assembly rather than generated via C1 LIR, but that is a really small difference.

**Random number generation**

Most commonly used random number generators are too slow for this application. Even XorShift, the usual go-to for super-lightweight generators, is too much. I've ended up with two candidates: a simple linear feedback shift register (using a CRC instruction) and the smallest decent linear congruential generator, _69069x + 1_.

Unfortunately, it's really hard to know which of these to choose. CRC probably wins on high-end modern processors, but on an AMD Threadripper launched in 2018 (12nm Zen+) it seems to be faster to use the LCG (i.e. a multiply) than CRC. I don't know of any way to find out which will be best on a particular processor without trying it. It wouldn't be impossible to run both for a millisecond at startup to find out!
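To make the two candidates concrete, here is a hedged sketch. The LCG is exactly the _69069x + 1_ recurrence named above; the CRC-based LFSR is shown as a bitwise software step over the CRC-32C polynomial, standing in for the single hardware instruction (x86 SSE4.2 `crc32`, Arm `crc32cw`) the generated code would actually emit. Function names here are illustrative, not HotSpot's.

```cpp
#include <cstdint>

// Candidate 1: the smallest decent linear congruential generator,
// x' = 69069*x + 1 (mod 2^32). One multiply and one add per step.
static inline uint32_t lcg_next(uint32_t& state) {
    return state = 69069u * state + 1u;
}

// Candidate 2: a linear feedback shift register step. On real hardware this
// is a single CRC instruction; here a bitwise CRC-32C step over one constant
// byte serves as a portable stand-in (reflected polynomial 0x82F63B78).
static inline uint32_t crc_next(uint32_t& state) {
    uint32_t crc = state ^ 0xFFu;  // fold in a constant byte
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return state = crc;
}
```

Which is faster is microarchitecture-dependent, as noted above: on cores where the CRC instruction has single-cycle throughput the LFSR step wins; on cores where the multiplier is quicker, the LCG does.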
**ProfileCaptureRatio selection**

Low profile capture ratios, where counter updates are frequent, destroy performance because of constant branch mispredictions. 64 seems to be a reasonable compromise between profile data accuracy and profiling overhead.

**Supported Architectures**

I've done 32- and 64-bit Arm and x86.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28541#issuecomment-4120264165
