Re: RFR: JDK-8293114: JVM should trim the native heap [v12]
On Fri, 14 Jul 2023 20:03:24 GMT, Thomas Stuefe wrote: >> This is a continuation of https://github.com/openjdk/jdk/pull/10085. I >> closed https://github.com/openjdk/jdk/pull/10085 because it had accumulated >> too much comment history and got confusing. For a history of this issue, see >> previous discussions [1] and the comment section of 10085. >> >> --- >> >> This RFE adds the option to trim the Glibc heap periodically. This can >> recover a significant memory footprint if the VM process suffers from >> high-but-rare malloc spikes. It does not matter who causes the spikes: the >> JDK or customer code running in the JVM process. >> >> ### Background: >> >> The Glibc is reluctant to return memory to the OS. Temporary malloc spikes >> often carry over as permanent RSS increase. Note that C-heap retention is >> difficult to observe. Since it is freed memory, it won't appear in NMT; it >> is just a part of RSS. >> >> This is, effectively, caching - a performance tradeoff by the glibc. It >> makes a lot of sense with applications that cause high traffic on the >> C-heap. The JVM, however, clusters allocations and often rolls its own >> memory management based on virtual memory for many of its use cases. >> >> To manually trim the C-heap, Glibc exposes `malloc_trim(3)`. With JDK 18 >> [2], we added a new jcmd command to *manually* trim the C-heap on Linux >> (`jcmd System.trim_native_heap`). We then observed customers running this >> command periodically to slim down process sizes of container-bound jvms. >> That is cumbersome, and the JVM can do this a lot better - among other >> things because it knows best when *not* to trim. >> >> GLIBC internals >> >> The following information I took from the glibc source code and >> experimenting. >> >> # Why do we need to trim manually? Does the Glibc not trim on free? >> >> Upon `free()`, glibc may return memory to the OS if: >> - the returned block was mmap'ed >> - the returned block was not added to tcache or to fastbins >> - the returned block, possibly merged with its two immediate neighbors, had >> they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in >> that case: >> a) for the main arena, glibc attempts to lower the brk() >> b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the >> heap. >> In both cases, (a) and (b), only the top portion of the heap is reclaimed. >> "Holes" in the middle of other in-use chunks are not reclaimed. >> >> So: glibc *may* automatically reclaim memory. In normal configurations, with >> typical C-heap allocation granularity, it is unlikely. >> >> To increase the ... > > Thomas Stuefe has updated the pull request incrementally with two additional > commits since the last revision: > > - Display unsupported text with UL warning level, + test > - Last Aleksey Feedback Marked as reviewed by rehn (Reviewer). - PR Review: https://git.openjdk.org/jdk/pull/14781#pullrequestreview-1538731282
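As an illustration of what periodic trimming boils down to on glibc, here is a small standalone C++ sketch. It is not the HotSpot implementation; NativeHeapTrimmer, PauseMark and the field names are made up for this example, and only malloc_trim(3) is real glibc API. It calls malloc_trim on an interval and supports pausing around allocation-heavy phases, which is the "knows best when *not* to trim" part.

```cpp
// Conceptual sketch only: a periodic glibc heap trimmer, loosely modeled on the
// behavior this RFE adds to the JVM. NativeHeapTrimmer and PauseMark are hypothetical;
// only malloc_trim(3) is the real glibc API.
#include <malloc.h>     // glibc: int malloc_trim(size_t pad)
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

class NativeHeapTrimmer {
  std::atomic<int>  _suspend_count{0};  // > 0 means "do not trim right now"
  std::atomic<bool> _stop{false};
  std::chrono::milliseconds _interval;
  std::thread _thread;

  void run() {
    while (!_stop.load()) {
      std::this_thread::sleep_for(_interval);
      if (_suspend_count.load() > 0) continue;   // paused, e.g. inside a malloc-heavy phase
      // Ask glibc to return free memory at the top of its arenas to the OS.
      int released = malloc_trim(0);
      std::printf("trim: %s\n", released ? "memory returned to OS" : "nothing to return");
    }
  }

 public:
  explicit NativeHeapTrimmer(std::chrono::milliseconds interval)
      : _interval(interval), _thread([this] { run(); }) {}
  ~NativeHeapTrimmer() { _stop = true; _thread.join(); }

  // RAII pause around phases where trimming would only fight the allocator.
  struct PauseMark {
    NativeHeapTrimmer& _trimmer;
    explicit PauseMark(NativeHeapTrimmer& t) : _trimmer(t) { _trimmer._suspend_count++; }
    ~PauseMark() { _trimmer._suspend_count--; }
  };
};

int main() {
  NativeHeapTrimmer trimmer(std::chrono::milliseconds(500));
  {
    NativeHeapTrimmer::PauseMark pause(trimmer);   // don't trim during the spike itself
    std::vector<void*> blocks;
    for (int i = 0; i < 200000; i++) blocks.push_back(std::malloc(2048));
    for (void* p : blocks) std::free(p);
  }
  std::this_thread::sleep_for(std::chrono::seconds(2));  // let the trimmer run and reclaim
  return 0;
}
```

The actual PR of course wires this into the VM's own infrastructure (unified logging, pausing around phases where trimming would be counterproductive), which this sketch only gestures at.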
Re: RFR: JDK-8293114: JVM should trim the native heap [v4]
On Thu, 6 Jul 2023 16:14:20 GMT, Aleksey Shipilev wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional >> commit since the last revision: >> >> last cleanups and shade feedback > > src/hotspot/share/runtime/trimNative.cpp line 149: > >> 147: return true; >> 148: } else { >> 149: log_info(trim)("Trim native heap (no details)"); > > Consistency: `Trim native heap: complete, no details`. I would still like to know the trim_time. > test/hotspot/jtreg/runtime/os/TestTrimNative.java line 247: > >> 245: System.out.println("Will spike now..."); >> 246: for (int i = 0; i < numAllocations; i++) { >> 247: ptrs[i] = >> Unsafe.getUnsafe().allocateMemory(szAllocations); > > Pull `Unsafe.getUnsafe()` into a local or a field? Maybe use WB instead of unsafe? - PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254816943 PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254806836
Re: RFR: JDK-8293114: JVM should trim the native heap [v4]
On Thu, 6 Jul 2023 15:25:03 GMT, Thomas Stuefe wrote: >> This is a continuation of https://github.com/openjdk/jdk/pull/10085. I >> closed https://github.com/openjdk/jdk/pull/10085 because it had accumulated >> too much comment history and got confusing. For a history of this issue, see >> previous discussions [1] and the comment section of 10085. >> >> --- >> >> This RFE adds the option to trim the Glibc heap periodically. This can >> recover a significant memory footprint if the VM process suffers from >> high-but-rare malloc spikes. It does not matter who causes the spikes: the >> JDK or customer code running in the JVM process. >> >> ### Background: >> >> The Glibc is reluctant to return memory to the OS. Temporary malloc spikes >> often carry over as permanent RSS increase. Note that C-heap retention is >> difficult to observe. Since it is freed memory, it won't appear in NMT; it >> is just a part of RSS. >> >> This is, effectively, caching - a performance tradeoff by the glibc. It >> makes a lot of sense with applications that cause high traffic on the >> C-heap. The JVM, however, clusters allocations and often rolls its own >> memory management based on virtual memory for many of its use cases. >> >> To manually trim the C-heap, Glibc exposes `malloc_trim(3)`. With JDK 18 >> [2], we added a new jcmd command to *manually* trim the C-heap on Linux >> (`jcmd System.trim_native_heap`). We then observed customers running this >> command periodically to slim down process sizes of container-bound jvms. >> That is cumbersome, and the JVM can do this a lot better - among other >> things because it knows best when *not* to trim. >> >> GLIBC internals >> >> The following information I took from the glibc source code and >> experimenting. >> >> # Why do we need to trim manually? Does the Glibc not trim on free? >> >> Upon `free()`, glibc may return memory to the OS if: >> - the returned block was mmap'ed >> - the returned block was not added to tcache or to fastbins >> - the returned block, possibly merged with its two immediate neighbors, had >> they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in >> that case: >> a) for the main arena, glibc attempts to lower the brk() >> b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the >> heap. >> In both cases, (a) and (b), only the top portion of the heap is reclaimed. >> "Holes" in the middle of other in-use chunks are not reclaimed. >> >> So: glibc *may* automatically reclaim memory. In normal configurations, with >> typical C-heap allocation granularity, it is unlikely. >> >> To increase the ... > > Thomas Stuefe has updated the pull request incrementally with one additional > commit since the last revision: > > last cleanups and shade feedback Thanks, looks good to me. src/hotspot/share/logging/logTag.hpp line 199: > 197: LOG_TAG(tlab) \ > 198: LOG_TAG(tracking) \ > 199: LOG_TAG(trim) \ Not sure if 'trim' is the best tag name for an average user? src/hotspot/share/runtime/trimNative.cpp line 137: > 135: os::size_change_t sc; > 136: Ticks start = Ticks::now(); > 137: log_debug(trim)("Trim native heap started..."); TraceTime not a good fit? - Marked as reviewed by rehn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/14781#pullrequestreview-1517121458 PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254801114 PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254805330
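For reference, the scoped-timer pattern hinted at with "TraceTime not a good fit?" looks roughly like the standalone sketch below. This is not HotSpot's TraceTime class, just an illustration of the idea: log the elapsed time from a destructor instead of hand-rolling Ticks::now() bookkeeping around the call.

```cpp
// Standalone illustration of a TraceTime-style scoped timer (not the HotSpot class):
// construct it at the top of the timed section and it logs the elapsed time when it
// goes out of scope.
#include <chrono>
#include <cstdio>

class ScopedTraceTime {
  const char* _title;
  std::chrono::steady_clock::time_point _start;
 public:
  explicit ScopedTraceTime(const char* title)
      : _title(title), _start(std::chrono::steady_clock::now()) {}
  ~ScopedTraceTime() {
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - _start).count();
    std::printf("[trim] %s took %lld us\n", _title, static_cast<long long>(us));
  }
};

int main() {
  ScopedTraceTime t("Trim native heap");
  // ... the work being timed would go here ...
}
```

Presumably the attraction of TraceTime here is that the timing and the log output are handled in one place instead of being spread over the start and end of the function.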
Re: RFR: 8307067: remove broken EnableThreadSMRExtraValidityChecks option [v2]
On Tue, 2 May 2023 20:37:18 GMT, Daniel D. Daugherty wrote: >> A trivial fix to remove broken EnableThreadSMRExtraValidityChecks option. > > Daniel D. Daugherty has updated the pull request incrementally with one > additional commit since the last revision: > > dholmes CR - change ':' to '.'. Looks good, thanks! - Marked as reviewed by rehn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13704#pullrequestreview-1412295770
Re: RFR: 8306006: strace001.java fails due to unknown methods on stack
On Fri, 14 Apr 2023 13:27:37 GMT, Fredrik Bredberg wrote: > Added the missing java.lang.Thread.beforeSleep and java.lang.Thread.afterSleep to > expectedSystemTrace. > Tested on my local machine. Marked as reviewed by rehn (Reviewer). Looks good to me! - PR Review: https://git.openjdk.org/jdk/pull/13476#pullrequestreview-1385466134 PR Comment: https://git.openjdk.org/jdk/pull/13476#issuecomment-1508507582
Re: RFR: 8291555: Implement alternative fast-locking scheme [v27]
On Thu, 16 Mar 2023 10:26:26 GMT, Thomas Stuefe wrote: > Sounds good. Just to be clear, you mean enforce symmetric locking? resp. > forbid asymmetric locking? Yes, sorry, thanks for correcting! :) - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8291555: Implement alternative fast-locking scheme [v27]
On Thu, 16 Mar 2023 09:02:19 GMT, Roman Kennke wrote: >> I like -XX:+UseNewLocks, too. I wouldn't overcomplicate things: this flag is >> meant to be transitional, it is not meant to be used by end-users (except >> the bravest nerds) at all. When it lands, the Lilliput flag (e.g. >> +UseCompactObjectHeaders) will also control the locking flag. Eventually >> (e.g. release+1) both flags would become on by default and afterwards (e.g. >> release+2) would go away entirely, at which point the whole original >> stack-locking would disappear. > >> @rkennke I must be missing something. In aarch64, why do we handle the >> non-symmetric-unlock case in interpreter, but not in C1/C2? There, we just >> seem to pop whatever is on top. > > C1 and C2 don't allow asymmetric locking. If that ever happens, they would > refuse to compile the method. We should probably check that this assumption > holds true when popping the top entry in an #ASSERT block. > > > @rkennke I must be missing something. In aarch64, why do we handle the > > > non-symmetric-unlock case in interpreter, but not in C1/C2? There, we > > > just seem to pop whatever is on top. > > > > > > C1 and C2 don't allow asymmetric locking. If that ever happens, they would > > refuse to compile the method. We should probably check that this assumption > > holds true when popping the top entry in an #ASSERT block. > > Thanks for clarifying. Yes, asserting that would make sense. FYI: I'm trying to convince folks that the JVMS should allow JVMs to enforce symmetric locking, i.e. forbid asymmetric locking. We think most people don't realize their code will end up stuck in the interpreter, unintentionally. The latest idea discussed was to diagnose and warn about this behavior as a first step. - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8291555: Implement alternative fast-locking scheme [v27]
On Thu, 16 Mar 2023 08:00:38 GMT, Thomas Stuefe wrote: > I like UseNewLocks but fear that this may conflict with Oracle's plan (?) to > move OMs into heap, which would be another revamp of locking - fat locks in > this case - and may come with yet another switch. Other than that, > UseNewLocks sounds good and succinct. > > Another proposal: UseThreadLockStack or UseLockStack Just an FYI, at the moment we have:

  product(ccstr, ObjectSynchronizerMode, "fast",                         \
          "ObjectSynchronizer modes: "                                   \
          "legacy: legacy native system; "                               \
          "native: java entry with native monitors; "                    \
          "heavy: java entry with always inflated Java monitors; "       \
          "fast: java entry with fast-locks and"                         \
          " inflate-on-demand Java monitors; ")                          \

Personally, I prefer a single option rather than many. With a command line such as `-XX:-UseLockStack -XX:+UseHeavyMonitors` it's harder, at least for me, to figure out what is selected and what was intended to be selected. - PR: https://git.openjdk.org/jdk/pull/10907
RFR: 8300926: Several startup regressions ~6-70% in 21-b6 all platforms
Hi all, please consider. The original issue was that thread 1 asked to deopt nmethod set X and thread 2 asked for the same set or a subset of X. All methods would already be marked, but the actual deoptimization - making them not entrant, patching PCs on stacks and patching post-call nops - was not done yet. This meant thread 2 could 'pass' thread 1. Most places did deopt under the Compile_lock, so this is normally not an issue, but WB and clearCallSiteContext do not. Since a handshake may take a long time to complete, and the Compile_lock is used for much more than deopts, the fix in https://bugs.openjdk.org/browse/JDK-8299074 instead always emitted a handshake, even when everything was already marked. This turned out to be problematic for startup; for example, the number of deopt handshakes in a Jetty dry-run startup went from 5 to 39. This fix first adds a barrier which you do not pass until the requested deopts have happened, and it coalesces the handshakes. Secondly, it moves the handshake part out of the Compile_lock where possible. That means we fix the performance bug and reduce contention on the Compile_lock, giving higher throughput in the compiler and in things such as class loading. It passes t1-t7 with flying colours! t8 is still not completed, and I'm redoing some testing due to last-minute simplifications. Thanks, Robbin - Commit messages: - Fixed WS - Deopt scopes Changes: https://git.openjdk.org/jdk/pull/12585/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12585&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8300926 Stats: 380 lines in 24 files changed: 192 ins; 91 del; 97 mod Patch: https://git.openjdk.org/jdk/pull/12585.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12585/head:pull/12585 PR: https://git.openjdk.org/jdk/pull/12585
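For illustration, here is a small standalone model of the coalescing barrier described above. It is not the actual code in this patch, just the idea: requesters that arrive while a deopt handshake is in flight wait for a handshake that covers their request instead of issuing their own.

```cpp
// Toy model of the coalescing barrier (not the actual patch): several threads request
// "deoptimize the already-marked methods"; one of them performs the handshake, and the
// others just wait for a handshake that started after their own request.
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class DeoptimizationBarrier {
  std::mutex _lock;
  std::condition_variable _cv;
  uint64_t _requested   = 0;    // highest "generation" anyone has asked for
  uint64_t _completed   = 0;    // highest generation whose handshake has finished
  bool     _in_progress = false;

  void do_handshake() {         // stand-in for the real deopt handshake over all threads
    std::printf("one coalesced deopt handshake\n");
  }

 public:
  // Called by a requester after it has marked its nmethods; returns once a handshake
  // issued at or after this request has completed, i.e. the barrier has been passed.
  void deoptimize_marked() {
    std::unique_lock<std::mutex> ul(_lock);
    const uint64_t my_gen = ++_requested;
    while (_completed < my_gen) {
      if (!_in_progress) {
        const uint64_t covers = _requested;  // this pass covers every request made so far
        _in_progress = true;
        ul.unlock();
        do_handshake();
        ul.lock();
        _completed = covers;
        _in_progress = false;
        _cv.notify_all();
      } else {
        _cv.wait(ul);                        // someone else is handshaking; recheck after
      }
    }
  }
};

int main() {
  DeoptimizationBarrier barrier;
  std::vector<std::thread> requesters;
  for (int i = 0; i < 4; i++) requesters.emplace_back([&] { barrier.deoptimize_marked(); });
  for (auto& t : requesters) t.join();
}
```

The actual patch additionally does the marking under the lock while moving the handshake itself out of the Compile_lock where possible; the model only shows the coalescing.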
Re: RFR: 8267935: Replace BasicHashtable and Hashtable
On Sat, 28 Jan 2023 18:32:31 GMT, Afshin Zafari wrote: > ### Description > The Hashtable class is to be replaced with the ResourceHashtable class in all source > code. > > ### Patch > The only instance was `#include "hashtable.hpp"` in `jvmtiTagMapTable.cpp`, > and it was removed. > The corresponding files (`hashtable.hpp`, `hashtable.inline.hpp` and > `hashtable.cpp`) are removed too. > > ### Test > mach5 tier1 on all platforms. Thank you! - Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12275
Re: RFR: 8291555: Implement alternative fast-locking scheme [v4]
On Thu, 19 Jan 2023 12:09:27 GMT, Roman Kennke wrote: > > I always change the flags in code for the reasons David stated, and I can't > > forget to add any flags. > > (Test batch is still running, no failures except > > MonitorInflationTest.java) > > Ok, right. > > I just re-ran tier1-3 with the flag changed in code, on aarch64 and x86_64 > without regressions. Will run tier4 overnight. I did a bigger batch of testing; there were some ThreadMXBean failures with a wrong state. I'll investigate, but otherwise it passed. - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8300913: ZGC: assert(to_addr != 0) failed: Should be forwarded
On Wed, 25 Jan 2023 15:56:49 GMT, Coleen Phillimore wrote: > We thought we didn't need the keep-alive call but we do for heap walking. > Tested with failing test case locally, and tier1-4 in progress. Thanks! - Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12202
Re: RFR: 8299795: Relativize locals in interpreter frames [v4]
On Tue, 17 Jan 2023 17:34:53 GMT, Andrew Haley wrote: >> Fredrik Bredberg has updated the pull request with a new target base due to >> a merge or a rebase. The incremental webrev excludes the unrelated changes >> brought in by the merge/rebase. The pull request contains five additional >> commits since the last revision: >> >> - Merge branch 'master' into relativize-locals-JDK-8299795_2023-01-09 >> - Added references to JDK-8300197 >> - Updated some copyright dates. >> - Changed copyright date to 2023 >> - 8299795: Relativize locals in interpreter frames > >> Implementation of relativized locals in interpreter frames for x86. x64, >> arm, aarch64, ppc64le and riscv. Not relativized locals on zero and s390 but >> done some changes to cope with the changed generic code. Tested tier1-tier8 >> on supported platforms. The rest was sanity tested using Qemu, except s390, >> which was only tested by GitHub Actions. > > Please explain for the record the purpose of this patch. Neither the PR nor > the Jira issue explain. I presume it's something to do with making the > mounting and unmounting of interpreter frames easier, but this must be said > somewhere. @theRealAph are you planning to provide additional feedback, or can it be shipped? - PR: https://git.openjdk.org/jdk/pull/11902
Re: RFR: 8291555: Implement alternative fast-locking scheme [v4]
On Wed, 18 Jan 2023 20:10:21 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. 
>> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation protocol.
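The ANONYMOUS_OWNER handover described in the quoted text above can be illustrated with a small, purely sequential model. This is not HotSpot code; the names and the bool/pointer fields stand in for mark-word bits and the ObjectMonitor, and all concurrency is omitted so only the state transitions remain.

```cpp
// Toy model (not HotSpot code) of the ANONYMOUS_OWNER handover: a contending thread
// inflates a fast-locked object without knowing who holds the fast-lock, so it records a
// placeholder owner; the real holder recognizes the placeholder at monitorexit, claims
// ownership, and then exits the monitor properly.
#include <cstdio>

using ThreadId = int;
constexpr ThreadId NO_OWNER        = 0;
constexpr ThreadId ANONYMOUS_OWNER = -1;  // "somebody holds the fast-lock, we don't know who"

struct Monitor {
  ThreadId owner = NO_OWNER;
};

struct Object {
  bool     fast_locked = false;   // stands in for the 00 bits in the mark word
  Monitor* monitor     = nullptr; // stands in for an installed ObjectMonitor
};

// Contending thread: cannot tell who owns the fast-lock, so inflate with a placeholder.
void inflate_for_contention(Object* obj) {
  if (obj->fast_locked && obj->monitor == nullptr) {
    obj->monitor = new Monitor{ANONYMOUS_OWNER};
    std::printf("contender: inflated with ANONYMOUS_OWNER, now waiting\n");
  }
}

// Fast-lock holder at monitorexit: an anonymously-owned monitor must be owned by itself;
// fix the owner field, then exit the monitor properly, handing over to the contender.
void monitor_exit(Object* obj, ThreadId self) {
  if (obj->monitor != nullptr && obj->monitor->owner == ANONYMOUS_OWNER) {
    obj->monitor->owner = self;
    std::printf("holder %d: claimed anonymously-owned monitor\n", self);
  }
  if (obj->monitor != nullptr && obj->monitor->owner == self) {
    obj->monitor->owner = NO_OWNER;
    obj->fast_locked = false;
    std::printf("holder %d: exited monitor, contender can now acquire it\n", self);
  }
}

int main() {
  Object o;
  o.fast_locked = true;          // thread 1 fast-locked it earlier
  inflate_for_contention(&o);    // thread 2 arrives and contends
  monitor_exit(&o, /*self=*/1);  // thread 1 reaches monitorexit
  delete o.monitor;
}
```

The real code performs these transitions with atomic operations on the mark word and the monitor's owner field; the model above only shows the state machine.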
Re: RFR: 8291555: Implement alternative fast-locking scheme [v2]
On Wed, 18 Jan 2023 15:45:38 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. 
>> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation protocol.
Re: RFR: 8291555: Implement alternative fast-locking scheme
On Fri, 28 Oct 2022 20:17:37 GMT, Roman Kennke wrote: > This change adds a fast-locking scheme as an alternative to the current > stack-locking implementation. It retains the advantages of stack-locking > (namely fast locking in uncontended code-paths), while avoiding the overload > of the mark word. That overloading causes massive problems with Lilliput, > because it means we have to check and deal with this situation when trying to > access the mark-word. And because of the very racy nature, this turns out to > be very complex and would involve a variant of the inflation protocol to > ensure that the object header is stable. (The current implementation of > setting/fetching the i-hash provides a glimpse into the complexity). > > What the original stack-locking does is basically to push a stack-lock onto > the stack which consists only of the displaced header, and CAS a pointer to > this stack location into the object header (the lowest two header bits being > 00 indicate 'stack-locked'). The pointer into the stack can then be used to > identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two > header bits to 00 to indicate 'fast-locked' but does *not* overload the upper > bits with a stack-pointer. Instead, it pushes the object-reference to a > thread-local lock-stack. This is a new structure which is basically a small > array of oops that is associated with each thread. Experience shows that this > array typically remains very small (3-5 elements). Using this lock stack, it > is possible to query which threads own which locks. Most importantly, the > most common question 'does the current thread own me?' is very quickly > answered by doing a quick scan of the array. More complex queries like 'which > thread owns X?' are not performed in very performance-critical paths (usually > in code like JVMTI or deadlock detection) where it is ok to do more complex > operations (and we already do). The lock-stack is also a new set of GC roots, > and would be scanned during thread scanning, possibly concurrently, via the > normal protocols. > > The lock-stack is grown when needed. This means that we need to check for > potential overflow before attempting locking. When that is the case, locking > fast-paths would call into the runtime to grow the stack and handle the > locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check > on method entry to avoid (possibly lots) of such checks at locking sites. > > In contrast to stack-locking, fast-locking does *not* support recursive > locking (yet). When that happens, the fast-lock gets inflated to a full > monitor. It is not clear if it is worth to add support for recursive > fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, > it must inflate the fast-lock to a full monitor. Normally, we need to know > the current owning thread, and record that in the monitor, so that the > contending thread can wait for the current owner to properly exit the > monitor. However, fast-locking doesn't have this information. What we do > instead is to record a special marker ANONYMOUS_OWNER. When the thread that > currently holds the lock arrives at monitorexit, and observes > ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, > and then properly exits the monitor, and thus handing over to the contending > thread. > > As an alternative, I considered to remove stack-locking altogether, and only > use heavy monitors.
In most workloads this did not show measurable > regressions. However, in a few workloads, I have observed severe regressions. > All of them have been using old synchronized Java collections (Vector, > Stack), StringBuffer or similar code. The combination of two conditions leads > to regressions without stack- or fast-locking: 1. The workload synchronizes > on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and > 2. The workload churns such locks. IOW, uncontended use of Vector, > StringBuffer, etc as such is ok, but creating lots of such single-use, > single-threaded-locked objects leads to massive ObjectMonitor churn, which > can lead to a significant performance impact. But alas, such code exists, and > we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the > (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the > hashcode has been installed. Fast-locked headers can be used directly, for > monitor-locked objects we can easily reach-through to the displaced header. > This is safe because Java threads participate in monitor deflation protocol. > This would be implemented in a separate PR > > > Testing: > - [x] tier1
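To make the scheme above more concrete, here is a toy, self-contained model of the lock-stack idea. It is not the HotSpot implementation, just the mechanism the description talks about: CAS the two low mark-word bits from "unlocked" to "fast-locked" and record ownership by pushing the object onto a small thread-local stack.

```cpp
// Toy model (not HotSpot code) of the fast-locking scheme described above: CAS the two
// low mark-word bits from "unlocked" (01) to "fast-locked" (00) and, instead of storing
// a stack pointer in the header, push the object onto a small thread-local lock stack.
// Inflation, recursion and the races with concurrent inflation are left out.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Object {
  std::atomic<uintptr_t> mark{0x01};           // low bits 01 = unlocked, 00 = fast-locked
};

thread_local std::vector<Object*> lock_stack;  // per-thread; typically only 3-5 entries

bool fast_lock(Object* obj) {
  uintptr_t mark = obj->mark.load();
  if ((mark & 0x03) != 0x01) return false;             // not unlocked: caller must inflate
  uintptr_t locked = mark & ~uintptr_t(0x03);           // keep upper bits, low bits become 00
  if (!obj->mark.compare_exchange_strong(mark, locked)) return false;
  lock_stack.push_back(obj);                            // ownership is recorded thread-locally
  return true;
}

// "Does the current thread own me?" is just a scan of the small per-thread array.
bool current_thread_owns(Object* obj) {
  for (Object* o : lock_stack) if (o == obj) return true;
  return false;
}

bool fast_unlock(Object* obj) {
  if (lock_stack.empty() || lock_stack.back() != obj) return false;  // not ours / asymmetric
  obj->mark.store(obj->mark.load() | 0x01);             // back to unlocked (real code CASes here)
  lock_stack.pop_back();
  return true;
}

int main() {
  Object o;
  std::printf("lock:   %d\n", fast_lock(&o));
  std::printf("owns:   %d\n", current_thread_owns(&o));
  std::printf("unlock: %d\n", fast_unlock(&o));
}
```

In the real VM the stack is grown on demand and the unlock path also has to race against concurrent inflation, as the quoted description explains; the model ignores both.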
Re: RFR: 8299795: Relativize locals in interpreter frames [v4]
On Tue, 17 Jan 2023 08:35:45 GMT, Fredrik Bredberg wrote: >> Implementation of relativized locals in interpreter frames for x86. x64, >> arm, aarch64, ppc64le and riscv. >> Not relativized locals on zero and s390 but done some changes to cope with >> the changed generic code. >> Tested tier1-tier8 on supported platforms. The rest was sanity tested using >> Qemu, except s390, which was only tested by GitHub Actions. > > Fredrik Bredberg has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains five additional > commits since the last revision: > > - Merge branch 'master' into relativize-locals-JDK-8299795_2023-01-09 > - Added references to JDK-8300197 > - Updated some copyright dates. > - Changed copyright date to 2023 > - 8299795: Relativize locals in interpreter frames Thank you, looks good! - Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11902
Re: RFR: 8292741: Convert JvmtiTagMapTable to ResourceHashtable [v7]
On Thu, 8 Dec 2022 16:56:40 GMT, Afshin Zafari wrote: >> test of tier1-5 passed. > > Afshin Zafari has updated the pull request incrementally with one additional > commit since the last revision: > > 8292741: Convert JvmtiTagMapTable to ResourceHashtable Looks fine, thanks. (+- DavidH comment) - Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11288
Re: RFR: 8292741: Convert JvmtiTagMapTable to ResourceHashtable [v7]
On Mon, 12 Dec 2022 00:58:35 GMT, David Holmes wrote: > Also please do not force-push. FYI: I'm not sure it was even possible to fix the "full of unexpected changes" state without a force push. Even if it was, the effort required to fix it without a force push would be pretty significant. So I'd say this is one of those exceptions to the rule :) - PR: https://git.openjdk.org/jdk/pull/11288
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Fri, 28 Oct 2022 03:32:58 GMT, Remi Forax wrote: > I have some trouble seeing how it can be implemented, given that because of lock > coarsening (+ maybe OSR), the number of times a lock is held is different > between the interpreted code and the compiled code. Correct me if I'm wrong: only C2 eliminates locks, and C2 only compiles a method if its locking is properly structured. This should mean that when we restore the eliminated locks in deopt, we can inflate the recursive locks, which are no longer interleaved, and restructure the lock-stack accordingly. Is there any situation other than deopt where this would matter? - PR: https://git.openjdk.org/jdk/pull/10590
Re: RFR: 8291555: Replace stack-locking with fast-locking [v7]
On Mon, 24 Oct 2022 08:03:13 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a >> fast-locking scheme that retains the advantages of stack-locking (namely >> fast locking in uncontended code-paths), while avoiding the overload of the >> mark word. That overloading causes massive problems with Lilliput, because >> it means we have to check and deal with this situation. And because of the >> very racy nature, this turns out to be very complex and involved a variant >> of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations. The lock-stack is >> also a new set of GC roots, and would be scanned during thread scanning, >> possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. 
IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation protocol. >> This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are >> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, >> less is better. >> >> DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an >> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash >> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to >> exclude cassandra, h2o & kafka benchmarks because of incompati
Re: RFR: 8291555: Replace stack-locking with fast-locking [v4]
On Thu, 13 Oct 2022 10:34:04 GMT, Roman Kennke wrote: > It used to be very stable before that. I have backed out that change, can you > try again? Seems fine now, thanks. - PR: https://git.openjdk.org/jdk/pull/10590
Re: RFR: 8291555: Replace stack-locking with fast-locking [v4]
On Thu, 13 Oct 2022 07:33:48 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a >> fast-locking scheme that retains the advantages of stack-locking (namely >> fast locking in uncontended code-paths), while avoiding the overload of the >> mark word. That overloading causes massive problems with Lilliput, because >> it means we have to check and deal with this situation. And because of the >> very racy nature, this turns out to be very complex and involved a variant >> of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations. The lock-stack is >> also a new set of GC roots, and would be scanned during thread scanning, >> possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. 
IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation protocol. >> This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are >> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, >> less is better. >> >> DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an >> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash >> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to >> exclude cassandra, h2o & kafka benchmarks because of incompati
Re: RFR: 8291555: Replace stack-locking with fast-locking [v3]
On Tue, 11 Oct 2022 20:01:32 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a >> fast-locking scheme that retains the advantages of stack-locking (namely >> fast locking in uncontended code-paths), while avoiding the overload of the >> mark word. That overloading causes massive problems with Lilliput, because >> it means we have to check and deal with this situation. And because of the >> very racy nature, this turns out to be very complex and involved a variant >> of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations. The lock-stack is >> also a new set of GC roots, and would be scanned during thread scanning, >> possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. 
IOW, >> uncontended use of Vector, StringBuffer, etc. as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change makes it possible to simplify (and speed up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation protocol. >> This would be implemented in a separate PR. >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are >> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, >> less is better. >> >> DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an >> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash >> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to >> exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20.
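As an illustration of the lock-stack design described in the PR text above, here is a minimal, hypothetical Java sketch of a per-thread lock-stack: a small growable array of object references with the quick "does the current thread own me?" scan. The class and method names are invented for this sketch; the real structure lives inside HotSpot (as an array of oops that is also a GC root) and is not literally this Java code.

```java
// Minimal sketch of a per-thread lock-stack, assuming one instance per thread.
// Illustrative only; not the actual HotSpot data structure.
final class LockStack {
    private Object[] elements = new Object[4]; // typically stays very small (3-5 entries)
    private int top = 0;

    // monitorenter fast path: after a successful fast-lock CAS, remember the lockee.
    void push(Object lockee) {
        if (top == elements.length) {
            elements = java.util.Arrays.copyOf(elements, elements.length * 2);
        }
        elements[top++] = lockee;
    }

    // monitorexit fast path: with balanced locking the lockee is at the top.
    void pop(Object lockee) {
        if (top == 0 || elements[top - 1] != lockee) {
            throw new IllegalMonitorStateException("unbalanced monitorexit");
        }
        elements[--top] = null; // don't keep the reference alive
    }

    // The common query 'does the current thread own this lock?' is a linear
    // scan of a very small array.
    boolean owns(Object lockee) {
        for (int i = 0; i < top; i++) {
            if (elements[i] == lockee) {
                return true;
            }
        }
        return false;
    }
}
```

More expensive queries such as 'which thread owns X?' would scan every thread's lock-stack, which matches the observation above that those queries only occur on non-performance-critical paths.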
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: >>> How have you handled the interpreter lock-stack-area in your >>> implementation? Is it worth getting rid of it and consolidating with the >>> per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. >> The frame id is the previous fp, grabbed from "link()" when entering the locking >> code. >> >> ``` >> private static final void monitorEnter(Object o) { >> >> long monitorFrameId = getCallerFrameId(); >> ``` >> When popping we can thus check if there are still monitors/locks for the >> frame to be popped. >> Remove activation reads the lock stack, with a bunch of assembly, e.g.: >> ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, >> java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); >> ` >> If we would keep this, loom freezing would need to relativize and >> derelativize these values. >> (we only have interpreter) >> >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock >> synchronized methods. >> This code that unlocks all locks in the frame seems to have been added for >> JLS 17.1. >> I have asked for clarification and we only need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), >> there seems to be a way forward allowing more flexibility. >> >> Still working on trying to make what we have public, even if it's in rough >> shape and it's very unclear if that is the correct approach at all. > >> > How have you handled the interpreter lock-stack-area in your >> > implementation? Is it worth getting rid of it and consolidating with the >> > per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. >> The frame id is the previous fp, grabbed from "link()" when entering the locking >> code. >> >> ``` >> private static final void monitorEnter(Object o) { >> >> long monitorFrameId = getCallerFrameId(); >> ``` >> >> When popping we can thus check if there are still monitors/locks for the >> frame to be popped. Remove activation reads the lock stack, with a bunch of >> assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, >> java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep >> this, loom freezing would need to relativize and derelativize these values. >> (we only have interpreter) > > Hmm ok. I was thinking something similar, but instead of storing pairs > (oop/frame-id), push frame-markers on the lock-stack. > > But given that we only need all this for the interpreter, I am wondering if > keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) > is the saner thing to do. The overhead seems very small, perhaps very similar > to keeping track of frames in the per-thread lock-stack. > >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock >> synchronized methods. This code that unlocks all locks in the frame seems to >> have been added for JLS 17.1. I have asked for clarification and we only >> need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), >> there seems to be a way forward allowing more flexibility. > >> Still working on trying to make what we have public, even if it's in rough >> shape and it's very unclear if that is the correct approach at all. > > Nice! > From your snippets above I am gleaning that your implementation has the > actual lock-stack in Java. Is that correct? Is there a particular reason why > you need this? 
Is this for Loom? Would the implementation that I am proposing > here also work for your use-case(s)? > > Thanks, > Roman @rkennke I will have a look, but may I suggest opening a new PR and just referencing this one as background discussion? I think most of the comments above are not relevant enough for a new reviewer to struggle through. What do you think? - PR: https://git.openjdk.org/jdk/pull/9680
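For illustration, here is a hedged Java sketch of the frame-marker alternative floated above: instead of storing an (oop, frame id) pair per lock-stack entry, a sentinel marker is pushed for each frame that may lock, and removing the activation pops (and unlocks) everything down to that marker. All names (MarkerLockStack, pushFrameMarker, popFrame) are invented for the sketch and do not exist in HotSpot.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Sketch only: frame markers on the lock-stack instead of per-entry frame ids.
final class MarkerLockStack {
    private static final Object FRAME_MARKER = new Object();
    private final Deque<Object> stack = new ArrayDeque<>();

    // Assumption: a marker is pushed for every frame that may take locks.
    void pushFrameMarker() {
        stack.push(FRAME_MARKER);
    }

    void push(Object lockee) {
        stack.push(lockee);
    }

    void pop(Object lockee) {
        if (stack.peek() != lockee) {
            throw new IllegalMonitorStateException("unbalanced monitorexit");
        }
        stack.pop();
    }

    // Equivalent of remove_activation: release whatever the frame still holds,
    // down to and including its marker.
    void popFrame(Consumer<Object> unlock) {
        Object o;
        while ((o = stack.poll()) != null && o != FRAME_MARKER) {
            unlock.accept(o); // perform the monitorexit for a lock the frame left behind
        }
    }
}
```

The trade-off pointed out above still applies: if only the interpreter needs this, the existing per-frame lock area may be just as cheap and simpler than either encoding.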
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: > Nice! From your snippets above I am gleaning that your implementation has the > actual lock-stack in Java. Is that correct? Is there a particular reason why > you need this? Is this for Loom? Would the implementation that I am proposing > here also work for your use-case(s)? > Yes, the entire implementation is in Java.

```
void push(Object lockee, long fid) {
    if (this != Thread.currentThread()) Monitor.abort("invariant");
    if (lockStackPos == lockStack.length) {
        grow();
    }
    frameId[lockStackPos] = fid;
    lockStack[lockStackPos++] = lockee;
}
```

We are starting from the point of "let's do everything in Java". I want smart people to be able to change the implementation. So I really don't like the hardcoded assembly in remove_activation which does this check on the frame id on the lock stack. If we can make the changes to e.g. popframe and take a bit different approach to JVMS we may have a totally flexible Java implementation. But a flexible Java implementation means the compiler can't have intrinsics, so what will the performance be? We have more loose ends than we can handle at the moment. Your code may be usable for JOM if we lock the implementation to using a lock-stack and we are going to write intrinsics for it. There is no point in it being in Java if so, IMHO. - PR: https://git.openjdk.org/jdk/pull/9680
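To make the quoted snippet self-contained, here is a hypothetical sketch of the other half of that design: entries record a frame id (the previous fp), and when an activation is removed, every entry still belonging to that frame is popped and unlocked, which is the check done with hardcoded assembly in remove_activation. The class name, the grow() details and the unlock callback are assumptions made for the sketch.

```java
// Sketch only: an (oop, frame id) lock-stack like the one quoted above, plus the
// pop-by-frame check that remove_activation performs in assembly.
final class FrameIdLockStack {
    private Object[] lockStack = new Object[8];
    private long[] frameId = new long[8];
    private int lockStackPos = 0;

    void push(Object lockee, long fid) {
        if (lockStackPos == lockStack.length) {
            grow();
        }
        frameId[lockStackPos] = fid;
        lockStack[lockStackPos++] = lockee;
    }

    // Pop (and unlock) every entry recorded for the frame being removed.
    void popFrame(long fid, java.util.function.Consumer<Object> unlock) {
        while (lockStackPos > 0 && frameId[lockStackPos - 1] == fid) {
            Object lockee = lockStack[--lockStackPos];
            lockStack[lockStackPos] = null;  // drop the reference
            unlock.accept(lockee);           // monitorexit semantics supplied by the caller
        }
    }

    private void grow() {
        lockStack = java.util.Arrays.copyOf(lockStack, lockStack.length * 2);
        frameId = java.util.Arrays.copyOf(frameId, frameId.length * 2);
    }
}
```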
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Fri, 9 Sep 2022 19:01:14 GMT, Roman Kennke wrote: > How have you handled the interpreter lock-stack-area in your implementation? > Is it worth getting rid of it and consolidating with the per-thread lock-stack? At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is the previous fp, grabbed from "link()" when entering the locking code. ``` private static final void monitorEnter(Object o) { long monitorFrameId = getCallerFrameId(); ``` When popping we can thus check if there are still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); ` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized methods. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. So if we could make popframe do more work (popframe needs to unlock all), there seems to be a way forward allowing more flexibility. Still working on trying to make what we have public, even if it's in rough shape and it's very unclear if that is the correct approach at all. - PR: https://git.openjdk.org/jdk/pull/9680
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Tue, 16 Aug 2022 16:21:04 GMT, Roman Kennke wrote: > Strictly speaking, I believe the conditions check for the (weaker) balanced > property, but not for the (stronger) structured property. I know, but the text says: - "every exit on a given monitor matches a preceding entry on that monitor." - "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" I read this as: if the rules do not guarantee structured locking, the rules are not correct. The VM is allowed to enforce it. But that's just my take on it. EDIT: Maybe I'm reading too much into it. Maybe locking A, B and then unlocking A, B is considered structured locking? But then again, what if:

```
void foo_lock() {
    monitorenter(A);
    monitorenter(B);
    // If VM abruptly returns here
    // VM can unlock them in reverse order first B and then A ?
    monitorexit(A);
    monitorexit(B);
}
```

- PR: https://git.openjdk.org/jdk/pull/9680
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Thu, 11 Aug 2022 11:39:01 GMT, Robbin Ehn wrote: >>> > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build >>> > problem (seems weird that GHA did not catch it). >>> >>> NP, thanks. I notice some other user of owning_thread_from_monitor_owner() >>> such as DeadlockCycle::print_on_with() which asserts on "assert(adr != >>> reinterpret_cast(1)) failed: must convert to lock object". >> >> Do you know by any chance which tests trigger this? > >> > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build >> > > problem (seems weird that GHA did not catch it). >> > >> > >> > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() >> > such as DeadlockCycle::print_on_with() which asserts on "assert(adr != >> > reinterpret_cast(1)) failed: must convert to lock object". >> >> Do you know by any chance which tests trigger this? > > Yes, there are a couple to choose from; I think the jstack cmd may be > easiest: jstack/DeadlockDetectionTest.java > @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find > the comment, though) that we might need to support the bytecode sequence > monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I > have now made a testcase that checks this, and it does indeed fail with this > PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere > that it is required that monitorenter/exit are properly nested. I'll have to > fix this in the interpreter (JIT compilers refuse to compile > not-properly-nested monitorenter/exit anyway). > > See > https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm jvms-2.11.10 > Structured locking is the situation when, during a method invocation, every > exit on a given monitor matches a preceding entry on that monitor. Since > there is no assurance that all code submitted to the Java Virtual Machine > will perform structured locking, implementations of the Java Virtual Machine > are permitted but not required to enforce both of the following two rules > guaranteeing structured locking. Let T be a thread and M be a monitor. Then: > > The number of monitor entries performed by T on M during a method invocation > must equal the number of monitor exits performed by T on M during the method > invocation whether the method invocation completes normally or abruptly. > > At no point during a method invocation may the number of monitor exits > performed by T on M since the method invocation exceed the number of monitor > entries performed by T on M since the method invocation. > > Note that the monitor entry and exit automatically performed by the Java > Virtual Machine when invoking a synchronized method are considered to occur > during the calling method's invocation. I think the intent of the above was to allow enforcing structured locking. In other relevant projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. In that project JNI monitor enter/exit do not use the lockstack. I don't think we fully support unstructured locking today either:

```
void foo_lock() {
    monitorenter(this);
    // If VM abruptly returns here 'this' will be unlocked
    // Because VM assumes structured locking.
    // see e.g. remove_activation(...)
}
```

*I scratched this as it was a bit off topic.* - PR: https://git.openjdk.org/jdk/pull/9680
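To make the two quoted rules concrete, here is an illustrative Java sketch of what enforcing them per method invocation would amount to: track entries minus exits per monitor, never let exits exceed entries while the invocation runs, and require the balance to be zero when the invocation completes (normally or abruptly). This is purely explanatory; the VM does not literally keep such a map, and the class name is invented.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative check of the two JVMS 2.11.10 rules, kept per method invocation.
final class StructuredLockingCheck {
    private final Map<Object, Integer> entriesMinusExits = new HashMap<>();

    void monitorEnter(Object m) {
        entriesMinusExits.merge(m, 1, Integer::sum);
    }

    void monitorExit(Object m) {
        int balance = entriesMinusExits.getOrDefault(m, 0) - 1;
        if (balance < 0) {
            // Rule 2: exits on M may never exceed entries on M within this invocation.
            throw new IllegalMonitorStateException("more exits than entries on " + m);
        }
        entriesMinusExits.put(m, balance);
    }

    // Called when the invocation completes, normally or abruptly.
    void onMethodExit() {
        entriesMinusExits.forEach((m, balance) -> {
            if (balance != 0) {
                // Rule 1: entries and exits on M must match by the end of the invocation.
                throw new IllegalMonitorStateException("unbalanced locking on " + m);
            }
        });
    }
}
```

Note that such a check accepts the sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; mentioned above: it is balanced per monitor even though it is not properly nested, which is exactly the distinction between the balanced and the structured property discussed in this thread.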
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Thu, 11 Aug 2022 11:19:31 GMT, Roman Kennke wrote: > > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build > > > problem (seems weird that GHA did not catch it). > > > > > > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() > > such as DeadlockCycle::print_on_with() which asserts on "assert(adr != > > reinterpret_cast(1)) failed: must convert to lock object". > > Do you know by any chance which tests trigger this? Yes, there are a couple to choose from; I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java - PR: https://git.openjdk.org/jdk/pull/9680
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Tue, 9 Aug 2022 10:46:51 GMT, Roman Kennke wrote: > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build > problem (seems weird that GHA did not catch it). NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". - PR: https://git.openjdk.org/jdk/pull/9680
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Mon, 8 Aug 2022 18:29:54 GMT, Roman Kennke wrote: > > I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: > > ``` > > #7 0x7f7cefc5c1ce in Thread::is_lock_owned > > (this=this@entry=0x7f7ce801dd90, adr=adr@entry=0x1 <error: Cannot access memory at address 0x1>) at > > /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 > > #8 0x7f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, > > adr=0x1 <error: Cannot access memory at address 0x1>) at > > /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 > > #9 0x7f7cefc79ab0 in Threads::owning_thread_from_monitor_owner > > (t_list=, owner=owner@entry=0x1 <error: Cannot access memory at address 0x1>) > > at > > /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 > > ``` > > Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an > anonymous owner must be converted to the oop address before passing down to > Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? > Testing com/sun/jdi passes for me, now. Yes, that fixed it. I'm running more tests also. I got this build problem on aarch64: open/src/hotspot/share/asm/assembler.hpp:168), pid=3387376, tid=3387431 # assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target V [libjvm.so+0x4f4788] Label::~Label()+0x48 V [libjvm.so+0x424a44] cmpFastLockNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x764 V [libjvm.so+0x1643888] PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x538 V [libjvm.so+0xa85fcc] Compile::Code_Gen()+0x3bc - PR: https://git.openjdk.org/jdk/pull/9680
Re: RFR: 8291555: Replace stack-locking with fast-locking
On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a > fast-locking scheme that retains the advantages of stack-locking (namely fast > locking in uncontended code-paths), while avoiding the overload of the mark > word. That overloading causes massive problems with Lilliput, because it > means we have to check and deal with this situation. And because of the very > racy nature, this turns out to be very complex and involved a variant of the > inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto > the stack which consists only of the displaced header, and CAS a pointer to > this stack location into the object header (the lowest two header bits being > 00 indicate 'stack-locked'). The pointer into the stack can then be used to > identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two > header bits to 00 to indicate 'fast-locked' but does *not* overload the upper > bits with a stack-pointer. Instead, it pushes the object-reference to a > thread-local lock-stack. This is a new structure which is basically a small > array of oops that is associated with each thread. Experience shows that this > array typically remains very small (3-5 elements). Using this lock stack, it > is possible to query which threads own which locks. Most importantly, the > most common question 'does the current thread own me?' is very quickly > answered by doing a quick scan of the array. More complex queries like 'which > thread owns X?' are not performed in very performance-critical paths (usually > in code like JVMTI or deadlock detection) where it is ok to do more complex > operations. The lock-stack is also a new set of GC roots, and would be > scanned during thread scanning, possibly concurrently, via the normal > protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive > locking (yet). When that happens, the fast-lock gets inflated to a full > monitor. It is not clear whether it is worth adding support for recursive > fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, > it must inflate the fast-lock to a full monitor. Normally, we need to know > the current owning thread, and record that in the monitor, so that the > contending thread can wait for the current owner to properly exit the > monitor. However, fast-locking doesn't have this information. What we do > instead is to record a special marker ANONYMOUS_OWNER. When the thread that > currently holds the lock arrives at monitorexit, and observes > ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, > and then properly exits the monitor, thus handing it over to the contending > thread. > > As an alternative, I considered removing stack-locking altogether, and only > using heavy monitors. In most workloads this did not show measurable > regressions. However, in a few workloads, I have observed severe regressions. > All of them have been using old synchronized Java collections (Vector, > Stack), StringBuffer or similar code. The combination of two conditions leads > to regressions without stack- or fast-locking: 1. The workload synchronizes > on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and > 2. The workload churns such locks. 
IOW, uncontended use of Vector, > StringBuffer, etc. as such is ok, but creating lots of such single-use, > single-threaded-locked objects leads to massive ObjectMonitor churn, which > can lead to a significant performance impact. But alas, such code exists, and > we probably don't want to punish it if we can avoid it. > > This change makes it possible to simplify (and speed up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the > (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the > hashcode has been installed. Fast-locked headers can be used directly, for > monitor-locked objects we can easily reach-through to the displaced header. > This is safe because Java threads participate in monitor deflation protocol. > This would be implemented in a separate PR. > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are > always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less > is better. > > DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an > AWS m6g.metal instance). It is using DaCapo evaluation version, git hash > 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to > exclude cassandra, h2o & kafka benchmarks because of incompatibility with > JDK20. Benchmarks that showed results far off the baseline or showed
Re: RFR: 8294406: Test runtime/handshake/HandshakeDirectTest.java failed: JVMTI_ERROR_WRONG_PHASE
On Fri, 30 Sep 2022 01:32:44 GMT, Leonid Mesnik wrote: > 8294406: Test runtime/handshake/HandshakeDirectTest.java failed: > JVMTI_ERROR_WRONG_PHASE The reason for it being a daemon is that this thread should keep trying to suspend/resume while we are terminating, so that we hit those rare exit paths where a thread exits with a pending suspend. - PR: https://git.openjdk.org/jdk/pull/10502
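For illustration only, a hedged sketch of the shape of such a suspender: a daemon thread that keeps suspending and resuming a target in a loop so that, even while the VM is shutting down, it can still catch a thread exiting with a pending suspend. This is not the actual test code; the suspension mechanism is abstracted behind callbacks here, and the class and method names are invented.

```java
import java.util.function.Consumer;

// Illustrative sketch only, not the actual HandshakeDirectTest code.
final class SuspenderSketch {
    // Starts a daemon thread that keeps suspending/resuming 'target' so that a
    // thread exit with a pending suspend can be hit even during VM termination.
    static Thread startSuspender(Thread target, Consumer<Thread> suspend, Consumer<Thread> resume) {
        Thread suspender = new Thread(() -> {
            while (target.isAlive()) {
                suspend.accept(target);
                Thread.onSpinWait();   // brief pause while the target is suspended
                resume.accept(target);
            }
        }, "suspender");
        // Daemon: it must not keep the VM alive, and it keeps poking the target
        // during termination, which is when the rare exit paths are reachable.
        suspender.setDaemon(true);
        suspender.start();
        return suspender;
    }
}
```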
Re: RFR: 8293592: Remove JVM_StopThread, stillborn, and related cleanup [v2]
On Fri, 23 Sep 2022 21:47:06 GMT, David Holmes wrote: >> Now that Thread.stop has been degraded to throw >> `UnsupportedOperationException` (JDK-8299610) the only direct source of >> async exceptions is from JVMTI `StopThread`. We can remove the >> `JVM_StopThread` code, remove the `stillborn` field from `java.lang.Thread` >> and its associated accesses from the VM, and we can stop special-casing >> `ThreadDeath` handling (as was done for the JDK code as part of JDK-8299610). >> >> Note that JVMTI `StopThread` can only act on a thread that is alive, so it >> is no longer possible to stop a thread before it has been started. >> >> Also note that there is a change in behaviour for JNI `ExceptionDescribe` as >> it no longer ignores `ThreadDeath` exceptions (not that it was ever >> specified to ignore them, it simply mirrored the behaviour of the default >> `UncaughtExceptionHandler` in `java.lang.ThreadGroup` - which also no longer >> ignores them so the mirroring behaviour remains the same). >> >> Testing: tiers 1-3 > > David Holmes has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 16 commits: > > - Merge branch 'master' into 8293592-JVM_StopThread > - Removal all special handling of ThreadDeath. > - Remove all references to the stillborn field > - Initial commit: remove JVM_StopThread > - Merge > - Updates to Java Thread Primitive Deprecation page > - Repalce "it" with "victim thread" > - Merge > - Revert test/langtools/ProblemList.txt as jshell tests no longer rely on > Thread.stop > - become -> became in javadoc > - ... and 6 more: https://git.openjdk.org/jdk/compare/3675f4c2...4eb07903 Looks good, thanks for fixing. - Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10400
Re: RFR: 8288752: Split thread implementation files [v2]
On Tue, 21 Jun 2022 12:22:46 GMT, Coleen Phillimore wrote: >> This change splits thread.hpp/cpp into javaThread and threads files. >> >> I left the commits intact to better see the progression of changes, but most >> files are include file changes. The only tricky parts are that some files >> were included in thread.hpp, like mutexLocker.cpp, which has to be included >> in the files that need it. >> >> I didn't update the copyrights to save diffs but will before integration. >> Also I won't integrate until after Dan's JDK 19 changes are merged into JDK >> 20. >> >> Tested with tier1-4 on Oracle supported platforms and built on other >> platforms. > > Coleen Phillimore has updated the pull request incrementally with one > additional commit since the last revision: > > Fix Shenandoah build. Looks good, thanks. - Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9218