Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Sat, 11 Mar 2023 14:52:54 GMT, Thomas Stuefe wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Use nullptr instead of NULL in touched code (shared) > > src/hotspot/share/runtime/lockStack.hpp line 64: > >> 62: >> 63: // GC support >> 64: inline void oops_do(OopClosure* cl); > > Does this need to be nonconst? Yes, because the OopClosures can (and will) update the inline array elements. - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Sat, 11 Mar 2023 14:57:19 GMT, Thomas Stuefe wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Use nullptr instead of NULL in touched code (shared) > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6234: > >> 6232: >> 6233: // Load (object->mark() | 1) into hdr >> 6234: orr(hdr, hdr, markWord::unlocked_value); > > I wondered why this is needed. Should we not have the header of an unloaded > object in hdr? Or is this a safeguard against a misuse of this function > (called with the header of an already locked object)? It could be a monitor-locked header. In C2 this is not possible and we *could* save an instruction here, I guess. Not sure if it is worth it, though. - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Sat, 11 Mar 2023 14:53:29 GMT, Thomas Stuefe wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Use nullptr instead of NULL in touched code (shared) > > src/hotspot/share/runtime/lockStack.hpp line 52: > >> 50: static ByteSize limit_offset() { return byte_offset_of(LockStack, >> _limit); } >> 51: >> 52: static void ensure_lock_stack_size(oop* _required_limit); > > I would split this, do the comparison inline, only the actual growth in the > cpp file. Just realized that this interface is actually a bit odd: since we pass a wish pointer that has nothing to do with either current state nor final result. In fact, the pointer could at the moment point into the lock stack of a different thread. So this is "the pointer that would designate the end of the LockStack if the lockstack were enlarged *in-place*". Maybe add a comment like that. - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Sat, 11 Mar 2023 15:57:53 GMT, Roman Kennke wrote: > > Proposal for omitting the lockstack size check (at least in 75% of all > > times): > > > > * We know that Thread as well as grown lockstack backing buffers start at > > malloc-aligned boundaries. Practically this is 16 (64-bit), 4-8 (32-bit). > > So at the very least 4. > > * Make the initial lockstack this size. Define it so that initial slot > > stack starts at offset 0. > > * Load the current slot pointer as you do now. Check the lowest 2 bits. If > > all are zero, go the slower path (load the current limit and compare > > against limit, ...). > > * If bit 0 or 1 are set, you can omit this check. You are done since you > > have not yet reached the limit. > > * You can expand this proposal to any alignment you like. You need to > > declare the lockstack slots with `alignof(X)`, and the compiler will take > > care that the _initial_ slot stack is always well aligned. As for larger > > slot stacks, we will have to allocate them in an aligned fashion using > > posix_memalign (we need this as NMT-wrapped version, but thats trivial) > > This would only work when pushing a single slot, right? Have you seen what > we're doing in the compiled (C1 and C2) paths (in x86_64 and aarch64)? There > we're doing a (conservative) estimate how many lock-slots are needed in the > method, and check for enough slots upon method entry once, and then elide the > check altogether in the lock-enter implementation. Yeah, I just realized this myself. I started working on the template interpreter first, where we push single stack slots. There it may still make sense. - PR: https://git.openjdk.org/jdk/pull/10907
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
On Fri, 10 Mar 2023 12:45:12 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v21]
> This change adds a fast-locking scheme as an alternative to the current > stack-locking implementation. It retains the advantages of stack-locking > (namely fast locking in uncontended code-paths), while avoiding the overload > of the mark word. That overloading causes massive problems with Lilliput, > because it means we have to check and deal with this situation when trying to > access the mark-word. And because of the very racy nature, this turns out to > be very complex and would involve a variant of the inflation protocol to > ensure that the object header is stable. (The current implementation of > setting/fetching the i-hash provides a glimpse into the complexity). > > What the original stack-locking does is basically to push a stack-lock onto > the stack which consists only of the displaced header, and CAS a pointer to > this stack location into the object header (the lowest two header bits being > 00 indicate 'stack-locked'). The pointer into the stack can then be used to > identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two > header bits to 00 to indicate 'fast-locked' but does *not* overload the upper > bits with a stack-pointer. Instead, it pushes the object-reference to a > thread-local lock-stack. This is a new structure which is basically a small > array of oops that is associated with each thread. Experience shows that this > array typcially remains very small (3-5 elements). Using this lock stack, it > is possible to query which threads own which locks. Most importantly, the > most common question 'does the current thread own me?' is very quickly > answered by doing a quick scan of the array. More complex queries like 'which > thread owns X?' are not performed in very performance-critical paths (usually > in code like JVMTI or deadlock detection) where it is ok to do more complex > operations (and we already do). The lock-stack is also a new set of GC roots, > and would be scanned during thread scanning, possibly concurrently, via the > normal p rotocols. > > The lock-stack is grown when needed. This means that we need to check for > potential overflow before attempting locking. When that is the case, locking > fast-paths would call into the runtime to grow the stack and handle the > locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check > on method entry to avoid (possibly lots) of such checks at locking sites. > > In contrast to stack-locking, fast-locking does *not* support recursive > locking (yet). When that happens, the fast-lock gets inflated to a full > monitor. It is not clear if it is worth to add support for recursive > fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, > it must inflate the fast-lock to a full monitor. Normally, we need to know > the current owning thread, and record that in the monitor, so that the > contending thread can wait for the current owner to properly exit the > monitor. However, fast-locking doesn't have this information. What we do > instead is to record a special marker ANONYMOUS_OWNER. When the thread that > currently holds the lock arrives at monitorexit, and observes > ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, > and then properly exits the monitor, and thus handing over to the contending > thread. > > As an alternative, I considered to remove stack-locking altogether, and only > use heavy monitors. In most workloads this did not show measurable > regressions. However, in a few workloads, I have observed severe regressions. > All of them have been using old synchronized Java collections (Vector, > Stack), StringBuffer or similar code. The combination of two conditions leads > to regressions without stack- or fast-locking: 1. The workload synchronizes > on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and > 2. The workload churns such locks. IOW, uncontended use of Vector, > StringBuffer, etc as such is ok, but creating lots of such single-use, > single-threaded-locked objects leads to massive ObjectMonitor churn, which > can lead to a significant performance impact. But alas, such code exists, and > we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the > (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the > hashcode has been installed. Fast-locked headers can be used directly, for > monitor-locked objects we can easily reach-through to the displaced header. > This is safe because Java threads participate in monitor deflation protocol. > This would be implemented in a separate PR > > > Testing: > - [x] tier1 x86_64 x aarch64 x +UseFastLocking > - [x] tier2