Re: RFR: 8257967: JFR: Events for loaded agents [v15]

2023-03-31 Thread Serguei Spitsyn
On Fri, 31 Mar 2023 11:18:23 GMT, Markus Grönlund  wrote:

>> Greetings,
>> 
>> We are adding support to let JFR report on Agents.
>> 
>>  Design
>> 
>> An Agent is a library that uses any instrumentation or profiling APIs. Most 
>> agents are started and initialized on the command line, but agents can also 
>> be loaded dynamically during runtime. Because command line agents initialize 
>> during the VM startup sequence, they add to the overall startup time latency 
>> in getting the VM ready. The events will report on the time the agent took 
>> to initialize.
>> 
>> A JavaAgent is an agent written in the Java programming language, using the 
>> APIs in the package 
>> [java.lang.instrument](https://docs.oracle.com/en/java/javase/19/docs/api/java.instrument/java/lang/instrument/package-summary.html)
>> 
>> A JavaAgent is sometimes called a JPLIS agent, where the acronym JPLIS 
>> stands for Java Programming Language Instrumentation Services.
>> 
>> To report on JavaAgents, JFR will add the new event type jdk.JavaAgent and 
>> events will look similar to these two examples:
>> 
>> // Command line
>> jdk.JavaAgent {
>>   startTime = 12:31:19.789 (2023-03-08)
>>   name = "JavaAgent.jar"
>>   options = "foo=bar"
>>   dynamic = false
>>   initializationTime = 12:31:15.574 (2023-03-08)
>>   initializationDuration = 172 ms
>> }
>> 
>> // Dynamic load
>> jdk.JavaAgent {
>>   startTime = 12:31:31.158 (2023-03-08)
>>   name = "JavaAgent.jar"
>>   options = "bar=baz"
>>   dynamic = true
>>   initializationTime = 12:31:31.037 (2023-03-08)
>>   initializationDuration = 64,1 ms
>> }
>> 
>> The jdk.JavaAgent event type is a JFR periodic event that iterates over 
>> running Java agents.
>> 
>> For a JavaAgent event, the agent's name will be the specific .jar file 
>> containing the instrumentation code. The options will be the specific 
>> options passed to the .jar file as part of launching the agent, for example, 
>> on the command line: -javaagent:JavaAgent.jar=foo=bar.
>> 
>> The "dynamic" field indicates whether the agent was loaded via the command 
>> line (dynamic = false) or dynamically (dynamic = true).
>> 
>> "initializationTime" is the timestamp the JVM invoked the initialization 
>> method, and "initializationDuration" is the duration of executing the 
>> initialization method.
>> 
>> "startTime" represents the time the JFR framework issued the periodic event; 
>> hence "initializationTime" will be earlier than "startTime".
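
For illustration only, here is a minimal sketch of how a consumer might read the proposed events back from a recording with the standard jdk.jfr.consumer API. The event name and field names are taken from the examples above; the class name JavaAgentEvents and its helper methods are hypothetical, and on a JDK without this change the list is simply empty.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class JavaAgentEvents {

    // Event name as proposed in this mail (assumption for JDKs without the change).
    static final String EVENT_NAME = "jdk.JavaAgent";

    // Makes a very short recording with the agent event enabled and dumps it to a temp file.
    static Path recordBriefly() {
        try (Recording r = new Recording()) {
            Path file = Files.createTempFile("agents", ".jfr");
            r.enable(EVENT_NAME);
            r.start();
            r.stop();
            r.dump(file);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns one summary line per jdk.JavaAgent event found in the recording file.
    static List<String> summarize(Path recording) {
        List<String> lines = new ArrayList<>();
        try {
            for (RecordedEvent e : RecordingFile.readAllEvents(recording)) {
                if (!EVENT_NAME.equals(e.getEventType().getName())) {
                    continue; // skip every other event type
                }
                lines.add(String.format("%s options=%s dynamic=%b init=%d ms",
                        e.getString("name"),
                        e.getString("options"),
                        e.getBoolean("dynamic"),
                        e.getDuration("initializationDuration").toMillis()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        summarize(recordBriefly()).forEach(System.out::println);
    }
}
```

Running it with a -javaagent option on a JDK that includes the event should print one line per loaded Java agent.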
>> 
>> An agent can also be written in a native programming language using the [JVM 
>> Tools Interface 
>> (JVMTI)](https://docs.oracle.com/en/java/javase/19/docs/specs/jvmti.html). 
>> This kind of agent, sometimes called a native agent, is a platform-specific 
>> binary, sometimes referred to as a library, but here it means a .so or .dll 
>> file.
>> 
>> To report on native agents, JFR will add the new event type jdk.NativeAgent 
>> and events will look similar to this example:
>> 
>> jdk.NativeAgent {
>>   startTime = 12:31:40.398 (2023-03-08)
>>   name = "jdwp"
>>   options = "transport=dt_socket,server=y,address=any,onjcmd=y"
>>   dynamic = false
>>   initializationTime = 12:31:36.142 (2023-03-08)
>>   initializationDuration = 0,00184 ms
>>   path = 
>> "c:\ade\github\openjdk\jdk\build\windows-x86_64-server-slowdebug\jdk\bin\jdwp.dll"
>> }
>> 
>> The layout of the event type is very similar to the jdk.JavaAgent event, but 
>> here the path to the native library is reported.
>> 
>> The initialization of a native agent is performed by invoking an 
>> agent-specified callback routine. The "initializationTime" is when the JVM 
>> sent or would have sent the JVMTI VMInit event to a specified callback. 
>> "initializationDuration" is the duration to execute that specific callback. 
>> If no callback is specified for the JVMTI VMInit event, the 
>> "initializationDuration" will be 0. If the agent is loaded dynamically, 
>> "initializationDuration" is the time taken to execute the Agent_OnAttach 
>> callback.
>> 
>>  Implementation
>> 
>> There has not existed a reification of a JavaAgent directly in the JVM, as 
>> these are built on top of the JDK native library, "instrument", using a 
>> many-to-one mapping. At the level of the JVM, the only representation of 
>> agents after startup is through JvmtiEnv's, which agents request from the 
>> JVM during startup and initialization — as such, mapping which JvmtiEnv 
>> belongs to what JavaAgent was not possible before.
>> 
>> Using implementation details of how the JDK native library "instrument" 
>> interacts with the JVM, we can build this mapping to track what JvmtiEnv's 
>> "belong" to what JavaAgent. This mapping now lets us report the 
>> Java-relevant context (name, options) and measure the time it takes for the 
>> JavaAgent to initialize.
>> 
>> When implementing this capability, it was necessary to refactor the code 
>> used to represent agents, AgentLibrary. The previous implementation was 
>> located primarily in arguments.cpp, 

Re: RFR: 8257967: JFR: Events for loaded agents [v10]

2023-03-31 Thread Serguei Spitsyn
On Wed, 22 Mar 2023 09:18:51 GMT, David Holmes  wrote:

>> src/hotspot/share/prims/jvmtiEnvBase.hpp line 166:
>> 
>>> 164: 
>>> 165:   const void* get_env_local_storage() { return _env_local_storage; }
>>> 166: 
>> 
>> Why was this change/move necessary? Do I miss anything?
>
> It is now public, not protected.

I see, thanks.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12923#discussion_r1155048753


Re: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v5]

2023-03-31 Thread Naoto Sato
On Fri, 17 Mar 2023 22:27:48 GMT, Justin Lu  wrote:

>> This PR converts Unicode sequences to UTF-8 native in .properties file. 
>> (Excluding the Unicode space and tab sequence). The conversion was done 
>> using native2ascii.
>> 
>> In addition, the build logic is adjusted to support reading in the 
>> .properties files as UTF-8 during the conversion from .properties file to 
>> .java ListResourceBundle file.
>
> Justin Lu has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Close streams when finished loading into props

Hmm, I just wonder why they are sticking to ISO-8859-1 as the default. I know 
j.u.Properties defaults to 8859-1, but PropertyResourceBundle, which is their 
primary use, has defaulted to UTF-8 since JDK 9 (https://openjdk.org/jeps/226).
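
To make the difference concrete, here is a small sketch of the two java.util.Properties loading paths. load(InputStream) decodes as ISO-8859-1, while load(Reader) uses whatever charset the reader was created with. The class name PropertiesEncodingDemo and the sample key are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Properties.load(InputStream) always decodes the bytes as ISO-8859-1.
    static String loadAsLatin1(byte[] data, String key) {
        Properties p = new Properties();
        try {
            p.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty(key);
    }

    // Properties.load(Reader) relies on the charset of the Reader, here UTF-8.
    static String loadAsUtf8(byte[] data, String key) {
        Properties p = new Properties();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data),
                                              StandardCharsets.UTF_8)) {
            p.load(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty(key);
    }

    public static void main(String[] args) {
        String expected = "Gr\u00fc\u00dfe";
        byte[] utf8 = ("greeting=" + expected + "\n").getBytes(StandardCharsets.UTF_8);
        // The UTF-8 reader round-trips the value; the ISO-8859-1 path does not.
        System.out.println(loadAsUtf8(utf8, "greeting").equals(expected));   // true
        System.out.println(loadAsLatin1(utf8, "greeting").equals(expected)); // false
    }
}
```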

-

PR Comment: https://git.openjdk.org/jdk/pull/12726#issuecomment-1492682703


Re: RFR: 8291555: Implement alternative fast-locking scheme [v47]

2023-03-31 Thread Dean Long
On Fri, 31 Mar 2023 07:25:48 GMT, Thomas Stuefe  wrote:

>> Roman Kennke has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Use int instead of size_t for cached offsets, to match the uncached offset 
>> type and avoid build failures
>
> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6234:
> 
>> 6232:   orr(hdr, hdr, markWord::unlocked_value);
>> 6233:   // Clear lock-bits, into t2
>> 6234:   eor(t2, hdr, markWord::unlocked_value);
> 
> In arm, I use a combination of bic and orr instead. That gives me, with just 
> two instructions, added safety against someone handing in a "11" marked MW. I 
> know, should never happen, but better safe.
> 
> 
>   ldr(new_hdr, Address(obj, oopDesc::mark_offset_in_bytes()));
>   bic(new_hdr, new_hdr, markWord::lock_mask_in_place);  // new header (00)
>   orr(old_hdr, new_hdr, markWord::unlocked_value);  // old header (01)
> 
> (note that I moved MW loading down into MA::fast_lock for unrelated reasons).
> 
> Unfortunately, on aarch64 there seem to be no bic variants that accept 
> immediates. So it would take one more instruction to get the same result:
> 
> 
> -  // Load (object->mark() | 1) into hdr
> -  orr(hdr, hdr, markWord::unlocked_value);
> -  // Clear lock-bits, into t2
> -  eor(t2, hdr, markWord::unlocked_value);
> +  // Prepare new and old header
> +  mov(t2, markWord::lock_mask_in_place);
> +  bic(t2, hdr, t2);
> +  orr(hdr, t2, markWord::unlocked_value);
> 
> 
> But maybe there is a better way that does not need three instructions.

There is a BFC (Bitfield Clear) pseudo-instruction that uses the BFM 
instruction.
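
As a side note on what is being compared here, the following sketch models the two header computations in plain Java. It assumes the lock mask is the two lowest bits (0b11) and the unlocked value is 0b01, mirroring markWord::lock_mask_in_place and markWord::unlocked_value; the class and method names are hypothetical and this is not HotSpot code.

```java
public class MarkWordBits {

    // Assumed values, mirroring markWord::lock_mask_in_place and markWord::unlocked_value.
    static final long LOCK_MASK = 0b11L;
    static final long UNLOCKED = 0b01L;

    // Models the orr + eor sequence. Returns {old header, new header}.
    static long[] orrEor(long mark) {
        long oldHdr = mark | UNLOCKED;   // orr(hdr, hdr, unlocked_value)
        long newHdr = oldHdr ^ UNLOCKED; // eor(t2, hdr, unlocked_value): clears bit 0 only
        return new long[] { oldHdr, newHdr };
    }

    // Models the bic + orr sequence. Returns {old header, new header}.
    static long[] bicOrr(long mark) {
        long newHdr = mark & ~LOCK_MASK; // bic: clears both lock bits
        long oldHdr = newHdr | UNLOCKED; // orr: old header with lock bits 01
        return new long[] { oldHdr, newHdr };
    }

    public static void main(String[] args) {
        long unlocked = 0x1000L | 0b01L; // expected input: lock bits 01
        long marked = 0x1000L | 0b11L;   // unexpected input: lock bits 11
        for (long mark : new long[] { unlocked, marked }) {
            long[] a = orrEor(mark);
            long[] b = bicOrr(mark);
            System.out.printf("mark=%x orr/eor: old=%x new=%x  bic/orr: old=%x new=%x%n",
                    mark, a[0], a[1], b[0], b[1]);
        }
    }
}
```

Both variants agree for a header with lock bits 01. Only the bic-based one still produces 00/01 when handed a header with lock bits 11, which is the extra safety mentioned above.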

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1154955795


Re: RFR: 8291555: Implement alternative fast-locking scheme [v49]

2023-03-31 Thread Daniel D . Daugherty
On Fri, 31 Mar 2023 19:39:03 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is fixed size, currently with 8 elements. According to my 
>> experiments with various workloads, this covers the vast majority of 
>> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
>> thread at a time). We check for overflow in the fast-paths and when the 
>> lock-stack is full, we take the slow-path, which would inflate the lock to a 
>> monitor. That case should be very rare.
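
A rough Java model of the per-thread lock-stack described above may help readers unfamiliar with the idea. It is only a sketch under the stated assumptions (fixed capacity of 8, identity comparison, overflow reported to the caller, who would then take the slow path). The class LockStackSketch and its methods are hypothetical and do not correspond to the HotSpot implementation.

```java
public class LockStackSketch {

    // Capacity as stated in the description above.
    static final int CAPACITY = 8;

    private final Object[] slots = new Object[CAPACITY];
    private int top = 0; // number of entries currently in use

    boolean isFull() {
        return top == CAPACITY;
    }

    // Records a fast-locked object. Returns false when the stack is full,
    // in which case the caller would fall back to the slow path.
    boolean push(Object obj) {
        if (isFull()) {
            return false;
        }
        slots[top++] = obj;
        return true;
    }

    // Answers "does this thread own obj?" with a linear scan, newest entry first.
    boolean contains(Object obj) {
        for (int i = top - 1; i >= 0; i--) {
            if (slots[i] == obj) {
                return true;
            }
        }
        return false;
    }

    // Removes obj on unlock, shifting later entries down to keep the array dense.
    boolean remove(Object obj) {
        for (int i = top - 1; i >= 0; i--) {
            if (slots[i] == obj) {
                System.arraycopy(slots, i + 1, slots, i, top - i - 1);
                slots[--top] = null;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LockStackSketch stack = new LockStackSketch();
        Object a = new Object();
        stack.push(a);
        System.out.println(stack.contains(a)); // true
        stack.remove(a);
        System.out.println(stack.contains(a)); // false
    }
}
```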
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thus handing over to the contending 
>> thread.
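
The hand-over just described can be modelled roughly as follows. This is a simplified sketch, not the HotSpot code: Monitor, inflateFastLocked, fixOwner and exit are hypothetical names, and the sketch takes for granted that the exiting thread already knows (for example from its lock-stack) that it holds the lock.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AnonymousOwnerSketch {

    // Sentinel standing in for the ANONYMOUS_OWNER marker mentioned above.
    static final Object ANONYMOUS_OWNER = new Object();

    static final class Monitor {
        final AtomicReference<Object> owner = new AtomicReference<>();
    }

    // A contending thread inflates a fast-locked object without knowing who
    // owns it, so it records the anonymous marker as the owner.
    static Monitor inflateFastLocked() {
        Monitor m = new Monitor();
        m.owner.set(ANONYMOUS_OWNER);
        return m;
    }

    // The thread that holds the lock replaces the marker with itself.
    static void fixOwner(Monitor m, Thread self) {
        m.owner.compareAndSet(ANONYMOUS_OWNER, self);
    }

    // Exit path of the lock holder: fix the owner if needed, then release.
    static void exit(Monitor m, Thread self) {
        fixOwner(m, self);
        if (!m.owner.compareAndSet(self, null)) {
            throw new IllegalMonitorStateException("not the owner");
        }
    }

    public static void main(String[] args) {
        Monitor m = inflateFastLocked();
        exit(m, Thread.currentThread());
        System.out.println(m.owner.get() == null); // true: monitor released
    }
}
```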
>> 
>> As an alternative, I considered removing stack-locking altogether and using 
>> only heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change makes it possible to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads 

Re: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v5]

2023-03-31 Thread Justin Lu
On Fri, 17 Mar 2023 22:27:48 GMT, Justin Lu  wrote:

>> This PR converts Unicode sequences to UTF-8 native in .properties file. 
>> (Excluding the Unicode space and tab sequence). The conversion was done 
>> using native2ascii.
>> 
>> In addition, the build logic is adjusted to support reading in the 
>> .properties files as UTF-8 during the conversion from .properties file to 
>> .java ListResourceBundle file.
>
> Justin Lu has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Close streams when finished loading into props

Something to consider is that IntelliJ defaults .properties files to ISO 
8859-1. 

https://www.jetbrains.com/help/idea/properties-files.html#encoding

So users of IntelliJ (and other IDEs that default to ISO 8859-1 for .properties 
files) will need to change the default encoding to UTF-8 for such files. Or, 
ideally, the respective IDEs can change their default encoding for .properties 
files if this change is integrated.

-

PR Comment: https://git.openjdk.org/jdk/pull/12726#issuecomment-1492640306


Re: RFR: 8297286: runtime/vthread tests crashing after JDK-8296324 [v15]

2023-03-31 Thread Serguei Spitsyn
> The fix is to enable virtual threads support for late binding JVMTI agents.
> The fix includes:
> - New function `JvmtiEnvBase::enable_virtual_threads_notify_jvmti()`, which 
> enables JVMTI VTMS transition notifications when an agent is loaded into a 
> running VM. This function executes a VM operation that counts the VTMS 
> transition bits in all `JavaThread`s to correctly set the static counter 
> `_VTMS_transition_count` needed for the VTMS transition protocol.
> - New function `JvmtiEnvBase::disable_virtual_threads_notify_jvmti()` which 
> is needed for testing. It is used by the `WhiteBox` API.
> - New WhiteBox function `WB_SetVirtualThreadsNotifyJvmtiMode(JNIEnv* env, 
> jobject wb, jboolean enable)` needed for testing of this update.
> - New regression test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
> 
> Testing:
> - New test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
> - The originally failed tests are expected to pass now:
>   `runtime/vthread/RedefineClass.java`
>   `runtime/vthread/TestObjectAllocationSampleEvent.java` 
> - In progress: Run tiers 1-6 to make sure there are no regressions.

Serguei Spitsyn has updated the pull request incrementally with one additional 
commit since the last revision:

  minor simplification in ToggleNotifyJvmtiTest.java

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/13133/files
  - new: https://git.openjdk.org/jdk/pull/13133/files/aef87273..c55b6b38

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13133&range=14
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13133&range=13-14

  Stats: 13 lines in 1 file changed: 3 ins; 7 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/13133.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13133/head:pull/13133

PR: https://git.openjdk.org/jdk/pull/13133


Re: RFR: 8297286: runtime/vthread tests crashing after JDK-8296324 [v14]

2023-03-31 Thread Serguei Spitsyn
On Fri, 31 Mar 2023 06:52:18 GMT, Serguei Spitsyn  wrote:

>> The fix is to enable virtual threads support for late binding JVMTI agents.
>> The fix includes:
>> - New function `JvmtiEnvBase::enable_virtual_threads_notify_jvmti()`, which 
>> enables JVMTI VTMS transition notifications when an agent is loaded into a 
>> running VM. This function executes a VM operation that counts the VTMS 
>> transition bits in all `JavaThread`s to correctly set the static counter 
>> `_VTMS_transition_count` needed for the VTMS transition protocol.
>> - New function `JvmtiEnvBase::disable_virtual_threads_notify_jvmti()` which 
>> is needed for testing. It is used by the `WhiteBox` API.
>> - New WhiteBox function `WB_SetVirtualThreadsNotifyJvmtiMode(JNIEnv* env, 
>> jobject wb, jboolean enable)` needed for testing of this update.
>> - New regression test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
>> 
>> Testing:
>> - New test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
>> - The originally failed tests are expected to pass now:
>>   `runtime/vthread/RedefineClass.java`
>>   `runtime/vthread/TestObjectAllocationSampleEvent.java` 
>> - In progress: Run tiers 1-6 to make sure there are no regressions.
>
> Serguei Spitsyn has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   review: addressed next round of review suggestions

Leonid and Chris, thank you for review!

-

PR Comment: https://git.openjdk.org/jdk/pull/13133#issuecomment-1492575146


Re: RFR: 8305341: Alignment outside of HotSpot should be enforced by alignas instead of compiler specific attributes

2023-03-31 Thread Chris Plummer
On Fri, 31 Mar 2023 06:07:39 GMT, Julian Waters  wrote:

> C11 has been stable for a long time on all platforms, so native code can use 
> the standard alignas operator for alignment requirements

I don't have any comments on this change in general (it's not something I've 
dealt with in the past), but I did notice that there are a couple of places you 
missed:


src/hotspot/share/utilities/globalDefinitions_visCPP.hpp:119:#define ATTRIBUTE_ALIGNED(x) __declspec(align(x))
src/java.desktop/share/native/libfreetype/include/freetype/internal/ftvalid.h:82:  /* __declspec(align())' in order to compile cleanly with */
src/java.desktop/share/native/libfreetype/src/smooth/ftgrays.c:484:  /* __declspec(align())' in order to compile cleanly with */


For the 2nd and 3rd ones you would want to remove all of the following:


#if defined( _MSC_VER )  /* Visual C++ (and Intel C++) */
  /* We disable the warning `structure was padded due to   */
  /* __declspec(align())' in order to compile cleanly with */
  /* the maximum level of warnings.*/
#pragma warning( push )
#pragma warning( disable : 4324 )
#endif /* _MSC_VER */
...
#if defined( _MSC_VER )
#pragma warning( pop )
#endif

-

PR Comment: https://git.openjdk.org/jdk/pull/13258#issuecomment-1492522828


Re: RFR: 8305237: CompilerDirectives DCmds permissions correction

2023-03-31 Thread Chris Plummer
On Fri, 31 Mar 2023 08:24:19 GMT, Kevin Walls  wrote:

> The Permissions in DCmds relate to remote usage over JMX. 
> "monitor" is generally for reading information, and "control" is generally 
> for making changes.
> The DCmds for changing compiler directives should have "control" as the 
> required permission.
> 
> Tests in test/hotspot/jtreg/serviceability/dcmd/compiler and 
> test/hotspot/jtreg/compiler/compilercontrol still pass with this change.

I assume this means we have no tests that try to change these compiler 
directives. Should we?

-

PR Comment: https://git.openjdk.org/jdk/pull/13262#issuecomment-1492504793


Re: RFR: 8291555: Implement alternative fast-locking scheme [v49]

2023-03-31 Thread Roman Kennke
> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is fixed size, currently with 8 elements. According to my 
> experiments with various workloads, this covers the vast majority of 
> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
> thread at a time). We check for overflow in the fast-paths and when the 
> lock-stack is full, we take the slow-path, which would inflate the lock to a 
> monitor. That case should be very rare.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear whether it is worth adding support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change makes it possible to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x 
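
As a rough illustration of the lock-stack idea in the description above, here is a minimal Java model. It is only a sketch: the real structure lives inside HotSpot (C++ and assembly), and the names `LockStackSketch`, `tryPush`, `contains` and `remove` are hypothetical. The capacity of 8 and the "inflate on overflow or recursion" behaviour are taken from the description; everything else is an assumption made for readability.

```java
// Illustrative model only; not the HotSpot implementation.
final class LockStackSketch {
    // Fixed size, "currently with 8 elements" according to the description.
    static final int CAPACITY = 8;

    // One instance per thread, mirroring the thread-local nature of the lock-stack.
    static final ThreadLocal<LockStackSketch> CURRENT =
            ThreadLocal.withInitial(LockStackSketch::new);

    private final Object[] slots = new Object[CAPACITY];
    private int top = 0; // number of slots in use

    // Fast path: record that the current thread fast-locked obj.
    // Returns false when the stack is full or obj is already present
    // (recursive locking); in both cases the caller would take the
    // slow path and inflate to a full monitor.
    boolean tryPush(Object obj) {
        if (top == CAPACITY || contains(obj)) {
            return false;
        }
        slots[top++] = obj;
        return true;
    }

    // "Does the current thread own me?" answered by a linear scan
    // using identity comparison.
    boolean contains(Object obj) {
        for (int i = 0; i < top; i++) {
            if (slots[i] == obj) {
                return true;
            }
        }
        return false;
    }

    // Remove obj on unlock, closing the gap so the stack stays dense.
    boolean remove(Object obj) {
        for (int i = top - 1; i >= 0; i--) {
            if (slots[i] == obj) {
                System.arraycopy(slots, i + 1, slots, i, top - i - 1);
                slots[--top] = null;
                return true;
            }
        }
        return false;
    }

    int size() {
        return top;
    }
}
```

With at most 8 entries, the linear scan in `contains` stays cheap, which matches the claim that the common ownership query is answered quickly.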

Re: RFR: 8297286: runtime/vthread tests crashing after JDK-8296324 [v14]

2023-03-31 Thread Chris Plummer
On Fri, 31 Mar 2023 06:52:18 GMT, Serguei Spitsyn  wrote:

>> The fix is to enable virtual threads support for late binding JVMTI agents.
>> The fix includes:
>> - New function `JvmtiEnvBase::enable_virtual_threads_notify_jvmti()` which 
>> enables JVMTI VTMS transition notifications when an agent is loaded 
>> into a running VM. This function executes a VM operation counting VTMS 
>> transition bits in all `JavaThread`s to correctly set the static counter 
>> `_VTMS_transition_count` needed for the VTMS transition protocol.
>> - New function `JvmtiEnvBase::disable_virtual_threads_notify_jvmti()` which 
>> is needed for testing. It is used by the `WhiteBox` API.
>> - New WhiteBox function `WB_SetVirtualThreadsNotifyJvmtiMode(JNIEnv* env, 
>> jobject wb, jboolean enable)` needed for testing of this update.
>> - New regression test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
>> 
>> Testing:
>> - New test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
>> - The originally failed tests are expected to pass now:
>>   `runtime/vthread/RedefineClass.java`
>>   `runtime/vthread/TestObjectAllocationSampleEvent.java` 
>> - In progress: Run tiers 1-6 to make sure there are no regressions.
>
> Serguei Spitsyn has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   review: addressed next round of review suggestions

Changes look good, but for the most part I just looked at the test related 
changes.

-

Marked as reviewed by cjplummer (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13133#pullrequestreview-1367488100


Re: RFR: 8291555: Implement alternative fast-locking scheme [v48]

2023-03-31 Thread Thomas Stuefe
On Fri, 31 Mar 2023 13:54:47 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is fixed size, currently with 8 elements. According to my 
>> experiments with various workloads, this covers the vast majority of 
>> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
>> thread at a time). We check for overflow in the fast-paths and when the 
>> lock-stack is full, we take the slow-path, which would inflate the lock to a 
>> monitor. That case should be very rare.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change makes it possible to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads 

Re: RFR: 8291555: Implement alternative fast-locking scheme [v48]

2023-03-31 Thread Thomas Stuefe
On Fri, 31 Mar 2023 15:24:07 GMT, Thomas Stuefe  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Check underflow, top-of-stack and mark-bits for sanity, in fast_unlock() 
>> (aarch64)
>
> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6264:
> 
>> 6262: ldrw(t1, Address(rthread, JavaThread::lock_stack_top_offset()));
>> 6263: cmpw(t1, (unsigned)LockStack::start_offset());
>> 6264: br(Assembler::GT, stack_ok);
> 
> I had to think hard about "GT" here. 
> 
> We could have entered with the thread holding just one inflated lock, then 
> LockStack would be empty but the monitorexit would still be valid. You now do 
> check in the callers for markWord::monitor_value. But the lock could have 
> been inflated concurrently after the caller checks and before this point. 
> 
> But then the LockStack would not have changed, since it represents what the 
> current thread *thinks* are thin locks, not what are actually thin locks? In 
> other words, LockStack is only modified by its owning thread, never from the 
> outside.
> 
> So this *should* be correct, but it's certainly a brain teaser. Maybe add a 
> comment? 
> 
> E.g. "These checks rely on the fact that LockStack is only ever modified by 
> its owning thread, even if the lock got inflated concurrently; removal of 
> LockStack entries after inflation will happen delayed in that case" or 
> somesuch.

This also mandates that fast_lock can only ever be entered if the current thread 
thinks that the lock in question is a thin lock. So the caller checks for 
markWord::monitor_value are mandatory now.
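
To make the argument above easier to follow, here is a small Java model of the sanity checks being discussed. It is a sketch under stated assumptions, not the aarch64 code: `OwnedLockStack`, `START` and `fastUnlock` are hypothetical names, the `top > START` test stands in for the `GT` branch on the lock-stack top offset, and the owner check models the invariant that only the owning thread ever modifies its LockStack.

```java
// Illustrative model only; names and layout are assumptions.
final class OwnedLockStack {
    private static final int START = 0;   // stands in for LockStack::start_offset()
    private static final int CAPACITY = 8;

    private final Thread owner = Thread.currentThread();
    private final Object[] slots = new Object[CAPACITY];
    private int top = START;

    // The invariant the review comment relies on: no other thread
    // (for example one that inflates the lock concurrently) touches
    // this stack.
    private void checkOwner() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("lock-stack modified by non-owning thread");
        }
    }

    void push(Object obj) {
        checkOwner();
        if (top == CAPACITY) {
            throw new IllegalStateException("lock-stack overflow");
        }
        slots[top++] = obj;
    }

    // Models the underflow and top-of-stack checks before popping.
    void fastUnlock(Object obj) {
        checkOwner();
        if (!(top > START)) {
            throw new IllegalStateException("lock-stack underflow");
        }
        if (slots[top - 1] != obj) {
            throw new IllegalStateException("object is not on top of the lock-stack");
        }
        slots[--top] = null;
    }

    int depth() {
        return top - START;
    }
}
```

Because the stack reflects what the owning thread believes to be thin locks, a concurrent inflation by another thread leaves `depth()` unchanged, which is why the non-empty check remains valid at this point.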

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1154619603


Re: RFR: 8301995: Move invokedynamic resolution information out of ConstantPoolCacheEntry [v16]

2023-03-31 Thread Matias Saavedra Silva
On Tue, 28 Mar 2023 19:50:36 GMT, Matias Saavedra Silva  
wrote:

>> The current structure used to store the resolution information for 
>> invokedynamic, ConstantPoolCacheEntry, is difficult to interpret due to its 
>> ambiguous fields f1 and f2. This structure can hold information for fields, 
>> methods, and invokedynamics and each of its fields can hold different types 
>> of values depending on the entry. 
>> 
>> This enhancement proposes a new structure to exclusively contain 
>> invokedynamic information in a manner that is easy to interpret and easy to 
>> extend.  Resolved invokedynamic entries will be stored in an array in the 
>> constant pool cache and the operand of the invokedynamic bytecode will be 
>> rewritten to be the index into this array.
>> 
>> Any areas that previously accessed invokedynamic data from 
>> ConstantPoolCacheEntry will be replaced with accesses to this new array and 
>> structure. Verified with tier1-9 tests.
>> 
>> The PPC port was provided by @reinrich, RISCV was provided by @DingliZhang 
>> and @zifeihan, and S390x by @offamitkumar.
>> 
>> This change supports the following platforms: x86, aarch64, PPC, RISCV, and 
>> S390x
>
> Matias Saavedra Silva has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   s390x NULL to nullptr

> This obviously breaks arm, since its implementation is missing. I opened 
> https://bugs.openjdk.org/browse/JDK-8305387 to track this. This is 
> unfortunate since it holds work on arm in other areas, in my case for #10907.
> 
> > This change supports the following platforms: x86, aarch64, PPC, RISCV, and 
> > S390x
> 
> I wonder about the explicit exclusion of arm. Every other CPU seems to be 
> taken care of, even those Oracle does not maintain. Just curious, was there a 
> special reason for excluding arm?

There is no special reason ARM32 was excluded other than the fact no porter has 
picked it up yet. Fortunately I was able to get in contact with porters for the 
other platforms, but nobody took on the ARM port until now. Thank you for 
opening the issue!
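
For readers new to the change, the quoted description (a dedicated array of resolved invokedynamic entries, with the bytecode operand rewritten to an index into that array) can be pictured with the following Java sketch. This is not the HotSpot data structure: `ResolvedIndyEntrySketch`, `ConstantPoolCacheSketch` and all field and method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one resolved invokedynamic entry.
// Each field has a single meaning, unlike the overloaded f1/f2 fields
// mentioned in the description.
final class ResolvedIndyEntrySketch {
    final int constantPoolIndex;   // which constant pool entry this call site uses
    Object resolvedTarget;         // null until the call site has been resolved

    ResolvedIndyEntrySketch(int constantPoolIndex) {
        this.constantPoolIndex = constantPoolIndex;
    }

    boolean isResolved() {
        return resolvedTarget != null;
    }
}

// Hypothetical stand-in for the part of the constant pool cache that
// holds the invokedynamic entries.
final class ConstantPoolCacheSketch {
    private final List<ResolvedIndyEntrySketch> indyEntries = new ArrayList<>();

    // Called when bytecodes are rewritten: the returned index is what
    // the invokedynamic operand would be rewritten to.
    int addIndyEntry(int constantPoolIndex) {
        indyEntries.add(new ResolvedIndyEntrySketch(constantPoolIndex));
        return indyEntries.size() - 1;
    }

    // Lookup by the rewritten operand.
    ResolvedIndyEntrySketch indyEntryAt(int operand) {
        return indyEntries.get(operand);
    }
}
```

A port for a new CPU then mainly has to teach its interpreter and compilers to load an entry by this index instead of decoding a `ConstantPoolCacheEntry`.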

-

PR Comment: https://git.openjdk.org/jdk/pull/12778#issuecomment-1492144686


Re: RFR: 8301995: Move invokedynamic resolution information out of ConstantPoolCacheEntry [v16]

2023-03-31 Thread Doug Simon
On Tue, 28 Mar 2023 19:50:36 GMT, Matias Saavedra Silva  
wrote:

>> The current structure used to store the resolution information for 
>> invokedynamic, ConstantPoolCacheEntry, is difficult to interpret due to its 
>> ambiguous fields f1 and f2. This structure can hold information for fields, 
>> methods, and invokedynamics and each of its fields can hold different types 
>> of values depending on the entry. 
>> 
>> This enhancement proposes a new structure to exclusively contain 
>> invokedynamic information in a manner that is easy to interpret and easy to 
>> extend.  Resolved invokedynamic entries will be stored in an array in the 
>> constant pool cache and the operand of the invokedynamic bytecode will be 
>> rewritten to be the index into this array.
>> 
>> Any areas that previously accessed invokedynamic data from 
>> ConstantPoolCacheEntry will be replaced with accesses to this new array and 
>> structure. Verified with tier1-9 tests.
>> 
>> The PPC port was provided by @reinrich, RISCV was provided by @DingliZhang 
>> and @zifeihan, and S390x by @offamitkumar.
>> 
>> This change supports the following platforms: x86, aarch64, PPC, RISCV, and 
>> S390x
>
> Matias Saavedra Silva has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   s390x NULL to nullptr

It has also broken GraalVM Native Image. I'll open a JBS issue with a 
reproducer soon but here's hs-err from a slowdebug JDK build showing the 
problem:
[hs_err_pid30379.log](https://github.com/openjdk/jdk/files/11122818/hs_err_pid30379.log)

-

PR Comment: https://git.openjdk.org/jdk/pull/12778#issuecomment-1492011186


Re: RFR: 8301995: Move invokedynamic resolution information out of ConstantPoolCacheEntry [v16]

2023-03-31 Thread Thomas Stuefe
On Tue, 28 Mar 2023 19:50:36 GMT, Matias Saavedra Silva  
wrote:

>> The current structure used to store the resolution information for 
>> invokedynamic, ConstantPoolCacheEntry, is difficult to interpret due to its 
>> ambiguous fields f1 and f2. This structure can hold information for fields, 
>> methods, and invokedynamics and each of its fields can hold different types 
>> of values depending on the entry. 
>> 
>> This enhancement proposes a new structure to exclusively contain 
>> invokedynamic information in a manner that is easy to interpret and easy to 
>> extend.  Resolved invokedynamic entries will be stored in an array in the 
>> constant pool cache and the operand of the invokedynamic bytecode will be 
>> rewritten to be the index into this array.
>> 
>> Any areas that previously accessed invokedynamic data from 
>> ConstantPoolCacheEntry will be replaced with accesses to this new array and 
>> structure. Verified with tier1-9 tests.
>> 
>> The PPC port was provided by @reinrich, RISCV was provided by @DingliZhang 
>> and @zifeihan, and S390x by @offamitkumar.
>> 
>> This change supports the following platforms: x86, aarch64, PPC, RISCV, and 
>> S390x
>
> Matias Saavedra Silva has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   s390x NULL to nullptr

This obviously breaks arm, since its implementation is missing. I opened 
https://bugs.openjdk.org/browse/JDK-8305387 to track this. This is unfortunate 
since it holds work on arm in other areas, in my case for 
https://github.com/openjdk/jdk/pull/10907.

> This change supports the following platforms: x86, aarch64, PPC, RISCV, and 
> S390x

I wonder about the explicit exclusion of arm. Every other CPU seems to be taken 
care of, even those Oracle does not maintain. Just curious, was there a special 
reason for excluding arm?

-

PR Comment: https://git.openjdk.org/jdk/pull/12778#issuecomment-1491971108


Re: RFR: 8291555: Implement alternative fast-locking scheme [v48]

2023-03-31 Thread Roman Kennke
> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is fixed size, currently with 8 elements. According to my 
> experiments with various workloads, this covers the vast majority of 
> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
> thread at a time). We check for overflow in the fast-paths and when the 
> lock-stack is full, we take the slow-path, which would inflate the lock to a 
> monitor. That case should be very rare.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear whether it is worth adding support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change makes it possible to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x 
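
The ANONYMOUS_OWNER hand-over described above can be modelled roughly as follows. This Java sketch is an assumption-laden illustration, not the ObjectMonitor code: `MonitorSketch`, `inflateFastLocked`, `tryEnter` and `exit` are hypothetical names, and the real protocol also involves the mark word and the lock-stack, which are left out here.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative model only; not the HotSpot ObjectMonitor.
final class MonitorSketch {
    // Sentinel meaning "some thread owns this, but we do not know which one".
    static final Object ANONYMOUS_OWNER = new Object();

    private final AtomicReference<Object> owner = new AtomicReference<>();

    // A contending thread inflates a fast-locked object. It cannot tell
    // who holds the lock, so it records the anonymous marker.
    static MonitorSketch inflateFastLocked() {
        MonitorSketch m = new MonitorSketch();
        m.owner.set(ANONYMOUS_OWNER);
        return m;
    }

    // Succeeds only when the monitor is free.
    boolean tryEnter(Thread t) {
        return owner.compareAndSet(null, t);
    }

    // Called at monitorexit by the thread that actually holds the lock.
    // If it observes the anonymous marker, it knows the owner must be
    // itself, fixes the owner field, and then exits normally.
    void exit(Thread self) {
        owner.compareAndSet(ANONYMOUS_OWNER, self);
        if (owner.get() != self) {
            throw new IllegalMonitorStateException();
        }
        owner.set(null); // release, so a contending thread can enter
    }

    Object owner() {
        return owner.get();
    }
}
```

The contending thread never needs to identify the real owner; it only needs the monitor to be occupied until the owner reaches its exit path.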

Re: RFR: 8257967: JFR: Events for loaded agents [v14]

2023-03-31 Thread Markus Grönlund
On Fri, 31 Mar 2023 03:05:31 GMT, David Holmes  wrote:

>> Markus Grönlund has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   restore misssing frees
>
> src/hotspot/share/prims/agent.cpp line 533:
> 
>> 531: if (thread->is_pending_jni_exception_check()) {
>> 532:   thread->clear_pending_jni_exception_check();
>> 533: }
> 
> Unsure why we pretend the agent checked this - don't we want -Xcheck:jni to 
> report a bug in the agent?

Good question - I don't know. For dynamically loaded agents, there seems to be 
quite a lot of handling to return a JNI_OK, even though the agent failed to 
load or returned failure from the Agent_OnAttach. e.g.

  // Agent_OnAttach executed so completion status is JNI_OK
  return JNI_OK;

-

PR Review Comment: https://git.openjdk.org/jdk/pull/12923#discussion_r1154346856


Re: RFR: 8257967: JFR: Events for loaded agents [v15]

2023-03-31 Thread Markus Grönlund
> Greetings,
> 
> We are adding support to let JFR report on Agents.
> 
>  Design
> 
> An Agent is a library that uses any instrumentation or profiling APIs. Most 
> agents are started and initialized on the command line, but agents can also 
> be loaded dynamically during runtime. Because command line agents initialize 
> during the VM startup sequence, they add to the overall startup time latency 
> in getting the VM ready. The events will report on the time the agent took to 
> initialize.
> 
> A JavaAgent is an agent written in the Java programming language, using the 
> APIs in the package 
> [java.lang.instrument](https://docs.oracle.com/en/java/javase/19/docs/api/java.instrument/java/lang/instrument/package-summary.html)
> 
> A JavaAgent is sometimes called a JPLIS agent, where the acronym JPLIS stands 
> for Java Programming Language Instrumentation Services.
> 
> To report on JavaAgents, JFR will add the new event type jdk.JavaAgent and 
> events will look similar to these two examples:
> 
> // Command line
> jdk.JavaAgent {
>   startTime = 12:31:19.789 (2023-03-08)
>   name = "JavaAgent.jar"
>   options = "foo=bar"
>   dynamic = false
>   initializationTime = 12:31:15.574 (2023-03-08)
>   initializationDuration = 172 ms
> }
> 
> // Dynamic load
> jdk.JavaAgent {
>   startTime = 12:31:31.158 (2023-03-08)
>   name = "JavaAgent.jar"
>   options = "bar=baz"
>   dynamic = true
>   initializationTime = 12:31:31.037 (2023-03-08)
>   initializationDuration = 64,1 ms
> }
> 
> The jdk.JavaAgent event type is a JFR periodic event that iterates over 
> running Java agents.
> 
> For a JavaAgent event, the agent's name will be the specific .jar file 
> containing the instrumentation code. The options will be the specific options 
> passed to the .jar file as part of launching the agent, for example, on the 
> command line: -javaagent:JavaAgent.jar=foo=bar.
> 
> The "dynamic" field denotes if the agent was loaded via the command line 
> (dynamic = false) or dynamically (dynamic = true)
> 
> "initializationTime" is the timestamp the JVM invoked the initialization 
> method, and "initializationDuration" is the duration of executing the 
> initialization method.
> 
> "startTime" represents the time the JFR framework issued the periodic event; 
> hence "initializationTime" will be earlier than "startTime".
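
A sketch of how a recording could be inspected for these events with the JFR consumer API. The field names are taken from the example output above; the class and method names (`JavaAgentEvents`, `summarize`, `reportingLag`) are hypothetical, and the code assumes the event is available in the JDK that produced the recording.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Hypothetical helper; field names follow the example events shown above.
class JavaAgentEvents {

    // Time between agent initialization and the periodic event that reported it.
    static Duration reportingLag(Instant initializationTime, Instant startTime) {
        return Duration.between(initializationTime, startTime);
    }

    // Reads a recording file and returns one summary line per jdk.JavaAgent event.
    static List<String> summarize(Path recording) throws IOException {
        List<String> lines = new ArrayList<>();
        for (RecordedEvent e : RecordingFile.readAllEvents(recording)) {
            if (!"jdk.JavaAgent".equals(e.getEventType().getName())) {
                continue; // skip all other event types
            }
            Instant init = e.getInstant("initializationTime");
            lines.add(e.getString("name")
                + " options=" + e.getString("options")
                + " dynamic=" + e.getBoolean("dynamic")
                + " initDuration=" + e.getDuration("initializationDuration")
                + " reportedAfter=" + reportingLag(init, e.getStartTime()));
        }
        return lines;
    }
}
```

For the command-line example above, `reportingLag` would be 12:31:19.789 minus 12:31:15.574, that is 4.215 s: the agent initialized during startup and was reported by the first periodic event afterwards.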
> 
> An agent can also be written in a native programming language using the [JVM 
> Tools Interface 
> (JVMTI)](https://docs.oracle.com/en/java/javase/19/docs/specs/jvmti.html). 
> This kind of agent, sometimes called a native agent, is a platform-specific 
> binary, sometimes referred to as a library, but here it means a .so or .dll 
> file.
> 
> To report on native agents, JFR will add the new event type jdk.NativeAgent 
> and events will look similar to this example:
> 
> jdk.NativeAgent {
>   startTime = 12:31:40.398 (2023-03-08)
>   name = "jdwp"
>   options = "transport=dt_socket,server=y,address=any,onjcmd=y"
>   dynamic = false
>   initializationTime = 12:31:36.142 (2023-03-08)
>   initializationDuration = 0,00184 ms
>   path = 
> "c:\ade\github\openjdk\jdk\build\windows-x86_64-server-slowdebug\jdk\bin\jdwp.dll"
> }
> 
> The layout of the event type is very similar to the jdk.JavaAgent event, but 
> here the path to the native library is reported.
> 
> The initialization of a native agent is performed by invoking an 
> agent-specified callback routine. The "initializationTime" is when the JVM 
> sent or would have sent the JVMTI VMInit event to a specified callback. 
> "initializationDuration" is the duration to execute that specific callback. 
> If no callback is specified for the JVMTI VMInit event, the 
> "initializationDuration" will be 0. If the agent is loaded dynamically, 
> "initializationDuration" is the time taken to execute the Agent_OnAttach 
> callback.
> 
>  Implementation
> 
> There has not existed a reification of a JavaAgent directly in the JVM, as 
> these are built on top of the JDK native library, "instrument", using a 
> many-to-one mapping. At the level of the JVM, the only representation of 
> agents after startup is through JvmtiEnv's, which agents request from the JVM 
> during startup and initialization — as such, mapping which JvmtiEnv belongs 
> to what JavaAgent was not possible before.
> 
> Using implementation details of how the JDK native library "instrument" 
> interacts with the JVM, we can build this mapping to track what JvmtiEnv's 
> "belong" to what JavaAgent. This mapping now lets us report the Java-relevant 
> context (name, options) and measure the time it takes for the JavaAgent to 
> initialize.
> 
> When implementing this capability, it was necessary to refactor the code used 
> to represent agents, AgentLibrary. The previous implementation was located 
> primarily in arguments.cpp and threads.cpp, but also in jvmtiExport.cpp.
> 
> The refactoring isolates the relevant logic into two new modules, 
> prims/agent.hpp and prims/agentList.hpp. Breaking out 

Re: RFR: 8304919: Implementation of Virtual Threads [v5]

2023-03-31 Thread Alan Bateman
> JEP 444 proposes to make virtual threads a permanent feature in Java 21. The 
> APIs that were preview APIs in Java 19/20 are changed to permanent and their 
> `@since`/equivalent are changed to 21 (as per the guidance in JEP 12). The 
> JNI and JVMTI versions are bumped as this is the first change in 21 to need 
> the new version number. A lot of tests are updated to drop `@enablePreview` 
> and --enable-preview.
> 
> There is one API change from Java 19/20, the preview API 
> Thread.Builder.allowSetThreadLocals(boolean) is dropped. This requires an 
> update to the JVMTI GetThreadInfo implementation to read the TCCL 
> consistently.
> 
> In addition, there are a small number of implementation changes to sync up 
> from the loom fibers branch:
> 
> - A number of stack frames are `@Hidden` to reduce noise in the stack traces. 
> This exposed a few issues with the stack walker code. More specifically, the 
> cases where the end of a continuation falls precisely at the end of the batch, 
> or where the remaining frames are hidden, weren't handled correctly.
> - The code to emit the JFR jdk.ThreadSleepEvent is refactored so it's in 
> Thread rather than in two classes.
> - A few robustness improvements for OOME and SOE. There is more to do here, 
> for future PRs.
> - New system property to print a stack trace when a virtual thread sets its 
> own value of a TL.
> - ThreadPerTaskExecutor is changed to use FutureTask.
> 
> Testing: tier1-6.

Alan Bateman has updated the pull request with a new target base due to a merge 
or a rebase. The incremental webrev excludes the unrelated changes brought in 
by the merge/rebase. The pull request contains ten additional commits since the 
last revision:

 - Expand tests for jdk.ThreadSleep event
 - Review feedback
 - Merge
 - Fix ThreadSleepEvent again
 - Test updates
 - ThreadSleepEvent refactoring
 - Merge
 - Merge
 - Initial sync from fibers branch

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/13203/files
  - new: https://git.openjdk.org/jdk/pull/13203/files/bfd2c816..722d5afa

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=13203=04
 - incr: https://webrevs.openjdk.org/?repo=jdk=13203=03-04

  Stats: 4799 lines in 134 files changed: 3144 ins; 1060 del; 595 mod
  Patch: https://git.openjdk.org/jdk/pull/13203.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13203/head:pull/13203

PR: https://git.openjdk.org/jdk/pull/13203


Re: RFR: 8305237: CompilerDirectives DCmds permissions correction

2023-03-31 Thread Kevin Walls
On Fri, 31 Mar 2023 08:24:19 GMT, Kevin Walls  wrote:

> The Permissions in DCmds relate to remote usage over JMX. 
> "monitor" is generally for reading information, and "control" is generally 
> for making changes.
> The DCmds for changing compiler directives should have "control" as the 
> required permission.
> 
> Tests in test/hotspot/jtreg/serviceability/dcmd/compiler and 
> test/hotspot/jtreg/compiler/compilercontrol still pass with this change.

This has a lot of labels for a trivial change in a very niche feature, but they 
all seem relevant.

-

PR Comment: https://git.openjdk.org/jdk/pull/13262#issuecomment-1491551796


RFR: 8305237: CompilerDirectives DCmds permissions correction

2023-03-31 Thread Kevin Walls
The Permissions in DCmds relate to remote usage over JMX. 
"monitor" is generally for reading information, and "control" is generally for 
making changes.
The DCmds for changing compiler directives should have "control" as the 
required permission.

Tests in test/hotspot/jtreg/serviceability/dcmd/compiler and 
test/hotspot/jtreg/compiler/compilercontrol still pass with this change.

-

Commit messages:
 - 8305237: CompilerDirectives DCmds permissions correction

Changes: https://git.openjdk.org/jdk/pull/13262/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk=13262=00
  Issue: https://bugs.openjdk.org/browse/JDK-8305237
  Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/13262.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13262/head:pull/13262

PR: https://git.openjdk.org/jdk/pull/13262


Re: RFR: 8291555: Implement alternative fast-locking scheme [v47]

2023-03-31 Thread Thomas Stuefe
On Fri, 31 Mar 2023 06:06:47 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is fixed size, currently with 8 elements. According to my 
>> experiments with various workloads, this covers the vast majority of 
>> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
>> thread at a time). We check for overflow in the fast-paths and when the 
>> lock-stack is full, we take the slow-path, which would inflate the lock to a 
>> monitor. That case should be very rare.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
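
The lock-stack behaviour described above can be modelled in a few lines. This is a simplified Java sketch, not the HotSpot implementation (which is C++ and works on oops); the class and method names are made up for illustration. It shows the three properties mentioned: fixed capacity of 8, a linear scan for the ownership query, and fallback to the slow path on overflow or recursive locking.

```java
// Simplified model of a per-thread lock-stack; not the HotSpot implementation.
class LockStack {
    static final int CAPACITY = 8;   // fixed size quoted in the description
    private final Object[] slots = new Object[CAPACITY];
    private int top = 0;

    // Fast path. Returns false when the stack is full or the object is already
    // on it (recursive lock); the caller would then take the slow path and
    // inflate the lock to a full monitor.
    boolean tryPush(Object o) {
        if (top == CAPACITY || contains(o)) {
            return false;
        }
        slots[top++] = o;
        return true;
    }

    // Removes the most recently pushed object (balanced locking assumed).
    Object pop() {
        if (top == 0) {
            throw new IllegalStateException("lock-stack is empty");
        }
        Object o = slots[--top];
        slots[top] = null;
        return o;
    }

    // "Does the current thread own me?": linear scan by reference identity.
    boolean contains(Object o) {
        for (int i = 0; i < top; i++) {
            if (slots[i] == o) {
                return true;
            }
        }
        return false;
    }
}
```

Because the array rarely holds more than a handful of entries, the linear scan in `contains` stays cheap, which is why the common ownership check can live in the fast path.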
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered removing stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change makes it possible to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads 

Re: RFR: 8297286: runtime/vthread tests crashing after JDK-8296324 [v14]

2023-03-31 Thread Serguei Spitsyn
> The fix is to enable virtual threads support for late binding JVMTI agents.
> The fix includes:
> - New function `JvmtiEnvBase::enable_virtual_threads_notify_jvmti()` which 
> enables JVMTI VTMS transition notifications when an agent is loaded into a 
> running VM. This function executes a VM operation that counts VTMS 
> transition bits in all `JavaThread`s to correctly set the static counter 
> `_VTMS_transition_count` needed for the VTMS transition protocol.
> - New function `JvmtiEnvBase::disable_virtual_threads_notify_jvmti()` which 
> is needed for testing. It is used by the `WhiteBox` API.
> - New WhiteBox function `WB_SetVirtualThreadsNotifyJvmtiMode(JNIEnv* env, 
> jobject wb, jboolean enable)` needed for testing of this update.
> - New regression test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
> 
> Testing:
> - New test: `serviceability/jvmti/vthread/ToggleNotifyJvmtiTest`
> - The originally failing tests are expected to pass now:
>   `runtime/vthread/RedefineClass.java`
>   `runtime/vthread/TestObjectAllocationSampleEvent.java` 
> - In progress: run tiers 1-6 to make sure there are no regressions.

Serguei Spitsyn has updated the pull request incrementally with one additional 
commit since the last revision:

  review: addressed next round of review suggestions

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/13133/files
  - new: https://git.openjdk.org/jdk/pull/13133/files/1bb250a7..aef87273

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=13133=13
 - incr: https://webrevs.openjdk.org/?repo=jdk=13133=12-13

  Stats: 34 lines in 1 file changed: 3 ins; 7 del; 24 mod
  Patch: https://git.openjdk.org/jdk/pull/13133.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13133/head:pull/13133

PR: https://git.openjdk.org/jdk/pull/13133


RFR: 8305341: Alignment outside of HotSpot should be enforced by alignas instead of compiler specific attributes

2023-03-31 Thread Julian Waters
C11 has been stable for a long time on all platforms, so native code can use 
the standard alignas specifier for alignment requirements.

-

Commit messages:
 - 
 - GSSLibStub.c
 - ArrayReferenceImpl.c
 - Alignment outside of HotSpot should be enforced by alignas instead of 
compiler specific attributes

Changes: https://git.openjdk.org/jdk/pull/13258/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk=13258=00
  Issue: https://bugs.openjdk.org/browse/JDK-8305341
  Stats: 12 lines in 3 files changed: 3 ins; 0 del; 9 mod
  Patch: https://git.openjdk.org/jdk/pull/13258.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13258/head:pull/13258

PR: https://git.openjdk.org/jdk/pull/13258


Re: RFR: 8291555: Implement alternative fast-locking scheme [v47]

2023-03-31 Thread Roman Kennke
> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is fixed size, currently with 8 elements. According to my 
> experiments with various workloads, this covers the vast majority of 
> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
> thread at a time). We check for overflow in the fast-paths and when the 
> lock-stack is full, we take the slow-path, which would inflate the lock to a 
> monitor. That case should be very rare.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear whether it is worth adding support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, thus handing over to the contending 
> thread.
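
The ANONYMOUS_OWNER hand-over can be sketched as a small state machine. This is a toy Java model under stated assumptions, not the ObjectMonitor code: the class `InflatedMonitor` and its methods are hypothetical, and the exiting thread is assumed to already know it holds the lock (in the real scheme, because the object is on its own lock-stack).

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of inflating a fast-locked object; names are hypothetical.
class InflatedMonitor {
    // Sentinel: "some thread holds this lock, but we do not know which one".
    static final Object ANONYMOUS_OWNER = new Object();

    private final AtomicReference<Object> owner = new AtomicReference<>();

    // Called by a contending thread that finds the object fast-locked.
    static InflatedMonitor inflateFastLocked() {
        InflatedMonitor m = new InflatedMonitor();
        m.owner.set(ANONYMOUS_OWNER);
        return m;
    }

    // A contender can only enter once the monitor has been released.
    boolean tryEnter(Thread self) {
        return owner.compareAndSet(null, self);
    }

    // Called at monitorexit by the thread that actually holds the lock.
    void exit(Thread self) {
        // An anonymous owner must be the exiting thread: fix up the owner first.
        owner.compareAndSet(ANONYMOUS_OWNER, self);
        if (owner.get() != self) {
            throw new IllegalMonitorStateException("not the owner");
        }
        owner.set(null); // release, so a contending thread can take over
    }

    Object owner() {
        return owner.get();
    }
}
```

The point of the sentinel is that the contending thread never has to search other threads' lock-stacks at inflation time; the owner identifies itself lazily when it exits.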
> 
> As an alternative, I considered removing stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
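
For illustration, the kind of workload described here might look like the following sketch. The class name `LockChurn` is made up, and this is not one of the benchmarks the author ran. Every iteration creates fresh `StringBuffer` and `Vector` instances, calls their synchronized methods from a single thread, and then discards them, so each object is locked a few times without contention and never reused.

```java
import java.util.Vector;

// Sketch of a lock-churning workload: many short-lived objects, each locked
// a few times by one thread only. Not an actual benchmark from the PR.
class LockChurn {
    static long run(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            StringBuffer sb = new StringBuffer(); // append/toString are synchronized
            sb.append("item").append(i);
            Vector<String> v = new Vector<>();    // add/size are synchronized
            v.add(sb.toString());
            total += v.size();
        }
        return total;
    }
}
```

With only heavy monitors, each of these uncontended lock operations would need an ObjectMonitor, which is the churn the paragraph above refers to. Note that a JIT compiler may eliminate some of these locks when it can prove the objects do not escape, so the sketch shows the shape of the code rather than a reliable reproducer.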
> 
> This change makes it possible to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in the monitor deflation 
> protocol. This would be implemented in a separate PR.
> 
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x