Re: RFR: JDK-8293114: JVM should trim the native heap [v12]

2023-07-20 Thread Robbin Ehn
On Fri, 14 Jul 2023 20:03:24 GMT, Thomas Stuefe  wrote:

>> This is a continuation of https://github.com/openjdk/jdk/pull/10085. I 
>> closed https://github.com/openjdk/jdk/pull/10085 because it had accumulated 
>> too much comment history and got confusing. For a history of this issue, see 
>> previous discussions [1] and the comment section of 10085.
>>  
>> ---
>> 
>> This RFE adds the option to trim the Glibc heap periodically. This can 
>> recover a significant memory footprint if the VM process suffers from 
>> high-but-rare malloc spikes. It does not matter who causes the spikes: the 
>> JDK or customer code running in the JVM process.
>> 
>> ### Background:
>> 
>> The Glibc is reluctant to return memory to the OS. Temporary malloc spikes 
>> often carry over as permanent RSS increase. Note that C-heap retention is 
>> difficult to observe. Since it is freed memory, it won't appear in NMT; it 
>> is just a part of RSS.
>> 
>> This is, effectively, caching - a performance tradeoff by the glibc. It 
>> makes a lot of sense with applications that cause high traffic on the 
>> C-heap. The JVM, however, clusters allocations and often rolls its own 
>> memory management based on virtual memory for many of its use cases.
>> 
>> To manually trim the C-heap, Glibc exposes `malloc_trim(3)`. With JDK 18 
>> [2], we added a new jcmd command to *manually* trim the C-heap on Linux 
>> (`jcmd System.trim_native_heap`). We then observed customers running this 
>> command periodically to slim down process sizes of container-bound jvms. 
>> That is cumbersome, and the JVM can do this a lot better - among other 
>> things because it knows best when *not* to trim.
>> 
>>  GLIBC internals
>> 
>> The following information was gathered from the glibc source code and from 
>> experimentation.
>> 
>> # Why do we need to trim manually? Does the Glibc not trim on free?
>> 
>> Upon `free()`, glibc may return memory to the OS if:
>> - the returned block was mmap'ed
>> - the returned block was not added to tcache or to fastbins
>> - the returned block, possibly merged with its two immediate neighbors, had 
>> they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in 
>> that case:
>>   a) for the main arena, glibc attempts to lower the brk()
>>   b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the 
>> heap.
>> In both cases, (a) and (b), only the top portion of the heap is reclaimed. 
>> "Holes" in the middle of other in-use chunks are not reclaimed.
>> 
>> So: glibc *may* automatically reclaim memory. In normal configurations, with 
>> typical C-heap allocation granularity, it is unlikely.
>> 
>> To increase the ...
>
> Thomas Stuefe has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Display unsupported text with UL warning level, + test
>  - Last Aleksey Feedback

Marked as reviewed by rehn (Reviewer).

-

PR Review: https://git.openjdk.org/jdk/pull/14781#pullrequestreview-1538731282
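To make the C-heap retention described above concrete, here is a minimal standalone C++ sketch (illustration only, not part of the patch): it produces a malloc spike, frees it, and then calls glibc's `malloc_trim(3)` - the same call the `System.trim_native_heap` jcmd relies on according to the description above.

// Minimal standalone sketch (not JDK code): freed C-heap memory may stay in
// the process RSS until malloc_trim(3) is invoked.
#include <malloc.h>   // malloc_trim (glibc-specific)
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
  // Simulate a malloc spike: many small allocations served from glibc arenas.
  std::vector<void*> blocks;
  for (int i = 0; i < 100000; i++) {
    blocks.push_back(std::malloc(1024));
  }
  // Free everything. RSS usually does not drop much here, because the freed
  // chunks are retained inside the arenas ("holes" are not reclaimed).
  for (void* p : blocks) {
    std::free(p);
  }
  // Manually return retained memory to the OS; a pad of 0 trims as much as possible.
  int rc = malloc_trim(0);
  std::printf("malloc_trim returned %d (1 means memory was released)\n", rc);
  return 0;
}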


Re: RFR: JDK-8293114: JVM should trim the native heap [v4]

2023-07-06 Thread Robbin Ehn
On Thu, 6 Jul 2023 16:14:20 GMT, Aleksey Shipilev  wrote:

>> Thomas Stuefe has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   last cleanups and shade feedback
>
> src/hotspot/share/runtime/trimNative.cpp line 149:
> 
>> 147: return true;
>> 148:   } else {
>> 149: log_info(trim)("Trim native heap (no details)");
> 
> Consistency: `Trim native heap: complete, no details`.

I would still like to know the trim_time.

> test/hotspot/jtreg/runtime/os/TestTrimNative.java line 247:
> 
>> 245: System.out.println("Will spike now...");
>> 246: for (int i = 0; i < numAllocations; i++) {
>> 247: ptrs[i] = 
>> Unsafe.getUnsafe().allocateMemory(szAllocations);
> 
> Pull `Unsafe.getUnsafe()` into a local or a field?

Maybe use WB instead of unsafe?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254816943
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254806836
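Regarding the trim_time request above, a rough sketch (plain C++ with std::chrono, hypothetical function name, not the actual trimNative.cpp code) of reporting the trim duration even when no RSS-change details are available:

// Hypothetical sketch: always log how long the trim took, even in the
// "no details" case where the size change is unknown.
#include <chrono>
#include <cstdio>
#include <malloc.h>

void trim_native_heap_and_log() {
  auto start = std::chrono::steady_clock::now();
  int released = malloc_trim(0);
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start).count();
  // Size-change details may be unavailable, but the duration always is.
  std::printf("Trim native heap: complete (released=%d), %lld ms (no details)\n",
              released, (long long)ms);
}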


Re: RFR: JDK-8293114: JVM should trim the native heap [v4]

2023-07-06 Thread Robbin Ehn
On Thu, 6 Jul 2023 15:25:03 GMT, Thomas Stuefe  wrote:

>> This is a continuation of https://github.com/openjdk/jdk/pull/10085. I 
>> closed https://github.com/openjdk/jdk/pull/10085 because it had accumulated 
>> too much comment history and got confusing. For a history of this issue, see 
>> previous discussions [1] and the comment section of 10085.
>>  
>> ---
>> 
>> This RFE adds the option to trim the Glibc heap periodically. This can 
>> recover a significant memory footprint if the VM process suffers from 
>> high-but-rare malloc spikes. It does not matter who causes the spikes: the 
>> JDK or customer code running in the JVM process.
>> 
>> ### Background:
>> 
>> The Glibc is reluctant to return memory to the OS. Temporary malloc spikes 
>> often carry over as permanent RSS increase. Note that C-heap retention is 
>> difficult to observe. Since it is freed memory, it won't appear in NMT; it 
>> is just a part of RSS.
>> 
>> This is, effectively, caching - a performance tradeoff by the glibc. It 
>> makes a lot of sense with applications that cause high traffic on the 
>> C-heap. The JVM, however, clusters allocations and often rolls its own 
>> memory management based on virtual memory for many of its use cases.
>> 
>> To manually trim the C-heap, Glibc exposes `malloc_trim(3)`. With JDK 18 
>> [2], we added a new jcmd command to *manually* trim the C-heap on Linux 
>> (`jcmd System.trim_native_heap`). We then observed customers running this 
>> command periodically to slim down process sizes of container-bound jvms. 
>> That is cumbersome, and the JVM can do this a lot better - among other 
>> things because it knows best when *not* to trim.
>> 
>>  GLIBC internals
>> 
>> The following information was gathered from the glibc source code and from 
>> experimentation.
>> 
>> # Why do we need to trim manually? Does the Glibc not trim on free?
>> 
>> Upon `free()`, glibc may return memory to the OS if:
>> - the returned block was mmap'ed
>> - the returned block was not added to tcache or to fastbins
>> - the returned block, possibly merged with its two immediate neighbors, had 
>> they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in 
>> that case:
>>   a) for the main arena, glibc attempts to lower the brk()
>>   b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the 
>> heap.
>> In both cases, (a) and (b), only the top portion of the heap is reclaimed. 
>> "Holes" in the middle of other in-use chunks are not reclaimed.
>> 
>> So: glibc *may* automatically reclaim memory. In normal configurations, with 
>> typical C-heap allocation granularity, it is unlikely.
>> 
>> To increase the ...
>
> Thomas Stuefe has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   last cleanups and shade feedback

Thanks, looks good to me.

src/hotspot/share/logging/logTag.hpp line 199:

> 197:   LOG_TAG(tlab) \
> 198:   LOG_TAG(tracking) \
> 199:   LOG_TAG(trim) \

Not sure if 'trim' is the best tag name for an average user?

src/hotspot/share/runtime/trimNative.cpp line 137:

> 135: os::size_change_t sc;
> 136: Ticks start = Ticks::now();
> 137: log_debug(trim)("Trim native heap started...");

TraceTime not a good fit?

-

Marked as reviewed by rehn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/14781#pullrequestreview-1517121458
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254801114
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1254805330
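For context on the TraceTime suggestion: TraceTime is HotSpot's RAII helper that measures a scope and logs the elapsed time when the scope ends. A standalone sketch of the same pattern (generic C++, not HotSpot's actual class):

// Standalone sketch of the RAII timing pattern: start the clock on
// construction, log the elapsed time on destruction.
#include <chrono>
#include <cstdio>

class ScopeTimer {
  const char* _title;
  std::chrono::steady_clock::time_point _start;
 public:
  explicit ScopeTimer(const char* title)
      : _title(title), _start(std::chrono::steady_clock::now()) {}
  ~ScopeTimer() {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - _start).count();
    std::printf("%s took %lld ms\n", _title, (long long)ms);
  }
};

void do_trim() {
  ScopeTimer t("Trim native heap");   // logged automatically when the scope ends
  // ... perform the trim ...
}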


Re: RFR: 8307067: remove broken EnableThreadSMRExtraValidityChecks option [v2]

2023-05-03 Thread Robbin Ehn
On Tue, 2 May 2023 20:37:18 GMT, Daniel D. Daugherty  wrote:

>> A trivial fix to remove broken EnableThreadSMRExtraValidityChecks option.
>
> Daniel D. Daugherty has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   dholmes CR - change ':' to '.'.

Looks good, thanks!

-

Marked as reviewed by rehn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13704#pullrequestreview-1412295770


Re: RFR: 8306006: strace001.java fails due to unknown methods on stack

2023-04-14 Thread Robbin Ehn
On Fri, 14 Apr 2023 13:27:37 GMT, Fredrik Bredberg  wrote:

> Added the missing java.lang.Thread.beforeSleep and java.lang.Thread.afterSleep to 
> expectedSystemTrace.
> Tested on my local machine.

Marked as reviewed by rehn (Reviewer).

Looks good to me!

-

PR Review: https://git.openjdk.org/jdk/pull/13476#pullrequestreview-1385466134
PR Comment: https://git.openjdk.org/jdk/pull/13476#issuecomment-1508507582


Re: RFR: 8291555: Implement alternative fast-locking scheme [v27]

2023-03-16 Thread Robbin Ehn
On Thu, 16 Mar 2023 10:26:26 GMT, Thomas Stuefe  wrote:

> Sounds good. Just to be clear, you mean enforce symmetric locking? resp. 
> forbid asymmetric locking?

Yes, sorry, thanks for correcting! :)

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v27]

2023-03-16 Thread Robbin Ehn
On Thu, 16 Mar 2023 09:02:19 GMT, Roman Kennke  wrote:

>> I like -XX:+UseNewLocks, too. I wouldn't overcomplicate things: this flag is 
>> meant to be transitional, it is not meant to be used by end-users (except 
>> the bravest nerds) at all. When it lands, the Lilliput flag (e.g. 
>> +UseCompactObjectHeaders) will also control the locking flag. Eventually 
>> (e.g. release+1) both flags would become on by default and afterwards (e.g. 
>> release+2) would go away entirely, at which point the whole original 
>> stack-locking would disappear.
>
>> @rkennke I must be missing something. In aarch64, why do we handle the 
>> non-symmetric-unlock case in interpreter, but not in C1/C2? There, we just 
>> seem to pop whatever is on top.
> 
> C1 and C2 don't allow asymmetric locking. If that ever happens, they would 
> refuse to compile the method. We should probably check that this assumption 
> holds true when popping the top entry in an #ASSERT block.

> > > @rkennke I must be missing something. In aarch64, why do we handle the 
> > > non-symmetric-unlock case in interpreter, but not in C1/C2? There, we 
> > > just seem to pop whatever is on top.
> > 
> > 
> > C1 and C2 don't allow asymmetric locking. If that ever happens, they would 
> > refuse to compile the method. We should probably check that this assumption 
> > holds true when popping the top entry in an #ASSERT block.
> 
> Thanks for clarifying. Yes, asserting that would make sense.

FYI:
I'm trying to convince folks that JVMS should be allowed to enforce asymmetric 
locking.
We think most people don't realize that they will, unintentionally, end up stuck in 
the interpreter.
What was discussed most recently was to diagnose and warn about this behavior as a 
first step.
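To illustrate what "asymmetric locking" means here, and the kind of "diagnose and warn" step mentioned above, a toy C++ sketch (hypothetical names, not JDK code) that checks whether monitors are released in reverse acquisition order:

// Toy illustration: asymmetric locking = monitors not released in LIFO
// (properly nested) order. Methods with such bytecode shapes are what
// C1/C2 refuse to compile, leaving them stuck in the interpreter.
#include <cstdio>
#include <vector>

struct Obj { int id; };

class LockOrderChecker {
  std::vector<const Obj*> _held;   // currently held monitors, in acquisition order
 public:
  void on_enter(const Obj* o) { _held.push_back(o); }
  void on_exit(const Obj* o) {
    if (_held.empty() || _held.back() != o) {
      std::printf("warning: asymmetric unlock of object %d\n", o->id);
      return;
    }
    _held.pop_back();              // symmetric (structured) release
  }
};

int main() {
  Obj a{1}, b{2};
  LockOrderChecker checker;
  checker.on_enter(&a);
  checker.on_enter(&b);
  checker.on_exit(&a);             // not LIFO: triggers the warning
  checker.on_exit(&b);
  return 0;
}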

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v27]

2023-03-16 Thread Robbin Ehn
On Thu, 16 Mar 2023 08:00:38 GMT, Thomas Stuefe  wrote:

> I like UseNewLocks but fear that this may conflict with Oracle's plan (?) to 
> move OMs into heap, which would be another revamp of locking - fat locks in 
> this case - and may come with yet another switch. Other than that, 
> UseNewLocks sounds good and succinct.
> 
> Another proposal: UseThreadLockStack or UseLockStack

Just a FYI, at the moment we have:

  product(ccstr, ObjectSynchronizerMode, "fast",\
  "ObjectSynchronizer modes: "  \
  "legacy: legacy native system; "  \
  "native: java entry with native monitors; "   \
  "heavy: java entry with always inflated Java monitors; "  \
  "fast: java entry with fast-locks and"\
  "  inflate-on-demand Java monitors; ")\



At least personally, I prefer a single option over many.
With a command line such as
`-XX:-UseLockStack -XX:+UseHeavyMonitors`
it's harder, at least for me, to figure out what is selected and what was intended to 
be selected.
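As an illustration of the single-option preference, a small sketch (hypothetical names, not the actual flag handling) of resolving one mode string, matching the values quoted above, into a single enum value:

// Hypothetical sketch: one mode flag resolved into one enum value, so the
// selected locking mode is unambiguous on the command line.
#include <cstring>

enum class SyncMode { Legacy, Native, Heavy, Fast };

SyncMode parse_sync_mode(const char* value) {
  if (std::strcmp(value, "legacy") == 0) return SyncMode::Legacy;
  if (std::strcmp(value, "native") == 0) return SyncMode::Native;
  if (std::strcmp(value, "heavy")  == 0) return SyncMode::Heavy;
  return SyncMode::Fast;   // "fast" is the default, as in the flag above
}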

-

PR: https://git.openjdk.org/jdk/pull/10907


RFR: 8300926: Several startup regressions ~6-70% in 21-b6 all platforms

2023-02-16 Thread Robbin Ehn
Hi all, please consider.

The original issue was when thread 1 asked to deoptimize nmethod set X and thread 2 
asked for the same set or a subset of X.
All methods would already be marked, but the actual deoptimization - making methods 
not entrant, patching PCs on stacks and patching post-call nops - was not done yet, 
which meant thread 2 could 'pass' thread 1.
Most places deoptimize under the Compile_lock, so for them this is not an issue, but 
WB and clearCallSiteContext do not.

Since a handshake may take a long time to complete, and the Compile_lock is used for 
much more than deopts,
the fix in https://bugs.openjdk.org/browse/JDK-8299074 instead always emitted a 
handshake, even when everything was already marked.

This turned out to be problematic for startup; for example, the number of deopt 
handshakes in a Jetty dry-run startup went from 5 to 39.

This fix first adds a barrier that is not passed until the requested deopts have 
happened, and it coalesces the handshakes.
Secondly, it moves the handshake part out of the Compile_lock where possible.

This means we both fix the performance bug and reduce contention on the 
Compile_lock, giving higher throughput in the compiler and in things such as 
class loading.

It passes t1-t7 with flying colours! t8 has not yet completed, and I'm redoing 
some testing due to last-minute simplifications.

Thanks, Robbin
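A rough sketch (hypothetical types and names, not the actual patch) of the coalescing idea: deopt requests are collected in a scope and a single handshake is issued when the scope completes, rather than one handshake per request. The barrier that keeps a competing thread from 'passing' before the deopts complete is omitted here.

// Hypothetical sketch: batch deopt requests and issue one handshake.
#include <vector>

struct NMethod { int id; };
static void mark_not_entrant(NMethod*) { /* stand-in for the real patching   */ }
static void issue_deopt_handshake()    { /* stand-in for one global handshake */ }

class DeoptScope {
  std::vector<NMethod*> _requested;
 public:
  void enlist(NMethod* nm) { _requested.push_back(nm); }
  void deoptimize_all() {
    if (_requested.empty()) return;   // nothing marked, no handshake at all
    for (NMethod* nm : _requested) {
      mark_not_entrant(nm);
    }
    issue_deopt_handshake();          // one handshake for the whole batch
  }
};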

-

Commit messages:
 - Fixed WS
 - Deopt scopes

Changes: https://git.openjdk.org/jdk/pull/12585/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12585&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8300926
  Stats: 380 lines in 24 files changed: 192 ins; 91 del; 97 mod
  Patch: https://git.openjdk.org/jdk/pull/12585.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12585/head:pull/12585

PR: https://git.openjdk.org/jdk/pull/12585


Re: RFR: 8267935: Replace BasicHashtable and Hashtable

2023-01-30 Thread Robbin Ehn
On Sat, 28 Jan 2023 18:32:31 GMT, Afshin Zafari  wrote:

> ### Description
> Hashtable class to be replaced with ResourceHashtable class in all source 
> code.
> 
> ### Patch 
> The only instance was `#include "hashtable.hpp"` in `jvmtiTagMapTable.cpp` 
> and removed.
> The corresponding files (`hashtable.hpp`, `hashtable.inline.hpp` and 
> `hashtable.cpp`) are removed too.
> 
> ### Test
> mach5 tier1 all platforms.

Thank you!

-

Marked as reviewed by rehn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/12275


Re: RFR: 8291555: Implement alternative fast-locking scheme [v4]

2023-01-26 Thread Robbin Ehn
On Thu, 19 Jan 2023 12:09:27 GMT, Roman Kennke  wrote:

> > I always change the flags in code for the reasons David stated and I can't 
> > forget to add any flags.
> > (Test batch is still running, no failures except 
> > MonitorInflationTest.java)
> 
> Ok, right.
> 
> I just re-ran tier1-3 with the flag changed in code, on aarch64 and x86_64 
> without regressions. Will run tier4 overnight.

I did a bigger batch of testing; there were some ThreadMXBean failures with a 
wrong state. I'll investigate; otherwise it passed.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8300913: ZGC: assert(to_addr != 0) failed: Should be forwarded

2023-01-26 Thread Robbin Ehn
On Wed, 25 Jan 2023 15:56:49 GMT, Coleen Phillimore  wrote:

> We thought we didn't need the keep-alive call but we do for heap walking.
> Tested with failing test case locally, and tier1-4 in progress.

Thanks!

-

Marked as reviewed by rehn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/12202


Re: RFR: 8299795: Relativize locals in interpreter frames [v4]

2023-01-20 Thread Robbin Ehn
On Tue, 17 Jan 2023 17:34:53 GMT, Andrew Haley  wrote:

>> Fredrik Bredberg has updated the pull request with a new target base due to 
>> a merge or a rebase. The incremental webrev excludes the unrelated changes 
>> brought in by the merge/rebase. The pull request contains five additional 
>> commits since the last revision:
>> 
>>  - Merge branch 'master' into relativize-locals-JDK-8299795_2023-01-09
>>  - Added references to JDK-8300197
>>  - Updated some copyright dates.
>>  - Changed copyright date to 2023
>>  - 8299795: Relativize locals in interpreter frames
>
>> Implementation of relativized locals in interpreter frames for x86, x64, 
>> arm, aarch64, ppc64le and riscv. Locals are not relativized on zero and s390, but 
>> some changes were made to cope with the changed generic code. Tested tier1-tier8 
>> on supported platforms. The rest was sanity tested using Qemu, except s390, 
>> which was only tested by GitHub Actions.
> 
> Please explain for the record the purpose of this patch. Neither the PR nor 
> the Jira issue explain. I presume it's something to do with making the 
> mounting and unmounting of interpreter frames easier, but this must be said 
> somewhere.

@theRealAph are you planning to provide additional feedback, or can it be 
shipped?

-

PR: https://git.openjdk.org/jdk/pull/11902
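For the record, a hedged sketch of what "relativizing" a frame-internal pointer means (generic C++, hypothetical layout): the value is stored as an offset from the frame's base rather than as an absolute address, so the frame can be copied or moved - e.g. when mounting/unmounting, as Andrew's question suggests - without patching the stored value.

// Hedged sketch: a relativized 'locals' reference is an offset into the
// frame, so copying the frame keeps the reference valid.
#include <cstdint>
#include <cstring>

struct Frame {
  intptr_t slots[16];
  intptr_t locals_offset;              // offset into 'slots', not a raw pointer
  intptr_t* locals() { return slots + locals_offset; }
};

int main() {
  Frame f{};
  f.locals_offset = 4;                 // locals start at slot 4 of this frame
  f.locals()[0] = 42;

  Frame copy{};
  std::memcpy(&copy, &f, sizeof(Frame));   // move/copy the frame elsewhere
  // The offset is still valid in the copy; an absolute pointer would not be.
  return copy.locals()[0] == 42 ? 0 : 1;
}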


Re: RFR: 8291555: Implement alternative fast-locking scheme [v4]

2023-01-19 Thread Robbin Ehn
On Wed, 18 Jan 2023 20:10:21 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol.
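To make the lock-stack described above concrete, a standalone sketch (plain C++; oops modeled as void*; growth, atomics and GC-root scanning omitted):

// Standalone sketch of the per-thread lock-stack: a small array of object
// references, pushed on lock and popped on unlock.
#include <cassert>
#include <cstddef>

class LockStack {
  static const size_t CAPACITY = 8;   // experience: typically only 3-5 entries
  void*  _elems[CAPACITY];
  size_t _top = 0;
 public:
  bool can_push() const { return _top < CAPACITY; }  // checked before locking
  void push(void* obj) { assert(can_push()); _elems[_top++] = obj; }
  void pop(void* obj)  { assert(_top > 0 && _elems[_top - 1] == obj); _top--; }
  // The common question "does the current thread own this lock?" is a
  // simple linear scan of this small array.
  bool contains(void* obj) const {
    for (size_t i = 0; i < _top; i++) {
      if (_elems[i] == obj) return true;
    }
    return false;
  }
};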

Re: RFR: 8291555: Implement alternative fast-locking scheme [v2]

2023-01-18 Thread Robbin Ehn
On Wed, 18 Jan 2023 15:45:38 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol.
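A hedged sketch of the ANONYMOUS_OWNER handover described above (hypothetical types; the real monitor exit protocol is far more involved):

// Hypothetical sketch: a contending thread inflates a fast-locked object
// without knowing the owner and records ANONYMOUS_OWNER; the real owner
// repairs the owner field when it reaches monitorexit.
struct Thread;
static Thread* const ANONYMOUS_OWNER = reinterpret_cast<Thread*>(1);

struct ObjectMonitor {
  Thread* owner;
};

void monitor_exit(ObjectMonitor* m, Thread* self) {
  if (m->owner == ANONYMOUS_OWNER) {
    // Only the thread that fast-locked the object can reach monitorexit while
    // the owner is still anonymous, so it must be us: fix up the owner.
    m->owner = self;
  }
  // ... continue with the normal exit protocol, handing over to the
  // contending thread ...
  m->owner = nullptr;
}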

Re: RFR: 8291555: Implement alternative fast-locking scheme

2023-01-18 Thread Robbin Ehn
On Fri, 28 Oct 2022 20:17:37 GMT, Roman Kennke  wrote:

> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is grown when needed. This means that we need to check for 
> potential overflow before attempting locking. When that is the case, locking 
> fast-paths would call into the runtime to grow the stack and handle the 
> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
> on method entry to avoid (possibly lots) of such checks at locking sites.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear if it is worth to add support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables to simplify (and speed-up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> 
> Testing:
>  - [x] tier1

Re: RFR: 8291555: Implement alternative fast-locking scheme

2023-01-18 Thread Robbin Ehn
On Fri, 28 Oct 2022 20:17:37 GMT, Roman Kennke  wrote:

> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is grown when needed. This means that we need to check for 
> potential overflow before attempting locking. When that is the case, locking 
> fast-paths would call into the runtime to grow the stack and handle the 
> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
> on method entry to avoid (possibly lots) of such checks at locking sites.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear if it is worth to add support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables to simplify (and speed-up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> 
> Testing:
>  - [x] tier1

Re: RFR: 8291555: Implement alternative fast-locking scheme

2023-01-17 Thread Robbin Ehn
On Fri, 28 Oct 2022 20:17:37 GMT, Roman Kennke  wrote:

> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is grown when needed. This means that we need to check for 
> potential overflow before attempting locking. When that is the case, locking 
> fast-paths would call into the runtime to grow the stack and handle the 
> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
> on method entry to avoid (possibly lots) of such checks at locking sites.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear if it is worth to add support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables to simplify (and speed-up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> 
> Testing:
>  - [x] tier1

Re: RFR: 8299795: Relativize locals in interpreter frames [v4]

2023-01-17 Thread Robbin Ehn
On Tue, 17 Jan 2023 08:35:45 GMT, Fredrik Bredberg  wrote:

>> Implementation of relativized locals in interpreter frames for x86, x64, 
>> arm, aarch64, ppc64le and riscv.
>> Locals are not relativized on zero and s390, but some changes were made to cope 
>> with the changed generic code.
>> Tested tier1-tier8 on supported platforms. The rest was sanity tested using 
>> Qemu, except s390, which was only tested by GitHub Actions.
>
> Fredrik Bredberg has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains five additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into relativize-locals-JDK-8299795_2023-01-09
>  - Added references to JDK-8300197
>  - Updated some copyright dates.
>  - Changed copyright date to 2023
>  - 8299795: Relativize locals in interpreter frames

Thank you, looks good!

-

Marked as reviewed by rehn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11902


Re: RFR: 8292741: Convert JvmtiTagMapTable to ResourceHashtable [v7]

2023-01-05 Thread Robbin Ehn
On Thu, 8 Dec 2022 16:56:40 GMT, Afshin Zafari  wrote:

>> test of tier1-5 passed.
>
> Afshin Zafari has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8292741: Convert JvmtiTagMapTable to ResourceHashtable

Looks fine, thanks. (+- DavidH comment)

-

Marked as reviewed by rehn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11288


Re: RFR: 8292741: Convert JvmtiTagMapTable to ResourceHashtable [v7]

2022-12-12 Thread Robbin Ehn
On Mon, 12 Dec 2022 00:58:35 GMT, David Holmes  wrote:

> Also please do not force-push.

FYI: I'm not sure it was even possible to fix the "full of unexpected changes" 
without a force push.
Even if it was, the effort required to fix it without a force push would be 
pretty significant.
So I'd say this is one of those exceptions to the rule :)

-

PR: https://git.openjdk.org/jdk/pull/11288


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-27 Thread Robbin Ehn
On Fri, 28 Oct 2022 03:32:58 GMT, Remi Forax  wrote:

> I have some trouble seeing how it can be implemented, given that because of lock 
> coarsening (+ maybe OSR), the number of times a lock is held is different 
> between the interpreted code and the compiled code.

Correct me if I'm wrong: only C2 eliminates locks, and C2 only compiles if there 
is properly structured locking.
This should mean that when we restore the eliminated locks in deopt, we can 
inflate the recursive locks, which are no longer interleaved, and restructure the 
lock-stack accordingly.

Is there another situation than deopt where it would matter?

-

PR: https://git.openjdk.org/jdk/pull/10590


Re: RFR: 8291555: Replace stack-locking with fast-locking [v7]

2022-10-24 Thread Robbin Ehn
On Mon, 24 Oct 2022 08:03:13 GMT, Roman Kennke  wrote:

>> This change replaces the current stack-locking implementation with a 
>> fast-locking scheme that retains the advantages of stack-locking (namely 
>> fast locking in uncontended code-paths), while avoiding the overload of the 
>> mark word. That overloading causes massive problems with Lilliput, because 
>> it means we have to check and deal with this situation. And because of the 
>> very racy nature, this turns out to be very complex and involved a variant 
>> of the inflation protocol to ensure that the object header is stable. 
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations. The lock-stack is 
>> also a new set of GC roots, and would be scanned during thread scanning, 
>> possibly concurrently, via the normal protocols.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol. 
>> This would be implemented in a separate PR
>> 
>> ### Benchmarks
>> 
>> All benchmarks are run on server-class metal machines. The JVM settings are 
>> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, 
>> less is better.
>> 
>>  DaCapo/AArch64
>> 
>> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
>> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 
>> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
>> exclude cassandra, h2o & kafka benchmarks because of incompati

Re: RFR: 8291555: Replace stack-locking with fast-locking [v4]

2022-10-13 Thread Robbin Ehn
On Thu, 13 Oct 2022 10:34:04 GMT, Roman Kennke  wrote:

> It used to be very stable before that. I have backed out that change, can you 
> try again?

Seems fine now, thanks.

-

PR: https://git.openjdk.org/jdk/pull/10590


Re: RFR: 8291555: Replace stack-locking with fast-locking [v4]

2022-10-13 Thread Robbin Ehn
On Thu, 13 Oct 2022 07:33:48 GMT, Roman Kennke  wrote:

>> This change replaces the current stack-locking implementation with a 
>> fast-locking scheme that retains the advantages of stack-locking (namely 
>> fast locking in uncontended code-paths), while avoiding the overload of the 
>> mark word. That overloading causes massive problems with Lilliput, because 
>> it means we have to check and deal with this situation. And because of the 
>> very racy nature, this turns out to be very complex and involved a variant 
>> of the inflation protocol to ensure that the object header is stable. 
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations. The lock-stack is 
>> also a new set of GC roots, and would be scanned during thread scanning, 
>> possibly concurrently, via the normal protocols.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol. 
>> This would be implemented in a separate PR
>> 
>> ### Benchmarks
>> 
>> All benchmarks are run on server-class metal machines. The JVM settings are 
>> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, 
>> less is better.
>> 
>>  DaCapo/AArch64
>> 
>> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
>> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 
>> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
>> exclude cassandra, h2o & kafka benchmarks because of incompati

Re: RFR: 8291555: Replace stack-locking with fast-locking [v3]

2022-10-11 Thread Robbin Ehn
On Tue, 11 Oct 2022 20:01:32 GMT, Roman Kennke  wrote:

>> This change replaces the current stack-locking implementation with a 
>> fast-locking scheme that retains the advantages of stack-locking (namely 
>> fast locking in uncontended code-paths), while avoiding the overload of the 
>> mark word. That overloading causes massive problems with Lilliput, because 
>> it means we have to check and deal with this situation. And because of the 
>> very racy nature, this turns out to be very complex and involved a variant 
>> of the inflation protocol to ensure that the object header is stable. 
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations. The lock-stack is 
>> also a new set of GC roots, and would be scanned during thread scanning, 
>> possibly concurrently, via the normal protocols.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thus handing it over to the contending 
>> thread.
>> 
>> As an alternative, I considered removing stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables us to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol. 
>> This would be implemented in a separate PR
>> 
>> ### Benchmarks
>> 
>> All benchmarks are run on server-class metal machines. The JVM settings are 
>> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, 
>> less is better.
>> 
>>  DaCapo/AArch64
>> 
>> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
>> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 
>> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
>> exclude cassandra, h2o & kafka benchmarks because of incompati

Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke  wrote:

>>> How have you handled the interpreter lock-stack-area in your 
>>> implementation? Is it worth to get rid of it and consolidate with the 
>>> per-thread lock-stack?
>> 
>> At the moment I had to store a "frame id" for each entry in the lock stack.
>> The frame id is previous fp, grabbed from "link()" when entering the locking 
>> code.
>> 
>> ```
>> private static final void monitorEnter(Object o) {
>>     long monitorFrameId = getCallerFrameId();
>> ```
>> When popping we can thus check if there is still monitors/locks for the 
>> frame to be popped.
>> Remove activation reads the lock stack, with a bunch of assembly, e.g.:
>> `  access_load_at(T_INT, IN_HEAP, rax, Address(rax, 
>> java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);
>> `
>> If we would keep this, loom freezing would need to relativize and 
>> derelativize these values.
>> (we only have interpreter)
>> 
>> But, according to JVMS 2.11.10. the VM only needs to automatically unlock 
>> synchronized method.
>> This code that unlocks all locks in the frame seems to have been added for 
>> JLS 17.1.
>> I have asked for clarification and we only need and should care about JVMS.
>> 
>> So if we could make popframe do more work (popframe needs to unlock all), 
>> there seems to be way forward allowing more flexibility.
>> 
>> Still working on trying to make what we have public, even if it's in roughly 
>> shape and it's very unclear if that is the correct approach at all.
>
>> > How have you handled the interpreter lock-stack-area in your 
>> > implementation? Is it worth to get rid of it and consolidate with the 
>> > per-thread lock-stack?
>> 
>> At the moment I had to store a "frame id" for each entry in the lock stack. 
>> The frame id is previous fp, grabbed from "link()" when entering the locking 
>> code.
>> 
>> ```
>> private static final void monitorEnter(Object o) {
>> 
>> long monitorFrameId = getCallerFrameId();
>> ```
>> 
>> When popping we can thus check if there is still monitors/locks for the 
>> frame to be popped. Remove activation reads the lock stack, with a bunch of 
>> assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, 
>> java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep 
>> this, loom freezing would need to relativize and derelativize these values. 
>> (we only have interpreter)
> 
> Hmm ok. I was thinking something similar, but instead of storing pairs 
> (oop/frame-id), push frame-markers on the lock-stack.
> 
> But given that we only need all this for the interpreter, I am wondering if 
> keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) 
> is the saner thing to do. The overhead seems very small, perhaps very similar 
> to keeping track of frames in the per-thread lock-stack.
> 
>> But, according to JVMS 2.11.10. the VM only needs to automatically unlock 
>> synchronized method. This code that unlocks all locks in the frame seems to 
>> have been added for JLS 17.1. I have asked for clarification and we only 
>> need and should care about JVMS.
>> 
>> So if we could make popframe do more work (popframe needs to unlock all), 
>> there seems to be way forward allowing more flexibility.
> 
>> Still working on trying to make what we have public, even if it's in roughly 
>> shape and it's very unclear if that is the correct approach at all.
> 
> Nice!
> From your snippets above I am gleaning that your implementation has the 
> actual lock-stack in Java. Is that correct? Is there a particular reason why 
> you need this? Is this for Loom? Would the implementation that I am proposing 
> here also work for your use-case(s)?
> 
> Thanks,
> Roman

@rkennke I will have a look, but may I suggest opening a new PR and just 
referencing this one as background discussion?
I think most of the comments above are not relevant enough for a new reviewer 
to struggle through.
What do you think?

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke  wrote:

> Nice! From your snippets above I am gleaning that your implementation has the 
> actual lock-stack in Java. Is that correct? Is there a particular reason why 
> you need this? Is this for Loom? Would the implementation that I am proposing 
> here also work for your use-case(s)?
> 

Yes, the entire implementation is in Java.

```
void push(Object lockee, long fid) {
    if (this != Thread.currentThread()) Monitor.abort("invariant");
    if (lockStackPos == lockStack.length) {
        grow();
    }
    frameId[lockStackPos] = fid;
    lockStack[lockStackPos++] = lockee;
}
```
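
As a companion to the snippet above, here is a minimal, self-contained sketch 
of the structure it pushes onto: a per-thread lock-stack whose entries are 
tagged with the owning frame id, so that a frame pop can unlock whatever is 
left for that frame and "does the current thread own this?" is a quick scan. 
All names and helpers here are mine for illustration, not the actual 
implementation.

```
// Illustrative only - a stand-alone model of the per-thread lock-stack
// described above, not the real runtime code.
final class LockStackSketch {
    private Object[] lockStack = new Object[8];
    private long[]   frameId   = new long[8];
    private int      lockStackPos;

    void push(Object lockee, long fid) {
        if (lockStackPos == lockStack.length) grow();
        frameId[lockStackPos] = fid;
        lockStack[lockStackPos++] = lockee;
    }

    // Frame pop: hand back every entry still tagged with this frame id so the
    // caller can unlock it (what remove_activation checks in assembly today).
    void popFrame(long fid, java.util.function.Consumer<Object> unlock) {
        while (lockStackPos > 0 && frameId[lockStackPos - 1] == fid) {
            Object o = lockStack[--lockStackPos];
            lockStack[lockStackPos] = null;   // keep the GC-root array tidy
            unlock.accept(o);
        }
    }

    // "Does the current thread own this lock?" is a quick linear scan.
    boolean owns(Object o) {
        for (int i = 0; i < lockStackPos; i++) {
            if (lockStack[i] == o) return true;
        }
        return false;
    }

    private void grow() {
        lockStack = java.util.Arrays.copyOf(lockStack, lockStack.length * 2);
        frameId   = java.util.Arrays.copyOf(frameId, frameId.length * 2);
    }
}
```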


We are starting from the premise of doing everything in Java.
I want smart people to be able to change the implementation.
So I really don't like the hardcoded assembly in remove_activation which does 
this check on the frame id on the lock stack.
If we can make the changes to e.g. popframe and take a somewhat different 
approach to the JVMS, we may have a fully flexible Java implementation.
But a flexible Java implementation means the compiler can't have intrinsics, 
so what will the performance be?
We have more loose ends than we can handle at the moment.

Your code may be usable for JOM if we lock the implementation to using a 
lock-stack and we are going to write intrinsics for it.
There is no point in it being in Java in that case, IMHO.

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Fri, 9 Sep 2022 19:01:14 GMT, Roman Kennke  wrote:

> How have you handled the interpreter lock-stack-area in your implementation? 
> Is it worth to get rid of it and consolidate with the per-thread lock-stack?

At the moment I had to store a "frame id" for each entry in the lock stack.
The frame id is the previous fp, grabbed from "link()" when entering the 
locking code.

```
private static final void monitorEnter(Object o) {
    long monitorFrameId = getCallerFrameId();
```
When popping we can thus check whether there are still monitors/locks for the 
frame to be popped.
Remove activation reads the lock stack, with a bunch of assembly, e.g.:
`access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);`
If we were to keep this, Loom freezing would need to relativize and 
derelativize these values.
(we only have the interpreter)

But, according to JVMS 2.11.10, the VM only needs to automatically unlock 
synchronized methods.
The code that unlocks all locks in the frame seems to have been added for JLS 
17.1.
I have asked for clarification, and we only need to, and should, care about 
the JVMS.

So if we could make popframe do more work (popframe needs to unlock all), 
there seems to be a way forward allowing more flexibility.

Still working on trying to make what we have public, even if it's in rough 
shape and it's very unclear whether that is the correct approach at all.

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Tue, 16 Aug 2022 16:21:04 GMT, Roman Kennke  wrote:

> Strictly speaking, I believe the conditions check for the (weaker) balanced 
> property, but not for the (stronger) structured property.

I know, but the text says:
- "every exit on a given monitor matches a preceding entry on that monitor."
- "implementations of the Java Virtual Machine are permitted but not required 
to enforce both of the following two rules guaranteeing structured locking"

I read this as: if the rules do not guarantee structured locking, the rules 
are not correct.
The VM is allowed to enforce it.
But that's just my take on it.

EDIT:
Maybe I'm reading too much into it.
Locking A, B and then unlocking A, B is maybe considered structured locking?

But then again what if:


```
void foo_lock() {
  monitorenter(A);
  monitorenter(B);
  // If VM abruptly returns here
  // VM can unlock them in reverse order first B and then A ?
  monitorexit(A);
  monitorexit(B);
}
```
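
To make the balanced-vs-structured distinction concrete, here is a small 
stand-alone checker (purely my own illustration, not VM or test code). It 
treats a monitorenter/monitorexit trace as "balanced" when it satisfies the 
two JVMS 2.11.10 rules discussed in this thread, and as "structured" (properly 
nested) only when exits also come in reverse entry order; the hand-over-hand 
trace above is balanced but not structured.

```
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: classifies a monitorenter/monitorexit trace.
final class LockTraceCheck {
    record Op(boolean enter, String monitor) {}

    // "Balanced" in the sense of the two JVMS 2.11.10 rules: per monitor,
    // exits never outnumber prior entries, and everything entered is exited.
    static boolean balanced(List<Op> trace) {
        Map<String, Integer> held = new HashMap<>();
        for (Op op : trace) {
            int c = held.getOrDefault(op.monitor(), 0) + (op.enter() ? 1 : -1);
            if (c < 0) return false;
            held.put(op.monitor(), c);
        }
        return held.values().stream().allMatch(c -> c == 0);
    }

    // "Structured" read as proper nesting: exits come in reverse entry order.
    static boolean structured(List<Op> trace) {
        Deque<String> stack = new ArrayDeque<>();
        for (Op op : trace) {
            if (op.enter()) {
                stack.push(op.monitor());
            } else if (stack.isEmpty() || !stack.pop().equals(op.monitor())) {
                return false;
            }
        }
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        // enter A, enter B, exit A, exit B - the hand-over-hand case above.
        List<Op> trace = List.of(new Op(true, "A"), new Op(true, "B"),
                                 new Op(false, "A"), new Op(false, "B"));
        System.out.println(balanced(trace));   // true
        System.out.println(structured(trace)); // false
    }
}
```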

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Thu, 11 Aug 2022 11:39:01 GMT, Robbin Ehn  wrote:

>>> > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build 
>>> > problem (seems weird that GHA did not catch it).
>>> 
>>> NP, thanks. I notice some other user of owning_thread_from_monitor_owner() 
>>> such as DeadlockCycle::print_on_with() which asserts on "assert(adr != 
>>> reinterpret_cast(1)) failed: must convert to lock object".
>> 
>> Do you know by any chance which tests trigger this?
>
>> > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build 
>> > > problem (seems weird that GHA did not catch it).
>> > 
>> > 
>> > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() 
>> > such as DeadlockCycle::print_on_with() which asserts on "assert(adr != 
>> > reinterpret_cast(1)) failed: must convert to lock object".
>> 
>> Do you know by any chance which tests trigger this?
> 
> Yes, there is a couple of to choose from, I think the jstack cmd may be 
> easiest: jstack/DeadlockDetectionTest.java

> @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find 
> the comment, though) that we might need to support the bytecode sequence 
> monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I 
> have now made a testcase that checks this, and it does indeed fail with this 
> PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere 
> that it is required that monitorenter/exit are properly nested. I'll have to 
> fix this in the interpreter (JIT compilers refuse to compile 
> not-properly-nested monitorenter/exit anyway).
> 
> See 
> https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm

JVMS 2.11.10:

> Structured locking is the situation when, during a method invocation, every 
> exit on a given monitor matches a preceding entry on that monitor. Since 
> there is no assurance that all code submitted to the Java Virtual Machine 
> will perform structured locking, implementations of the Java Virtual Machine 
> are permitted but not required to enforce both of the following two rules 
> guaranteeing structured locking. Let T be a thread and M be a monitor. Then:
> 
> The number of monitor entries performed by T on M during a method invocation 
> must equal the number of monitor exits performed by T on M during the method 
> invocation whether the method invocation completes normally or abruptly.
> 
> At no point during a method invocation may the number of monitor exits 
> performed by T on M since the method invocation exceed the number of monitor 
> entries performed by T on M since the method invocation.
> 
> Note that the monitor entry and exit automatically performed by the Java 
> Virtual Machine when invoking a synchronized method are considered to occur 
> during the calling method's invocation.

I think the intent of the above was to allow enforcing structured locking.

In other relevant projects, we support only structured locking in Java, but 
permit some unstructured locking when done via JNI.
In that project, JNI monitor enter/exit does not use the lock stack.

I don't think we fully support unstructured locking today either:

```
void foo_lock() {
  monitorenter(this);
  // If VM abruptly returns here 'this' will be unlocked
  // Because VM assumes structured locking.
  // see e.g. remove_activation(...)
}
```


*I scratched this as it was a bit off topic.*

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Thu, 11 Aug 2022 11:19:31 GMT, Roman Kennke  wrote:

> > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build 
> > > problem (seems weird that GHA did not catch it).
> > 
> > 
> > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() 
> > such as DeadlockCycle::print_on_with() which asserts on "assert(adr != 
> > reinterpret_cast(1)) failed: must convert to lock object".
> 
> Do you know by any chance which tests trigger this?

Yes, there are a couple to choose from; I think the jstack cmd may be the 
easiest: jstack/DeadlockDetectionTest.java

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Tue, 9 Aug 2022 10:46:51 GMT, Roman Kennke  wrote:

> Thanks for giving this PR a spin. I pushed a fix for the aarch64 build 
> problem (seems weird that GHA did not catch it).

NP, thanks.
I noticed some other users of owning_thread_from_monitor_owner(), such as 
DeadlockCycle::print_on_with(), which assert on "assert(adr != 
reinterpret_cast(1)) failed: must convert to lock object".

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Mon, 8 Aug 2022 18:29:54 GMT, Roman Kennke  wrote:

> > I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in:
> > ```
> > #7  0x7f7cefc5c1ce in Thread::is_lock_owned 
> > (this=this@entry=0x7f7ce801dd90, adr=adr@entry=0x1 <error: Cannot access 
> > memory at address 0x1>) at 
> > /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549
> > #8  0x7f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, 
> > adr=0x1 <error: Cannot access memory at address 0x1>) at 
> > /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979
> > #9  0x7f7cefc79ab0 in Threads::owning_thread_from_monitor_owner 
> > (t_list=<optimized out>, owner=owner@entry=0x1 <error: Cannot access memory 
> > at address 0x1>)
> > at 
> > /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382
> > ```
> 
> Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an 
> anonymous owner must be converted to the oop address before passing down to 
> Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? 
> Testing com/sun/jdi passes for me, now.

Yes, that fixed it. I'm running more tests also.

I got this build problem on aarch64:

```
open/src/hotspot/share/asm/assembler.hpp:168), pid=3387376, tid=3387431
#  assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target

V  [libjvm.so+0x4f4788]  Label::~Label()+0x48
V  [libjvm.so+0x424a44]  cmpFastLockNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x764
V  [libjvm.so+0x1643888]  PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x538
V  [libjvm.so+0xa85fcc]  Compile::Code_Gen()+0x3bc
```

-

PR: https://git.openjdk.org/jdk/pull/9680


Re: RFR: 8291555: Replace stack-locking with fast-locking

2022-10-06 Thread Robbin Ehn
On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke  wrote:

> This change replaces the current stack-locking implementation with a 
> fast-locking scheme that retains the advantages of stack-locking (namely fast 
> locking in uncontended code-paths), while avoiding the overload of the mark 
> word. That overloading causes massive problems with Lilliput, because it 
> means we have to check and deal with this situation. And because of the very 
> racy nature, this turns out to be very complex and involved a variant of the 
> inflation protocol to ensure that the object header is stable. 
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations. The lock-stack is also a new set of GC roots, and would be 
> scanned during thread scanning, possibly concurrently, via the normal 
> protocols.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear whether it is worth adding support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, thus handing it over to the contending 
> thread.
> 
> As an alternative, I considered removing stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables us to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> ### Benchmarks
> 
> All benchmarks are run on server-class metal machines. The JVM settings are 
> always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less 
> is better.
> 
>  DaCapo/AArch64
> 
> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an 
> AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 
> 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to 
> exclude cassandra, h2o & kafka benchmarks because of incompatibility with 
> JDK20. Benchmarks that showed results far off the baseline or showed 

Re: RFR: 8294406: Test runtime/handshake/HandshakeDirectTest.java failed: JVMTI_ERROR_WRONG_PHASE

2022-09-30 Thread Robbin Ehn
On Fri, 30 Sep 2022 01:32:44 GMT, Leonid Mesnik  wrote:

> 8294406: Test runtime/handshake/HandshakeDirectTest.java failed: 
> JVMTI_ERROR_WRONG_PHASE

The reason for it being a daemon is that this thread should try to 
suspend/resume while we are terminating, thus trying to hit those rare exit 
paths where a thread exits with a pending suspend.

-

PR: https://git.openjdk.org/jdk/pull/10502


Re: RFR: 8293592: Remove JVM_StopThread, stillborn, and related cleanup [v2]

2022-09-27 Thread Robbin Ehn
On Fri, 23 Sep 2022 21:47:06 GMT, David Holmes  wrote:

>> Now that Thread.stop has been degraded to throw 
>> `UnsupportedOperationException` (JDK-8299610) the only direct source of 
>> async exceptions is from JVMTI `StopThread`. We can remove the 
>> `JVM_StopThread` code, remove the `stillborn` field from `java.lang.Thread` 
>> and its associated accesses from the VM, and we can stop special-casing 
>> `ThreadDeath` handling (as was done for the JDK code as part of JDK-8299610).
>> 
>> Note that JVMTI `StopThread` can only act on a thread that is alive, so it 
>> is no longer possible to stop a thread before it has been started.
>> 
>> Also note that there is a change in behaviour for JNI `ExceptionDescribe` as 
>> it no longer ignores `ThreadDeath` exceptions (not that it was ever 
>> specified to ignore them, it simply mirrored the behaviour of the default 
>> `UncaughtExceptionHandler` in `java.lang.ThreadGroup` - which also no longer 
>> ignores them so the mirroring behaviour remains the same).
>> 
>> Testing: tiers 1-3
>
> David Holmes has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 16 commits:
> 
>  - Merge branch 'master' into 8293592-JVM_StopThread
>  - Removal all special handling of ThreadDeath.
>  - Remove all references to the stillborn field
>  - Initial commit: remove JVM_StopThread
>  - Merge
>  - Updates to Java Thread Primitive Deprecation page
>  - Repalce "it" with "victim thread"
>  - Merge
>  - Revert test/langtools/ProblemList.txt as jshell tests no longer rely on 
> Thread.stop
>  - become -> became in javadoc
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/3675f4c2...4eb07903

Looks good, thanks for fixing.

-

Marked as reviewed by rehn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/10400


Re: RFR: 8288752: Split thread implementation files [v2]

2022-06-21 Thread Robbin Ehn
On Tue, 21 Jun 2022 12:22:46 GMT, Coleen Phillimore  wrote:

>> This change splits thread.hpp/cpp into javaThread and threads files.
>> 
>> I left the commits intact to see better the progression of changes, but most 
>> files are include file changes.  The only tricky parts are that some files 
>> were included in thread.hpp, like mutexLocker.cpp, which has to be included 
>> in the files that need it.
>> 
>> I didn't update the copyrights to save diffs but will before integration.
>> Also I won't integrate until after Dan's JDK 19 changes are merged into JDK 
>> 20.
>> 
>> Tested with tier1-4 on Oracle supported platforms and built on other 
>> platforms.
>
> Coleen Phillimore has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Fix Shenandoah build.

Looks good, thanks.

-

Marked as reviewed by rehn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9218