On Mon, 15 Jul 2024 00:50:30 GMT, Axel Boldt-Christmas <abold...@openjdk.org> 
wrote:

>> When inflating a monitor the `ObjectMonitor*` is written directly over the 
>> `markWord` and any overwritten data is displaced into a displaced 
>> `markWord`. This is problematic for concurrent GCs which needs extra care or 
>> looser semantics to use this displaced data. In Lilliput this data also 
>> contains the klass forcing this to be something that the GC has to take into 
>> account everywhere.
>> 
>> This patch introduces an alternative solution where locking only uses the 
>> lock bits of the `markWord` and inflation does not override and displace the 
>> `markWord`. This is done by keeping associations between objects and 
>> `ObjectMonitor*` in an external hash table. Different caching techniques are 
>> used to speedup lookups from compiled code.
>> 
>> A diagnostic VM option is introduced called `UseObjectMonitorTable`. It is 
>> only supported in combination with the LM_LIGHTWEIGHT locking mode (the 
>> default). 
>> 
>> This patch has been evaluated to be performance neutral when 
>> `UseObjectMonitorTable` is turned off (the default). 
>> 
>> Below is a more detailed explanation of this change and how `LM_LIGHTWEIGHT` 
>> and `UseObjectMonitorTable` works.
>> 
>> # Cleanups
>> 
>> Cleaned up displaced header usage for:
>>   * BasicLock
>>     * Contains some Zero changes
>>     * Renames one exported JVMCI field
>>   * ObjectMonitor
>>     * Updates comments and tests consistencies
>> 
>> # Refactoring
>> 
>> `ObjectMonitor::enter` has been refactored an a 
>> `ObjectMonitorContentionMark` witness object has been introduced to the 
>> signatures. Which signals that the contentions reference counter is being 
>> held. More details are given below in the section about deflation.
>> 
>> The initial purpose of this was to allow `UseObjectMonitorTable` to interact 
>> more seamlessly with the `ObjectMonitor::enter` code. 
>> 
>> _There is even more `ObjectMonitor` refactoring which can be done here to 
>> create a more understandable and enforceable API. There are a handful of 
>> invariants / assumptions which are not always explicitly asserted which 
>> could be trivially abstracted and verified by the type system by using 
>> similar witness objects._
>> 
>> # LightweightSynchronizer
>> 
>> Working on adapting and incorporating the following section as a comment in 
>> the source code
>> 
>> ## Fast Locking
>> 
>>   CAS on locking bits in markWord. 
>>   0b00 (Fast Locked) <--> 0b01 (Unlocked)
>> 
>>   When locking and 0b00 (Fast Locked) is observed, it may be beneficial to 
>> avoid inflating by spinning a bit.
>> 
>>   If 0b10 (Inflated) is observed or there is to...
>
> Axel Boldt-Christmas has updated the pull request incrementally with 10 
> additional commits since the last revision:
> 
>  - Remove try_read
>  - Add explicit to single parameter constructors
>  - Remove superfluous access specifier
>  - Remove unused include
>  - Update assert message OMCache::set_monitor
>  - Fix indentation
>  - Remove outdated comment LightweightSynchronizer::exit
>  - Remove logStream include
>  - Remove strange comment
>  - Fix javaThread include

Another review pass by me. It looks to me like the cache lookup can be 
improved, see comments below.

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 323:

> 321:       ldr(t1, Address(t3_t));
> 322:       cmp(obj, t1);
> 323:       br(Assembler::EQ, monitor_found);

I think the loop could be optimized a bit, if we start with the (cache_address) 
- 1 in t3, then increment t3 at the start of the loop, and let the success-case 
fall-through and only branch back to loop-start or to failure-path. Something 
like:

bind(loop);
increment(t3_t, in_bytes(OMCache::oop_to_oop_difference()));
ldr(t1, Address(t3_t));
cbnz(t1, loop);
cmp(obj, t1);
br(Assembler::NE, loop);
// Success

Advantage would be that we have no forward-branch in the fast/expected case. 
CPU static branch prediction tends to not like that. I'm not sure if if makes a 
difference, though. Also, if you do that, then the unrolled loop also needs 
corresponding adjustment.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 674:

> 672: 
> 673:       // Search for obj in cache.
> 674:       bind(loop);

Same loop transformation would be possible here.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 776:

> 774:     movl(top, Address(thread, JavaThread::lock_stack_top_offset()));
> 775: 
> 776:     if (!UseObjectMonitorTable) {

Why is the mark loaded here in the !UOMT case, but later in the +UOMT case?

-------------

PR Review: https://git.openjdk.org/jdk/pull/20067#pullrequestreview-2179942149
PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1679210139
PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1679313050
PR Review Comment: https://git.openjdk.org/jdk/pull/20067#discussion_r1679315158

Reply via email to