On Wed, 5 Jan 2022 17:23:40 GMT, Paul Sandoz <[email protected]> wrote:
>> This patch fixes a performance issue when dereferencing memory segments
>> backed by different kinds of scopes. When looking at inline traces, I found
>> that one of our benchmarks, namely `LoopOverPollutedSegments`, was already
>> hitting the ceiling of the bimorphic inline cache, specifically when
>> checking liveness of the segment scope in the memory access hotpath
>> (`ResourceScopeImpl::checkValidState`). The benchmark only used segments
>> backed by confined and global scopes. I then added (in the initialization
>> "polluting" loop) segments backed by a shared scope, after which the
>> benchmark numbers looked as follows:
>>
>>
>> Benchmark                                              Mode  Cnt  Score   Error  Units
>> LoopOverPollutedSegments.heap_segment_floats_VH        avgt   30  7.004 ± 0.089  ms/op
>> LoopOverPollutedSegments.heap_segment_floats_instance  avgt   30  7.159 ± 0.016  ms/op
>> LoopOverPollutedSegments.heap_segment_ints_VH          avgt   30  7.017 ± 0.110  ms/op
>> LoopOverPollutedSegments.heap_segment_ints_instance    avgt   30  7.175 ± 0.048  ms/op
>> LoopOverPollutedSegments.heap_unsafe                   avgt   30  0.243 ± 0.004  ms/op
>> LoopOverPollutedSegments.native_segment_VH             avgt   30  7.366 ± 0.036  ms/op
>> LoopOverPollutedSegments.native_segment_instance       avgt   30  7.305 ± 0.098  ms/op
>> LoopOverPollutedSegments.native_unsafe                 avgt   30  0.238 ± 0.002  ms/op
>>
>>
>> That is, since now we have *three* different kinds of scopes (confined,
>> shared and global), the call to the liveness check can no longer be inlined.
>> One solution could be, as we do for the *base* accessor, to add a scope
>> accessor to all memory segment implementation classes. But doing so only
>> works well for heap segments (for which the scope accessor just returns the
>> global scope constant). For native segments, the call site would still be
>> megamorphic (as a native segment can be backed by all kinds of scopes).
>>
>> In the end, it turned out to be much simpler to just make the liveness check
>> monomorphic, since there's so much sharing between the code paths already.
>> With that change, the numbers for the tweaked benchmark go back to normal:
>>
>>
>> Benchmark                                              Mode  Cnt  Score   Error  Units
>> LoopOverPollutedSegments.heap_segment_floats_VH        avgt   30  0.241 ± 0.003  ms/op
>> LoopOverPollutedSegments.heap_segment_floats_instance  avgt   30  0.244 ± 0.003  ms/op
>> LoopOverPollutedSegments.heap_segment_ints_VH          avgt   30  0.242 ± 0.003  ms/op
>> LoopOverPollutedSegments.heap_segment_ints_instance    avgt   30  0.248 ± 0.001  ms/op
>> LoopOverPollutedSegments.heap_unsafe                   avgt   30  0.247 ± 0.013  ms/op
>> LoopOverPollutedSegments.native_segment_VH             avgt   30  0.245 ± 0.004  ms/op
>> LoopOverPollutedSegments.native_segment_instance       avgt   30  0.245 ± 0.001  ms/op
>> LoopOverPollutedSegments.native_unsafe                 avgt   30  0.247 ± 0.005  ms/op
>>
>>
>> Note that this patch also tidies up the usage of `checkValidState` vs.
>> `checkValidStateSlow`. The former should only really be used in the hot
>> path, while the latter is a more general routine which should be used in
>> non-performance-critical code. Making `checkValidState` monomorphic caused
>> `ScopeAccessError` to be generated in more places, so I needed either to
>> update the usage to use the safer `checkValidStateSlow` (where possible)
>> or (in `Buffer` and `ConfinedScope`) to add extra wrapping.
>
> src/jdk.incubator.foreign/share/classes/jdk/internal/foreign/ResourceScopeImpl.java
> line 190:
>
>> 188: @ForceInline
>> 189: public final void checkValidState() {
>> 190: if (owner != null && owner != Thread.currentThread()) {
>
> For consistency, we could change the code of `checkValidStateSlow` to refer
> directly to `owner`.
>
> It would be satisfying, but I don't know if it's possible, to compose
> `checkValidStateSlow` from `checkValidState` e.g.
>
> public final void checkValidStateSlow() {
>     checkValidState();
>     if (!isAlive()) { ... }
> }
I'll change the use of `owner`. It's not really possible to write
`checkValidStateSlow` in terms of `checkValidState`, because the latter does a
plain read of the state, whereas the former does a volatile read. Reusing one
from the other would result in two reads (a plain and a volatile).
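
The two-reads point can be sketched as follows. This is a hypothetical, heavily simplified model, not the actual `ResourceScopeImpl` code: the `Scope` class, its field names, and the error messages are invented for illustration. The idea is that the hot path does a plain read of the state field, while the slow path reads the same field with volatile semantics (via a `VarHandle`); delegating from one to the other would issue both reads.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class ScopeSketch {

    static class Scope {
        static final int OPEN = 0, CLOSED = -1;

        final Thread owner;   // owning thread, or null for shared/global scopes
        int state = OPEN;     // plain field; volatile reads go through STATE

        static final VarHandle STATE;
        static {
            try {
                STATE = MethodHandles.lookup()
                        .findVarHandle(Scope.class, "state", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        Scope(Thread owner) { this.owner = owner; }

        // Hot path: a single final (hence monomorphic) method; plain read of state.
        final void checkValidState() {
            if (owner != null && owner != Thread.currentThread()) {
                throw new IllegalStateException("Attempt to access outside owner thread");
            }
            if (state == CLOSED) {
                throw new IllegalStateException("Already closed");
            }
        }

        // Slow path: volatile read of state. Written independently rather than
        // calling checkValidState(), since that would add a redundant plain read.
        final void checkValidStateSlow() {
            if (owner != null && owner != Thread.currentThread()) {
                throw new IllegalStateException("Attempt to access outside owner thread");
            }
            if ((int) STATE.getVolatile(this) == CLOSED) {
                throw new IllegalStateException("Already closed");
            }
        }
    }

    public static void main(String[] args) {
        Scope confined = new Scope(Thread.currentThread());
        confined.checkValidState();       // ok: same thread, still open
        confined.checkValidStateSlow();   // ok: volatile read also sees OPEN
        Scope shared = new Scope(null);
        shared.checkValidState();         // ok: no owner to check
        System.out.println("ok");
    }
}
```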
-------------
PR: https://git.openjdk.java.net/jdk18/pull/82