On Mon, 11 Nov 2024 14:51:06 GMT, Per Minborg <[email protected]> wrote:
>> Thanks @minborg for this :) Please remember to add the misprediction count
>> if you can and avoid the bulk methods by having a `nextMemorySegment()`
>> benchmark method which make a single fill call site to observe the different
>> segments (types).
>>
>> Having separate call-sites which observe always the same type(s) "could" be
>> too lucky (and gentle) for the runtime (and CHA) and would favour to have a
>> single address entry (or few ones, if we include any optimization for the
>> fill size) in the Branch Target Buffer of the cpu.
>
> I've added a "mixed" benchmark. I am not sure I understood all of your
> comments but given my changes, maybe you could elaborate a bit more?
@minborg sent me some logs from his machine, and I'm analyzing them now.
Basically, I'm trying to see why your Java code is a bit faster than the Loop
code.
----------------
44.77% c2, level 4
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop,
version 4, compile id 946
24.43% c2, level 4
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop,
version 4, compile id 946
21.80% c2, level 4
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop,
version 4, compile id 946
There seem to be 3 hot regions.
**main-loop** (region has 44.77%):
;; B33: # out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner
main of N116 strip mined) Freq: 4.62951e+10
0.50% ? 0x00000001149a23c0: sxtw x20, w4
? 0x00000001149a23c4: add x22, x16, x20
0.02% ? 0x00000001149a23c8: str q16, [x22]
16.33% ? 0x00000001149a23cc: str q16, [x22, #16]
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
? ; -
java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
? ; -
java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
? ; -
java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
? ; -
java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
(line 101)
? 0x00000001149a23d0: add w4, w4, #0x20
0.06% ? 0x00000001149a23d4: cmp w4, w10
? 0x00000001149a23d8: b.lt 0x00000001149a23c0 //
b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
**post-loops**: the "vectorized post-loop" and the "single iteration post-loop"
(region has 24.43%):
vectorized post-loop (inner post)
? ? ;; B14: # out( B14 B15 ) <- in( B35 B14 ) Loop(
B14-B14 inner post of N1915) Freq: 174420
2.20% ?? ? 0x00000001149a224c: sxtw x5, w4
0.88% ?? ? 0x00000001149a2250: str q16, [x16, x5]
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
?? ? ;
- jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
?? ? ;
- jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
?? ? ;
- java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
?? ? ;
- java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
?? ? ;
- java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
?? ? ;
- java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
?? ? ;
- jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
?? ? ;
-
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
(line 101)
?? ? 0x00000001149a2254: add w4, w4, #0x10
?? ? 0x00000001149a2258: cmp w4, w10
?? ? 0x00000001149a225c: b.lt 0x00000001149a224c //
b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
? ? ;
-
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33
(line 100)
? ? ;; B15: # out( B16 ) <- in( B14 ) Freq: 87210.2
0.34% ? ? 0x00000001149a2260: add x10, x19, x5
? ? 0x00000001149a2264: add x22, x10, #0x10
;*ladd {reexecute=0 rethrow=0 return_oop=0}
? ? ;
-
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@52
(line 100)
? ? ;; B16: # out( B20 B17 ) <- in( B39 B15 B36 )
top-of-loop Freq: 174421
0.78% ? ? 0x00000001149a2268: cmp w4, w3
? ? ? 0x00000001149a226c: b.ge 0x00000001149a2294 // b.tcont
? ? ? ;; B17: # out( B42 B18 ) <- in( B16 ) Freq: 87210.3
? ? ? 0x00000001149a2270: cmp w4, w2
? ? ? 0x00000001149a2274: b.cs 0x00000001149a24a4 // b.hs,
b.nlast
? ? ?
;*aload {reexecute=0 rethrow=0 return_oop=0}
? ? ? ;
-
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@36
(line 101)
scalar post loop:
? ? ? ;; B18: # out( B18 B19 ) <- in( B17 B18 ) Loop(
B18-B18 inner post of N1402) Freq: 174420
0.56% ? ??? 0x00000001149a2278: sxtw x10, w4
5.47% ? ??? 0x00000001149a227c: strb wzr, [x16, x10, lsl #0]
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
? ??? ;
- jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
? ??? ;
- jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
? ??? ;
- java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
? ??? ;
- java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
? ??? ;
- java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
? ??? ;
- java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
? ??? ;
- jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
? ??? ;
-
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
(line 101)
? ??? 0x00000001149a2280: add w4, w4, #0x1
? ??? 0x00000001149a2284: cmp w4, w3
? ??? 0x00000001149a2288: b.lt 0x00000001149a2278 // b.tstop
Not sure why we have this below... probably the check that leads to the
post-loop?
? ? ? ;; B19: # out( B20 ) <- in( B18 ) Freq: 87210.2
8.88% ? ? ? 0x00000001149a228c: add x10, x10, x19
? ? ? 0x00000001149a2290: add x22, x10, #0x1
;*ifge {reexecute=0 rethrow=0 return_oop=0}
? ? ? ;
-
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33
(line 100)
? ? ? ;; B20: # out( B2 B21 ) <- in( B23 B19 B16 ) Freq:
174760
0.78% ? ? ? 0x00000001149a2294: cmp x22, x7
? ? 0x00000001149a2298: b.ge 0x00000001149a219c // b.tcont
**pre-loop** (region has 21.80%):
;; B27: # out( B29 B28 ) <- in( B26 B28 ) Loop( B27-B28 inner
pre of N1402) Freq: 348842
0.10% ? 0x00000001149a2364: sxtw x22, w10
6.01% ? 0x00000001149a2368: strb wzr, [x16, x22, lsl #0]
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
? ; -
java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
? ; -
java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
? ; -
java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
? ; -
java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
(line 101)
0.08% ? 0x00000001149a236c: add w4, w10, #0x1
0.56% ? 0x00000001149a2370: cmp w4, w20
0.04% ?? 0x00000001149a2374: b.ge 0x00000001149a2380 //
b.tcont;*ifge {reexecute=0 rethrow=0 return_oop=0}
?? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33
(line 100)
?? ;; B28: # out( B27 ) <- in( B27 ) Freq: 174421
5.61% ?? 0x00000001149a2378: mov w10, w4
?? 0x00000001149a237c: b 0x00000001149a2364
followed by an extra add with a strange-looking percentage (profile
inaccuracy?):
7.88% ? 0x00000001149a2380: add w10, w10, #0x20
**Summary**:
pre-loop: 22%, byte-store
main-loop: 40%, 2x 16-byte-vector-store (the profile is a bit contradictory
here: is it 16% or 44%?)
vectorized post-loop: 4%, 1x 16-byte-vector-store (not super sure about the
profile, but it could be accurate)
scalar post-loop: 12%, byte-store
The numbers don't quite add up, but they are still somewhat telling, and I
think they are probably accurate enough to see what happens.
Basically: we waste a lot of time in the pre- and post-loops: first getting to
alignment and then finishing off the tail at the end.
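To make that split concrete, here is a rough iteration-count model of the four
loops seen above: a byte-wise pre-loop that runs until the address is aligned,
a main loop storing 32 bytes per iteration, a vectorized post-loop storing 16
bytes per iteration, and a byte-wise post-loop for what is left. The class
name, the `alignment` parameter and the example values are made up for
illustration, and this is only a sketch of the idea, not how C2 actually
computes its loop limits.

```java
/** Rough iteration-count model for a byte-fill loop split into pre/main/post loops. */
final class LoopSplitModel {

    /**
     * Returns {pre, main, vectorPost, scalarPost} iteration counts.
     *
     * Assumptions (not taken from C2): startAddress is non-negative, the
     * pre-loop runs byte-wise until the address is a multiple of 'alignment',
     * the main loop stores 32 bytes per iteration (2x 16-byte vectors), the
     * vectorized post-loop stores 16 bytes per iteration, and the scalar
     * post-loop stores 1 byte per iteration.
     */
    static long[] split(long startAddress, long length, int alignment) {
        long misalignment = startAddress % alignment;
        // Bytes needed to reach the next aligned address, capped by the length.
        long pre = Math.min(length, (alignment - misalignment) % alignment);
        long remaining = length - pre;

        long main = remaining / 32;       // 32 bytes per main-loop iteration
        remaining %= 32;

        long vectorPost = remaining / 16; // 16 bytes per vectorized post iteration
        long scalarPost = remaining % 16; // 1 byte per scalar post iteration

        return new long[] { pre, main, vectorPost, scalarPost };
    }
}
```

For example, `LoopSplitModel.split(3, 100, 16)` gives 13 pre, 2 main, 1
vectorized-post and 7 scalar-post iterations, so 20 of the 23 iterations are
single-byte stores. For small, randomly sized and randomly aligned segments
that is the kind of ratio that would explain the profile above.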
-------------------
And to compare:
58.00% c2, level 4
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava,
version 5, compile id 848
29.83% c2, level 4
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava,
version 5, compile id 848
We have 2 hot regions.
**main** (58%):
;; B40: # out( B40 B41 ) <- in( B39 B40 ) Loop( B40-B40 inner
main of N140 strip mined) Freq: 2.13696e+08
0.26% ? 0x000000011800f900: add x4, x1, w3, sxtw
? ;; merged str pair
? 0x000000011800f904: stp xzr, xzr, [x4]
? 0x000000011800f908: str xzr, [x4, #16]
;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
? ; -
jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
? 0x000000011800f90c: add w3, w3, #0x20 ;*iinc
{reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.foreign.SegmentBulkOperations::fill@136 (line 77)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
21.73% ? 0x000000011800f910: str xzr, [x4, #24]
;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
? ; -
jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
0.17% ? 0x000000011800f914: cmp w3, w2
2.58% ? 0x000000011800f918: b.lt 0x000000011800f900 //
b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; -
jdk.internal.foreign.SegmentBulkOperations::fill@98 (line 77)
; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
;; B41: # out( B39 B42 ) <- in( B40 ) Freq: 3.29583e+06
26.13% 0x000000011800f91c: ldr x2, [x28, #48] ;
ImmutableOopMap {r12=Oop r14=Oop c_rarg1=Derived_oop_r14 r15=Oop r16=Oop }
**Rest**:
vectorized post-loop
;; B2: # out( B2 B3 ) <- in( B42 B2 ) Loop( B2-B2
inner post of N1701) Freq: 50831.6
3.01% ? 0x000000011800f728: str xzr, [x1, w3, sxtw]
;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
? ; -
jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
? ; -
jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
? 0x000000011800f72c: add w3, w3, #0x8
;*iinc {reexecute=0 rethrow=0 return_oop=0}
? ; -
jdk.internal.foreign.SegmentBulkOperations::fill@136 (line 77)
? ; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
? ; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
? 0x000000011800f730: cmp w3, w10
? 0x000000011800f734: b.lt 0x000000011800f728 //
b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; -
jdk.internal.foreign.SegmentBulkOperations::fill@98 (line 77)
; -
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
; -
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
(line 83)
;; B3: # out( B5 B4 ) <- in( B2 B43 B44 ) top-of-loop
Freq: 51627.8
... and then the rest of the code I speculate is your **long-int-short-byte
wind-down code**.
-----------------------
**Conclusion:**
Java: spends about 58% in well-vectorized main-loop code (2x super-unrolled,
i.e. 2x 16-byte-vectors)
Loop: spends only about 40% in the main loop (also 2x 16-byte vectors); the
rest is spent in the pre- and post-loops
Hmm. This really makes me want to ditch the alignment-code - it may hurt more
than we gain from it :thinking:
And we should also consider such "wind-down" code: going from 16-element
vectors to 8, 4, 2, 1 elements. Of course that is extra code and extra compile
time...
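For reference, this is roughly what I mean by long-int-short-byte wind-down
code, written as a minimal sketch on a plain `byte[]`. Going by the stack
traces, the real `SegmentBulkOperations::fill` goes through
`ScopedMemoryAccess` and `Unsafe::putLongUnaligned`; here I substitute
`ByteBuffer` so the example is self-contained, and the class name is
hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch of a fill with 8-byte main stores and a 4/2/1-byte wind-down. */
final class WindDownFill {

    static void fill(byte[] array, byte value) {
        ByteBuffer buf = ByteBuffer.wrap(array).order(ByteOrder.nativeOrder());
        // Replicate the byte into all 8 lanes of a long, e.g. 0xAB -> 0xABABABABABABABABL.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;

        int i = 0;
        int limit = array.length & ~7; // largest multiple of 8 that fits
        for (; i < limit; i += 8) {
            buf.putLong(i, pattern);   // main loop: 8 bytes per store
        }

        // Wind-down: at most one int, one short and one byte store,
        // with no alignment pre-loop and no byte-wise tail loop.
        int remaining = array.length - i; // 0..7
        if ((remaining & 4) != 0) {
            buf.putInt(i, (int) pattern);
            i += 4;
        }
        if ((remaining & 2) != 0) {
            buf.putShort(i, (short) pattern);
            i += 2;
        }
        if ((remaining & 1) != 0) {
            buf.put(i, (byte) pattern);
        }
    }
}
```

The tail costs at most three stores regardless of the length, which is the
property we would want from compiler-generated wind-down code as well.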
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470102192