Re: RFR: 8343933: Add a MemorySegment::fill benchmark with varying sizes

Emanuel Peter Tue, 12 Nov 2024 02:08:26 -0800

On Mon, 11 Nov 2024 14:51:06 GMT, Per Minborg <[email protected]> wrote:


>> Thanks @minborg for this :) Please remember to add the misprediction count 
>> if you can and avoid the bulk methods by having a `nextMemorySegment()` 
>> benchmark method which make a single fill call site to observe the different 
>> segments (types).
>> 
>> Having separate call-sites which observe always the same type(s) "could" be 
>> too lucky (and gentle) for the runtime (and CHA) and would favour to have a 
>> single address entry (or few ones, if we include any optimization for the 
>> fill size) in the Branch Target Buffer of the cpu.
>
> I've added a "mixed" benchmark. I am not sure I understood all of your 
> comments but given my changes, maybe you could elaborate a bit more?

@minborg sent me some logs from his machine, and I'm analyzing them now.

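Basically, I'm trying to see why your Java code (`heapSegmentFillJava`, which 
goes through `MemorySegment::fill`) is a bit faster than the Loop code 
(`heapSegmentFillLoop`, which sets one byte per iteration).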

----------------

  44.77%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
  24.43%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
  21.80%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946

There seem to be 3 hot regions.

**main-loop** (region has 44.77%):

             ;; B33: #  out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10
   0.50%  ?   0x00000001149a23c0:   sxtw  x20, w4
          ?   0x00000001149a23c4:   add   x22, x16, x20
   0.02%  ?   0x00000001149a23c8:   str   q16, [x22]
  16.33%  ?   0x00000001149a23cc:   str   q16, [x22, #16]             ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
          ?                                                             ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
          ?                                                             ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
          ?                                                             ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
          ?                                                             ; - java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
          ?                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
          ?                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44 (line 101)
          ?   0x00000001149a23d0:   add   w4, w4, #0x20
   0.06%  ?   0x00000001149a23d4:   cmp   w4, w10
          ?   0x00000001149a23d8:   b.lt  0x00000001149a23c0  // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}


**post-loops**: the "vectorized post-loop" and the scalar ("single iteration") 
post-loop (region has 24.43%):

vectorized post-loop (inner post):

             ?   ? ;; B14: #    out( B14 B15 ) <- in( B35 B14 ) Loop( B14-B14 inner post of N1915) Freq: 174420
   2.20%     ??  ?  0x00000001149a224c:   sxtw  x5, w4
   0.88%     ??  ?  0x00000001149a2250:   str   q16, [x16, x5]              
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
             ??  ?                                                            ; 
- jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
             ??  ?                                                            ; 
- jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
             ??  ?                                                            ; 
- java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
             ??  ?                                                            ; 
- java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
             ??  ?                                                            ; 
- java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
             ??  ?                                                            ; 
- java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
             ??  ?                                                            ; 
- jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
             ??  ?                                                            ; 
- 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
 (line 101)
             ??  ?  0x00000001149a2254:   add   w4, w4, #0x10
             ??  ?  0x00000001149a2258:   cmp   w4, w10
             ??  ?  0x00000001149a225c:   b.lt  0x00000001149a224c  // 
b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
             ?   ?                                                            ; 
- 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33
 (line 100)
             ?   ? ;; B15: #    out( B16 ) <- in( B14 )  Freq: 87210.2
   0.34%     ?   ?  0x00000001149a2260:   add   x10, x19, x5
             ?   ?  0x00000001149a2264:   add   x22, x10, #0x10             
;*ladd {reexecute=0 rethrow=0 return_oop=0}
             ?   ?                                                            ; 
- 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@52
 (line 100)
             ?   ? ;; B16: #    out( B20 B17 ) <- in( B39 B15 B36 ) top-of-loop Freq: 174421
   0.78%     ?   ?  0x00000001149a2268:   cmp   w4, w3
             ? ? ?  0x00000001149a226c:   b.ge  0x00000001149a2294  // b.tcont
             ? ? ? ;; B17: #    out( B42 B18 ) <- in( B16 )  Freq: 87210.3
             ? ? ?  0x00000001149a2270:   cmp   w4, w2
             ? ? ?  0x00000001149a2274:   b.cs  0x00000001149a24a4  // b.hs, 
b.nlast
             ? ? ?                                                            
;*aload {reexecute=0 rethrow=0 return_oop=0}
             ? ? ?                                                            ; 
- 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@36
 (line 101)

scalar post loop:

             ? ? ? ;; B18: #    out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner post of N1402) Freq: 174420
   0.56%     ? ???  0x00000001149a2278:   sxtw  x10, w4
   5.47%     ? ???  0x00000001149a227c:   strb  wzr, [x16, x10, lsl #0]     
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
             ? ???                                                            ; 
- jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
             ? ???                                                            ; 
- jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
             ? ???                                                            ; 
- java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
             ? ???                                                            ; 
- java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
             ? ???                                                            ; 
- java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
             ? ???                                                            ; 
- java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
             ? ???                                                            ; 
- jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
             ? ???                                                            ; 
- 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
 (line 101)
             ? ???  0x00000001149a2280:   add   w4, w4, #0x1
             ? ???  0x00000001149a2284:   cmp   w4, w3
             ? ???  0x00000001149a2288:   b.lt  0x00000001149a2278  // b.tstop

Not sure why we have this below... probably the check that leads to the 
post-loop?

             ? ? ? ;; B19: #    out( B20 ) <- in( B18 )  Freq: 87210.2
   8.88%     ? ? ?  0x00000001149a228c:   add   x10, x10, x19
             ? ? ?  0x00000001149a2290:   add   x22, x10, #0x1              
;*ifge {reexecute=0 rethrow=0 return_oop=0}
             ? ? ?                                                            ; 
- 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33
 (line 100)
             ? ? ? ;; B20: #    out( B2 B21 ) <- in( B23 B19 B16 )  Freq: 174760
   0.78%     ? ? ?  0x00000001149a2294:   cmp   x22, x7
             ?   ?  0x00000001149a2298:   b.ge  0x00000001149a219c  // b.tcont


**pre-loop** (region has 21.80%):

             ;; B27: #  out( B29 B28 ) <- in( B26 B28 ) Loop( B27-B28 inner pre of N1402) Freq: 348842
   0.10%   ?  0x00000001149a2364:   sxtw        x22, w10
   6.01%   ?  0x00000001149a2368:   strb        wzr, [x16, x22, lsl #0]     
;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
           ?                                                            ; - 
jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
           ?                                                            ; - 
jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
           ?                                                            ; - 
java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
           ?                                                            ; - 
java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
           ?                                                            ; - 
java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
           ?                                                            ; - 
java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
           ?                                                            ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
           ?                                                            ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44
 (line 101)
   0.08%   ?  0x00000001149a236c:   add w4, w10, #0x1
   0.56%   ?  0x00000001149a2370:   cmp w4, w20
   0.04%  ??  0x00000001149a2374:   b.ge        0x00000001149a2380  // 
b.tcont;*ifge {reexecute=0 rethrow=0 return_oop=0}
          ??                                                            ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33
 (line 100)
          ?? ;; B28: #  out( B27 ) <- in( B27 )  Freq: 174421
   5.61%  ??  0x00000001149a2378:   mov w10, w4
          ??  0x00000001149a237c:   b   0x00000001149a2364

followed by an extra add that shows a suspiciously high percentage (profile 
inaccuracy?):

   7.88%  ?   0x00000001149a2380:   add w10, w10, #0x20


**Summary**:

pre-loop:             22%  byte-store
main-loop:            40%  2x 16-byte-vector-store (profiling is a bit contradictory here - is it 16% or 44%?)
vectorized post-loop:  4%  1x 16-byte-vector-store (not super sure about profiling, but could be accurate)
scalar post-loop:     12%  byte-store

The numbers don't quite add up - but they are still somewhat telling - and I 
think probably accurate enough to see what happens.

Basically: we waste a lot of time in the pre and post-loop: getting alignment 
and then finishing off at the end.
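
To make the pre/main/post split concrete, here is a rough model of how a byte-fill range could be divided between the four loops seen above. This is only a sketch: the class `LoopSplitModel`, the method `split`, and the 16-byte alignment target are assumptions for illustration, not C2's actual logic. The step sizes of 32 and 16 bytes are taken from the `add w4, w4, #0x20` and `add w4, w4, #0x10` increments in the listings.

```java
final class LoopSplitModel {
    // Assumed alignment target of the pre-loop (hypothetical, for illustration).
    static final long ALIGN = 16;
    // Main loop: 2x 16-byte vector stores per iteration.
    static final long MAIN_STEP = 32;
    // Vectorized post-loop: 1x 16-byte vector store per iteration.
    static final long VEC_STEP = 16;

    /** Returns the bytes handled by each stage: {pre, main, vectorPost, scalarPost}. */
    static long[] split(long startAddress, long length) {
        long misalign = Math.floorMod(startAddress, ALIGN);
        // Pre-loop: single-byte stores until the address is aligned.
        long pre = Math.min(length, (ALIGN - misalign) % ALIGN);
        long rest = length - pre;
        // Main loop: as many full 32-byte iterations as fit.
        long main = (rest / MAIN_STEP) * MAIN_STEP;
        rest -= main;
        // Vectorized post-loop: remaining full 16-byte chunks.
        long vectorPost = (rest / VEC_STEP) * VEC_STEP;
        // Scalar post-loop: whatever is left, one byte at a time.
        long scalarPost = rest - vectorPost;
        return new long[] {pre, main, vectorPost, scalarPost};
    }

    public static void main(String[] args) {
        long[] s = split(0x1003, 100);
        System.out.printf("pre=%d main=%d vectorPost=%d scalarPost=%d%n",
                s[0], s[1], s[2], s[3]);
    }
}
```

Under these assumptions, a 100-byte fill starting 3 bytes past a 16-byte boundary sends 13 bytes through the pre-loop and 7 through the scalar post-loop, so 20 single-byte stores sit next to only two main-loop iterations and one vector store.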

-------------------

And to compare:

  58.00%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848
  29.83%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848

We have 2 hot regions.

**main** (58%):

             ;; B40: #  out( B40 B41 ) <- in( B39 B40 ) Loop( B40-B40 inner main of N140 strip mined) Freq: 2.13696e+08
   0.26%  ?   0x000000011800f900:   add x4, x1, w3, sxtw
          ?  ;; merged str pair
          ?   0x000000011800f904:   stp xzr, xzr, [x4]
          ?   0x000000011800f908:   str xzr, [x4, #16]              
;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - 
jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
          ?                                                             ; - 
jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
          ?                                                             ; - 
jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
          ?                                                             ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
          ?                                                             ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                             ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
          ?   0x000000011800f90c:   add w3, w3, #0x20               ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@136 (line 77)
          ?                                                             ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                             ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
  21.73%  ?   0x000000011800f910:   str xzr, [x4, #24]              
;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - 
jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
          ?                                                             ; - 
jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
          ?                                                             ; - 
jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
          ?                                                             ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
          ?                                                             ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                             ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
   0.17%  ?   0x000000011800f914:   cmp w3, w2
   2.58%  ?   0x000000011800f918:   b.lt        0x000000011800f900  // 
b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                        ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@98 (line 77)
                                                                        ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                        ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
             ;; B41: #  out( B39 B42 ) <- in( B40 )  Freq: 3.29583e+06
  26.13%      0x000000011800f91c:   ldr x2, [x28, #48]              ; 
ImmutableOopMap {r12=Oop r14=Oop c_rarg1=Derived_oop_r14 r15=Oop r16=Oop }

**Rest**:
vectorized post-loop

                ;; B2: #        out( B2 B3 ) <- in( B42 B2 ) Loop( B2-B2 inner post of N1701) Freq: 50831.6
   3.01%  ?      0x000000011800f728:   str      xzr, [x1, w3, sxtw]         
;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
          ?                                                                ; - 
jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
          ?                                                                ; - 
jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
          ?                                                                ; - 
jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
          ?                                                                ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
          ?                                                                ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                                ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
          ?      0x000000011800f72c:   add      w3, w3, #0x8                
;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ?                                                                ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@136 (line 77)
          ?                                                                ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                                ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
          ?      0x000000011800f730:   cmp      w3, w10
          ?      0x000000011800f734:   b.lt     0x000000011800f728  // 
b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                           ; - 
jdk.internal.foreign.SegmentBulkOperations::fill@98 (line 77)
                                                                           ; - 
jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                           ; - 
org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14
 (line 83)
                ;; B3: #        out( B5 B4 ) <- in( B2 B43 B44 ) top-of-loop Freq: 51627.8

... and then the rest of the code I speculate is your **long-int-short-byte 
wind-down code**.
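
For readers unfamiliar with the pattern, a "long-int-short-byte wind-down" can be sketched as below. This is not the actual `SegmentBulkOperations::fill` source; the class `WindDownFill` and its use of `ByteBuffer` are assumptions chosen to keep the example self-contained. It only illustrates the technique: 8-byte stores in the loop, then at most one 4-byte, one 2-byte and one 1-byte store for the tail.

```java
import java.nio.ByteBuffer;

final class WindDownFill {
    /** Fills the whole buffer with value, using the widest store that still fits. */
    static void fill(ByteBuffer buf, byte value) {
        // Broadcast the byte into all 8 byte lanes of a long.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;
        int len = buf.capacity();
        int i = 0;
        // Main loop: one 8-byte store per iteration.
        for (; i + Long.BYTES <= len; i += Long.BYTES) {
            buf.putLong(i, pattern);
        }
        // Wind-down: the remaining 0..7 bytes need at most three stores.
        int remaining = len - i;
        if (remaining >= Integer.BYTES) {
            buf.putInt(i, (int) pattern);
            i += Integer.BYTES;
            remaining -= Integer.BYTES;
        }
        if (remaining >= Short.BYTES) {
            buf.putShort(i, (short) pattern);
            i += Short.BYTES;
            remaining -= Short.BYTES;
        }
        if (remaining >= 1) {
            buf.put(i, value);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(13); // 8 + 4 + 1 bytes
        fill(buf, (byte) 0x2A);
        for (int i = 0; i < buf.capacity(); i++) {
            if (buf.get(i) != 0x2A) {
                throw new AssertionError("byte " + i + " not filled");
            }
        }
        System.out.println("filled " + buf.capacity() + " bytes");
    }
}
```

Byte order does not matter here because every byte of the pattern is identical.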

-----------------------

**Conclusion:**

Java: spends about 58% in a well-unrolled main loop (32 bytes per iteration, 
i.e. the same width as 2x 16-byte-vectors)
Loop: only spends about 40% in the main loop (also 2x 16-byte vectors) - the 
rest is spent in pre/post-loops


Hmm. This really makes me want to ditch the alignment-code - it may hurt more 
than we gain from it :thinking: 
And we should also consider such "wind-down" code: going from 16-element 
vectors to 8, 4, 2, 1 elements. Of course that is extra code and extra compile 
time...
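
As a back-of-the-envelope comparison, the sketch below counts the stores needed for a tail of fewer than 32 bytes, either with a scalar byte loop or with a wind-down over 16, 8, 4, 2 and 1-byte stores. The class `TailStoreCount` and both method names are hypothetical. It counts stores only and says nothing about real latency, code size or compile time.

```java
final class TailStoreCount {
    /** Scalar tail: one store per remaining byte. */
    static int scalarStores(int remainder) {
        return remainder;
    }

    /** Wind-down tail: one store per set bit, i.e. per power-of-two chunk (16, 8, 4, 2, 1). */
    static int windDownStores(int remainder) {
        return Integer.bitCount(remainder);
    }

    public static void main(String[] args) {
        for (int r : new int[] {1, 7, 15, 31}) {
            System.out.printf("remainder=%2d scalar=%2d windDown=%d%n",
                    r, scalarStores(r), windDownStores(r));
        }
    }
}
```

For the worst case of 31 leftover bytes, the wind-down needs 5 stores where the byte loop needs 31.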

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470102192
