On Fri, 27 Aug 2021 13:12:33 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> 
wrote:

>> Mandy Chung has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Remove separate accessor for static vs instance method
>>   
>>   There is no effective difference when using MethodHandle::dropArgument for 
>> static method.   Removing Static*Accessor and Instance*Accessor simplifies 
>> the implementation.
>
> src/java.base/share/classes/jdk/internal/reflect/DirectConstructorAccessorImpl.java line 106:
> 
>> 104:     Object invokeImpl(Object[] args) throws Throwable {
>> 105:         var mhInvoker = mhInvoker();
>> 106:         return switch (paramCount) {
> 
> As @plevart observed, I'm also a bit surprised to see this, but I note your 
> comments regarding performance - especially in cold case. Every adaptation 
> counts, I guess, if you're not in the hot path. But let's make sure that the 
> pay off to add the extra hand-specialized cases is worth it - I'd be 
> surprised if spreading arguments is that expensive.

While I recall it was motivated primarily by startup (note the startup numbers 
for `-Djdk.reflect.useSpreader=true` in @mlchung's reply earlier in this PR), 
the specialization to avoid the spreader for low-arity arguments also improves 
performance on throughput microbenchmarks. Surprisingly, it also reduces the 
per-invocation allocation rate.
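
For readers following along, here is a minimal, self-contained sketch of the 
idea (not the PR's actual accessor code; the class and field names are made up 
for illustration): low arities pass the arguments positionally to the method 
handle, and only the default case pays for an `asSpreader`-adapted handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Illustrative sketch only - not the PR's DirectMethodHandleAccessor code.
public class AritySpecializationSketch {

    private final MethodHandle target;    // erased to an all-Object signature
    private final MethodHandle spreader;  // target adapted to take one Object[]
    private final int paramCount;

    AritySpecializationSketch(MethodHandle mh) {
        this.paramCount = mh.type().parameterCount();
        this.target = mh.asType(mh.type().generic());
        this.spreader = this.target.asSpreader(Object[].class, paramCount);
    }

    Object invoke(Object[] args) throws Throwable {
        // Low arities pass arguments positionally, skipping the Object[]
        // spreading; everything else falls back to the spreader handle.
        return switch (paramCount) {
            case 0 -> target.invokeExact();
            case 1 -> target.invokeExact(args[0]);
            case 2 -> target.invokeExact(args[0], args[1]);
            case 3 -> target.invokeExact(args[0], args[1], args[2]);
            default -> spreader.invokeExact(args);
        };
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle concat = MethodHandles.lookup().findVirtual(
                String.class, "concat", MethodType.methodType(String.class, String.class));
        var accessor = new AritySpecializationSketch(concat);
        System.out.println(accessor.invoke(new Object[] { "foo", "bar" }));  // foobar
    }
}
```

The low-arity arms avoid the spreader adaptation entirely, which appears to be 
where both the startup and the throughput difference in the numbers below comes 
from.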

Allocation profiling suggests that with the specialization a varargs array is 
entirely eliminated. This is of course somewhat fragile and dependent on a 
number of things - it might ultimately be an artifact of a rather synthetic 
microbenchmark and have little benefit in practice - but it's a rather 
consistent gain for various numbers of arguments, even when actually going into 
a spreader (I have some hypotheses as to why this might happen, but most likely 
the selector helps the JIT narrow things down for each kind of invocation 
scheme):

18 baseline
Benchmark                                                 Mode  Cnt    Score   Error   Units
ReflectionMethods.static_method                           avgt   10   57.329 ± 4.217   ns/op
ReflectionMethods.static_method:·gc.alloc.rate.norm       avgt   10   72.006 ± 0.001    B/op
ReflectionMethods.static_method_3arg                      avgt   10   70.940 ± 7.457   ns/op
ReflectionMethods.static_method_3arg:·gc.alloc.rate.norm  avgt   10   96.008 ± 0.002    B/op
ReflectionMethods.static_method_4arg                      avgt   10   90.453 ± 4.252   ns/op
ReflectionMethods.static_method_4arg:·gc.alloc.rate.norm  avgt   10  112.010 ± 0.001    B/op

pull/5027
Benchmark                                                 Mode  Cnt    Score   Error   Units
ReflectionMethods.static_method                           avgt   10   41.518 ± 2.444   ns/op
ReflectionMethods.static_method:·gc.alloc.rate.norm       avgt   10   48.004 ± 0.001    B/op
ReflectionMethods.static_method_3arg                      avgt   10   57.603 ± 3.299   ns/op
ReflectionMethods.static_method_3arg:·gc.alloc.rate.norm  avgt   10   64.006 ± 0.001    B/op
ReflectionMethods.static_method_4arg                      avgt   10   56.772 ± 3.971   ns/op
ReflectionMethods.static_method_4arg:·gc.alloc.rate.norm  avgt   10   80.007 ± 0.001    B/op

With a patch that removes the specialization on top of pull/5027, performance 
reverts to numbers very similar to the baseline:

remove-spreader-patch:
Benchmark                                                 Mode  Cnt    Score   Error   Units
ReflectionMethods.static_method                           avgt   10   56.644 ± 4.322   ns/op
ReflectionMethods.static_method:·gc.alloc.rate.norm       avgt   10   72.006 ± 0.001    B/op
ReflectionMethods.static_method_3arg                      avgt   10   72.353 ± 6.815   ns/op
ReflectionMethods.static_method_3arg:·gc.alloc.rate.norm  avgt   10   96.008 ± 0.001    B/op
ReflectionMethods.static_method_4arg                      avgt   10   82.820 ± 8.303   ns/op
ReflectionMethods.static_method_4arg:·gc.alloc.rate.norm  avgt   10  112.009 ± 0.002    B/op
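
For reference, a rough sketch of what a microbenchmark of this shape might look 
like (this is not the actual ReflectionMethods benchmark; the class name, target 
method and run settings are illustrative). The ·gc.alloc.rate.norm rows above 
come from JMH's GC profiler (`-prof gc`).

```java
import java.lang.reflect.Method;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

// Hypothetical sketch only - not the actual ReflectionMethods micro.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
public class ReflectionInvokeSketch {

    public static int sum3(int a, int b, int c) { return a + b + c; }

    Method static3;
    Object[] args3;

    @Setup
    public void setup() throws Exception {
        static3 = ReflectionInvokeSketch.class.getDeclaredMethod(
                "sum3", int.class, int.class, int.class);
        args3 = new Object[] { 1, 2, 3 };
    }

    @Benchmark
    public Object static_method_3arg() throws Exception {
        // Reflective call through the generated accessor; pull/5027's
        // specialization passes these three arguments without spreading.
        return static3.invoke(null, args3);
    }
}
```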


I think the cold-start reduction alone is worth the extra hundred lines or so of 
code, and if the gain on these microbenchmarks translates into real throughput 
gains then the added complexity is more than paid for. Simplifying the 
implementation while retaining similar characteristics would of course be great, 
but I think such an exploration should be done as a follow-up.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5027
