On Thu, 28 Oct 2021 08:58:48 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:

>> `Unsafe.storeStoreFence` currently delegates to stronger 
>> `Unsafe.storeFence`. We can teach compilers to map this directly to already 
>> existing rules that handle `MemBarStoreStore`. Like explicit 
>> `LoadFence`/`StoreFence`, we introduce the special node to differentiate 
>> explicit fence and implicit store-store barriers. `storeStoreFence` is 
>> usually used to simulate safe `final`-field like constructions in special 
>> JDK classes, like `ConstantCallSite` and friends.
>> 
>> Motivational performance difference on benchmarks from JDK-8276054 on ARM32 
>> (Raspberry Pi 4):
>> 
>> 
>> Benchmark                      Mode  Cnt   Score    Error  Units
>> Multiple.plain                 avgt    3   2.669 ±  0.004  ns/op
>> Multiple.release               avgt    3  16.688 ±  0.057  ns/op
>> Multiple.storeStore            avgt    3  14.021 ±  0.144  ns/op // Better
>> 
>> MultipleWithLoads.plain        avgt    3   4.672 ±  0.053  ns/op
>> MultipleWithLoads.release      avgt    3  16.689 ±  0.044  ns/op
>> MultipleWithLoads.storeStore   avgt    3  14.012 ±  0.010  ns/op // Better
>> 
>> MultipleWithStores.plain       avgt    3  14.687 ±  0.009  ns/op
>> MultipleWithStores.release     avgt    3  45.393 ±  0.192  ns/op
>> MultipleWithStores.storeStore  avgt    3  38.048 ±  0.033  ns/op // Better
>> 
>> Publishing.plain               avgt    3  27.079 ±  0.201  ns/op
>> Publishing.release             avgt    3  27.088 ±  0.241  ns/op
>> Publishing.storeStore          avgt    3  27.009 ±  0.259  ns/op // Within 
>> error, hidden by allocation
>> 
>> Single.plain                   avgt    3   2.670 ± 0.002  ns/op
>> Single.releaseFence            avgt    3   6.675 ± 0.001  ns/op
>> Single.storeStoreFence         avgt    3   8.012 ± 0.027  ns/op  // Worse, 
>> seems to be ARM32 implementation artifact
>> 
>> 
>> The same thing on AArch64 (Raspberry Pi 3):
>> 
>> 
>> Benchmark                      Mode  Cnt   Score   Error  Units
>> 
>> Multiple.plain                 avgt    3   5.914 ± 0.115  ns/op
>> Multiple.release               avgt    3  10.149 ± 0.059  ns/op
>> Multiple.storeStore            avgt    3   6.757 ± 0.138  ns/op // Better
>> 
>> MultipleWithLoads.plain        avgt    3  11.849 ± 0.331  ns/op
>> MultipleWithLoads.release      avgt    3  35.565 ± 1.144  ns/op
>> MultipleWithLoads.storeStore   avgt    3  19.441 ± 0.471  ns/op // Better
>> 
>> MultipleWithStores.plain       avgt    3   5.920 ± 0.213  ns/op
>> MultipleWithStores.release     avgt    3  20.286 ± 0.347  ns/op
>> MultipleWithStores.storeStore  avgt    3  12.686 ± 0.230  ns/op // Better
>> 
>> Publishing.plain               avgt    3  22.261 ± 1.630  ns/op
>> Publishing.release             avgt    3  22.269 ± 0.576  ns/op
>> Publishing.storeStore          avgt    3  17.464 ± 0.397  ns/op // Better
>> 
>> Single.plain                   avgt    3   5.916 ± 0.063  ns/op
>> Single.release                 avgt    3  10.148 ± 0.401  ns/op
>> Single.storeStore              avgt    3   6.767 ± 0.164  ns/op // Better
>> 
>> 
>> As expected, this does not affect x86_64 at all, because both `release` and 
>> `storeStore` are effectively no-ops, only affecting compiler optimizations:
>> 
>> 
>> Benchmark                      Mode  Cnt  Score   Error  Units
>> 
>> Multiple.plain                 avgt    3  0.406 ± 0.002  ns/op
>> Multiple.release               avgt    3  0.409 ± 0.018  ns/op
>> Multiple.storeStore            avgt    3  0.406 ± 0.001  ns/op
>> 
>> MultipleWithLoads.plain        avgt    3  4.328 ± 0.006  ns/op
>> MultipleWithLoads.release      avgt    3  4.600 ± 0.014  ns/op
>> MultipleWithLoads.storeStore   avgt    3  4.602 ± 0.006  ns/op
>> 
>> MultipleWithStores.plain       avgt    3  0.812 ± 0.001  ns/op
>> MultipleWithStores.release     avgt    3  0.812 ± 0.002  ns/op
>> MultipleWithStores.storeStore  avgt    3  0.812 ± 0.002  ns/op
>> 
>> Publishing.plain               avgt    3  6.370 ± 0.059  ns/op
>> Publishing.release             avgt    3  6.358 ± 0.436  ns/op
>> Publishing.storeStore          avgt    3  6.367 ± 0.054  ns/op
>> 
>> Single.plain                   avgt    3  0.407 ± 0.039  ns/op
>> Single.releaseFence            avgt    3  0.406 ± 0.001  ns/op
>> Single.storeStoreFence         avgt    3  0.406 ± 0.001  ns/op
>> 
>> 
>> Additional testing:
>>  - [x] Linux x86_64 fastdebug `tier1`
>
> Aleksey Shipilev has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Fix the comment to match JDK-8276096

Finally revived my quiet AArch64 dev board, added AArch64 results, which are 
even better than ARM32. Updated PR with perf results.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6136

Reply via email to