This patch optimizes vector rotate (immediate) on AArch64 using the shift and 
insert instructions SLI and SRI.
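
To illustrate the idea, here is a minimal scalar sketch of how a rotate by an 
immediate can be built from one plain shift plus one shift-and-insert. It 
models a single 32-bit lane in Java and is not the code from the patch; the 
class and method names (`ShiftInsertRotate`, `sli`, `sri`, `rotlViaSli`, 
`rotrViaSri`) are hypothetical, and the shift amount is assumed to be in the 
range 1..31.

```java
public class ShiftInsertRotate {
    // Models SLI on one 32-bit lane: src is shifted left by `shift` and
    // inserted into dst, while the low `shift` bits of dst are preserved.
    static int sli(int dst, int src, int shift) {
        int keepMask = (1 << shift) - 1;      // low `shift` bits of dst survive
        return (dst & keepMask) | (src << shift);
    }

    // Models SRI on one 32-bit lane: src is shifted right (logically) by
    // `shift` and inserted into dst, while the top `shift` bits of dst
    // are preserved.
    static int sri(int dst, int src, int shift) {
        int keepMask = ~(-1 >>> shift);       // top `shift` bits of dst survive
        return (dst & keepMask) | (src >>> shift);
    }

    // Rotate left by n: a logical right shift followed by a shift-left-insert.
    static int rotlViaSli(int x, int n) {
        int tmp = x >>> (32 - n);             // plain shift (USHR-like step)
        return sli(tmp, x, n);                // insert step (SLI-like step)
    }

    // Rotate right by n: a left shift followed by a shift-right-insert.
    static int rotrViaSri(int x, int n) {
        int tmp = x << (32 - n);              // plain shift (SHL-like step)
        return sri(tmp, x, n);                // insert step (SRI-like step)
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        // Both lines should print "true" if the sketch matches the JDK rotates.
        System.out.println(rotlViaSli(x, 20) == Integer.rotateLeft(x, 20));
        System.out.println(rotrViaSri(x, 20) == Integer.rotateRight(x, 20));
    }
}
```

The point of the sketch is that the insert step does the work of a shift and 
an OR at once, so a rotate needs two operations per lane instead of three 
(shift, shift, or).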

The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
The tests under `test/hotspot/jtreg/compiler/c2/cr6340864/` were also run 
specifically to check correctness, and they passed.

The JMH micro-benchmark `test/micro/org/openjdk/bench/java/lang/RotateBenchmark.java` 
is used for performance testing.
It shows a ~15.4% performance improvement on Kunpeng920 (CPU tsv110), but a 
~15.8% regression on Kunpeng916 (CPU A72).
A switch, `UseSIMDShiftInsertForRotation`, is therefore introduced on AArch64. 
It defaults to `false` and is set to `true` for Kunpeng920.

The `RotateBenchmark.java` JMH micro-benchmark results on Kunpeng920:

Benchmark                            (SHIFT)  (TESTSIZE)   Mode  Cnt     Score    Error   Units

# kunpeng 920, -XX:-UseSIMDShiftInsertForRotation
RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3524.840 ±  2.365  ops/ms
RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  3961.288 ±  0.897  ops/ms
RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1704.321 ± 11.309  ops/ms
RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2137.924 ±  2.215  ops/ms
RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3536.960 ±  7.945  ops/ms
RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  3961.552 ±  0.673  ops/ms
RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1729.868 ±  0.468  ops/ms
RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2132.458 ±  3.385  ops/ms

# kunpeng 920, default, -XX:+UseSIMDShiftInsertForRotation
RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3504.602 ± 21.609  ops/ms
RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  4569.820 ±  7.455  ops/ms
RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1730.735 ±  0.701  ops/ms
RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2469.796 ±  0.981  ops/ms
RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3540.899 ±  7.679  ops/ms
RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  4571.876 ±  0.879  ops/ms
RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1731.499 ±  0.877  ops/ms
RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2469.454 ±  0.705  ops/ms
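
For reference, the relative gains of the four `*Imm` cases can be recomputed 
from the scores above. The following is a small sketch for doing so; the class 
name `RotateSpeedup` and the helper `improvementPercent` are hypothetical and 
not part of the patch or the benchmark, and the numbers are copied from the 
two result tables.

```java
public class RotateSpeedup {
    // Relative throughput change in percent; positive means faster.
    static double improvementPercent(double baseline, double optimized) {
        return (optimized - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {
            "testRotateLeftIImm", "testRotateLeftLImm",
            "testRotateRightIImm", "testRotateRightLImm"
        };
        // Scores in ops/ms with -XX:-UseSIMDShiftInsertForRotation.
        double[] off = {3961.288, 2137.924, 3961.552, 2132.458};
        // Scores in ops/ms with -XX:+UseSIMDShiftInsertForRotation.
        double[] on = {4569.820, 2469.796, 4571.876, 2469.454};

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-20s %.1f%%%n",
                    names[i], improvementPercent(off[i], on[i]));
        }
    }
}
```

Only the immediate variants are listed because the non-immediate cases are 
not targeted by this change, and their scores stay close to each other in 
both runs.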

This patch also moves all logical and shift NEON instructions from `aarch64.ad` 
to `aarch64_neon.ad`, and makes two minor improvements: `vsraa8B_imm` and 
`vsrla8B_imm` now support vector length 4, and `vsraa4S_imm` and `vsrla4S_imm` 
now support vector length 2.

-------------

Commit messages:
 - 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions

Changes: https://git.openjdk.java.net/jdk/pull/1761/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1761&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8256820
  Stats: 2899 lines in 9 files changed: 1561 ins; 1014 del; 324 mod
  Patch: https://git.openjdk.java.net/jdk/pull/1761.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/1761/head:pull/1761

PR: https://git.openjdk.java.net/jdk/pull/1761
