**MOTIVATION**

This is a large refactoring patch that merges the matching rules in
aarch64_sve.ad and aarch64_neon.ad. The motivation is also described in
[1].

Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
and NEON codegen respectively. 1) For SVE rules we use the vReg operand
to match VecA for a vector type of arbitrary length, when SVE is
enabled; 2) For NEON rules we use the vecX/vecD operands to match
VecX/VecD for 128-bit/64-bit vectors, when SVE is not enabled.

This separation looked clean at the time of introducing SVE support.
However, there are two main drawbacks now.

**Drawback-1**: NEON (Advanced SIMD) is a mandatory feature on AArch64 and
SVE vector registers share the lower 128 bits with NEON registers. In
some cases, even when SVE is enabled, we still prefer to match NEON
rules and emit NEON instructions.

**Drawback-2**: As more and more vector rules are added to support
VectorAPI, both ad files contain many rules that differ only in their
predicate conditions, e.g., different values of UseSVE or vector
type/size.

Examples can be found in [1]. These two drawbacks make the code less
maintainable and increase the libjvm.so code size.

**KEY UPDATES**

In this patch, we mainly do two things: use a generic vReg to match all
NEON/SVE vector registers, and merge the NEON/SVE matching rules.

- Update-1: Use generic vReg to match all NEON/SVE vector registers

Two different approaches were considered. We chose the generic vector
solution while keeping the VecA operand for all >128-bit vectors. See
the last slide in [1]. All the changes lie in the AArch64 backend.

1) Some helpers are updated in aarch64.ad to enable generic vector on
AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
is_reg2reg_move() and is_generic_vector().

2) The vecA operand is created to match the VecA register, and vReg is
updated to match VecA/D/X registers dynamically.

With the introduction of the generic vReg, the difference in register
types between NEON rules and SVE rules is eliminated, which makes it
easy to merge these rules.
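
The register-type selection described above can be sketched roughly as
follows. This is a simplified standalone model, not the HotSpot code:
`IdealReg` and `ideal_reg_for` are hypothetical names, and mapping
vectors shorter than 64 bits to VecD is an assumption made for the
sketch.

```cpp
// Hypothetical stand-in for the ideal vector register kinds named in the
// text (VecD = 64-bit, VecX = 128-bit, VecA = scalable, >128-bit).
enum class IdealReg { VecD, VecX, VecA };

// Pick the register kind for a vector of the given length in bytes.
// Assumption: lengths above 16 bytes only occur when SVE is enabled,
// and lengths below 8 bytes are treated like 64-bit vectors.
constexpr IdealReg ideal_reg_for(int length_in_bytes) {
  return length_in_bytes > 16  ? IdealReg::VecA
       : length_in_bytes == 16 ? IdealReg::VecX
       :                         IdealReg::VecD;
}
```

With a mapping like this, a single generic operand can stand for any of
the three register kinds, and the concrete kind is resolved per node.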

- Update-2: Try to merge existing rules

As updated in GensrcAdlc.gmk, a new ad file, aarch64_vector.ad, is
introduced to hold the grouped and merged matching rules.

1) Similar rules that differ only in vector type/size are merged into
new rules, where the different types and vector sizes are handled in
the codegen part, e.g., vadd(). This resolves **Drawback-2**.

2) In most cases, we prefer to emit NEON instructions for 128-bit vector
operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**.

It's important to note that there are some exceptions.

Exception-1: For some rules, there is no direct NEON instruction, but a
simple SVE implementation exists thanks to the newly added SVE ISA. Such
rules include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.

Exception-2: Vector mask generation and operation rules are different
because vector masks are stored in different types of registers on NEON
and SVE, e.g., the vmaskcmp_neon and vmask_truecount_neon rules.

Exception-3: Shift right related rules are different because vector
shift right instructions differ a bit between NEON and SVE.

For these exceptions, we emit NEON or SVE code based simply on the
UseSVE option.
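
As a rough illustration of the two selection policies described above
(vector length for merged rules, UseSVE for the exceptions), consider
the sketch below. It is a standalone model under stated assumptions:
`Isa`, `select_isa_merged` and `select_isa_exception` are hypothetical
names, not functions from the patch.

```cpp
// Which instruction set a rule's codegen would emit.
enum class Isa { NEON, SVE };

// Common case for merged rules such as vadd(): vectors of at most
// 128 bits (16 bytes) use NEON even on SVE platforms. Longer vectors
// can only exist when SVE is enabled, so they use SVE.
constexpr Isa select_isa_merged(int length_in_bytes) {
  return length_in_bytes <= 16 ? Isa::NEON : Isa::SVE;
}

// Exception rules: the choice depends only on whether SVE is enabled
// (modeled here as an integer option, assumed > 0 when SVE is on).
constexpr Isa select_isa_exception(int use_sve) {
  return use_sve > 0 ? Isa::SVE : Isa::NEON;
}
```

For example, a 16-byte add on an SVE machine would take the NEON path
under the first policy, while one of the exception rules on the same
machine would take the SVE path.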

**MINOR UPDATES and CODE REFACTORING**

Since merging the rules touched every line of this code, we also made
some minor updates and did some refactoring.

- Reduce regmask bits

Stack slot alignment is handled specially for scalable vectors: a slot
is first aligned to SlotsPerVecA and then to the real vector length. We
should guarantee that SlotsPerVecA is no bigger than the real vector
length. Otherwise, unused stack space would be allocated.

In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
SlotsPerVecA is currently set to 8, i.e. 8 * 32 = 256 bits. As a result,
on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
leaving half of it (128 bits) unused. In this patch, we reduce
SlotsPerVecA from 8 to 4.

See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
(chunk1 and vectora_reg).
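
The arithmetic behind the SlotsPerVecA change can be checked with a
small model. This is only a sketch: it assumes 32-bit stack slots (as
implied by "8 * 32 = 256 bits") and treats the reserved space as the
vector length rounded up to SlotsPerVecA slots. The names
`reserved_bits` and `wasted_bits` are hypothetical.

```cpp
// Width of one stack slot in bits, as implied by "8 * 32 = 256 bits".
constexpr int kBitsPerSlot = 32;

// Round value up to the next multiple of alignment (alignment > 0).
constexpr int align_up(int value, int alignment) {
  return (value + alignment - 1) / alignment * alignment;
}

// Simplified model: bits reserved on the stack for one scalable vector
// when the slot is aligned to slots_per_vec_a slots.
constexpr int reserved_bits(int slots_per_vec_a, int vector_bits) {
  return align_up(vector_bits, slots_per_vec_a * kBitsPerSlot);
}

// Bits reserved but never used by the vector itself.
constexpr int wasted_bits(int slots_per_vec_a, int vector_bits) {
  return reserved_bits(slots_per_vec_a, vector_bits) - vector_bits;
}
```

In this model, `wasted_bits(8, 128)` is 128 while `wasted_bits(4, 128)`
is 0, which matches the waste described for a 128-bit SVE platform.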

- Refactor NEON/SVE vector op support check.

Merge the NEON and SVE vector support checks into one single function.
To be consistent, the SVE default size check is now relaxed from no
less than 64 bits to the same condition as NEON's min_vector_size(),
i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
as we assume that, with the unified rules, we will at least emit NEON
code for those small vectors.

- Some notes for new rules

1) Since the new rules are unique, it makes no sense to set different
"ins_cost" values, so we use the default cost.

2) By default we now use "pipe_slow" for matching rules in
aarch64_vector.ad. Hence, many SIMD pipeline classes in aarch64.ad
become unused and can be removed.

3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended to the
matching rule names where needed.
a) 'le128b' means the vector length is less than or equal to 128 bits.
This rule can be matched on both NEON and 128-bit SVE.
b) 'gt128b' means the vector length is greater than 128 bits. This rule
can only be matched on SVE.
c) 'neon' means this rule can only be matched on NEON, i.e. the
generated NEON code is no better than its 128-bit SVE counterpart.
d) 'sve' means this rule is only matched on SVE, for all possible
vector lengths, i.e. not limited to gt128b.
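
The suffix convention can be summarized as a small predicate. The sketch
below is only a reading of the four definitions above, not code from the
patch: `Suffix` and `can_match` are hypothetical names, and SVE
availability is modeled as a plain boolean.

```cpp
// The four rule-name suffixes described in the text.
enum class Suffix { le128b, gt128b, neon, sve };

// Whether a rule with the given suffix may match, for a platform with
// or without SVE enabled and a vector of vector_bits bits.
constexpr bool can_match(Suffix s, bool sve_enabled, int vector_bits) {
  return s == Suffix::le128b ? vector_bits <= 128                   // NEON or SVE
       : s == Suffix::gt128b ? (sve_enabled && vector_bits > 128)   // SVE only
       : s == Suffix::neon   ? !sve_enabled                         // NEON only
       :                       sve_enabled;                         // sve: any length
}
```

For instance, a '_le128b' rule matches a 128-bit vector whether or not
SVE is enabled, while a '_neon' rule matches it only when SVE is off.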

Note-1: No m4 file is introduced because the duplication is now greatly
reduced.
Note-2: We expect the code review for this big patch to take some time,
and we may need to merge the latest code from the master branch from
time to time. We prefer to keep aarch64_neon/sve.ad and the
corresponding m4 files for easy comparison and review. They will be
removed before integration, once the patch has been reviewed thoroughly.
Note-3: Several other minor refactorings are done in this patch, but we
cannot list all of them in the commit message. We have reviewed and
tested the rules carefully to guarantee the quality.

**TESTING**

1) Cross compilations on arm32/s390/ppc/riscv passed.
2) tier1~3 jtreg passed on both x64 and aarch64 machines.
3) vector tests: all test cases under the following directories pass on
both NEON and SVE systems with max vector length 16/32/64 bytes.

  "test/hotspot/jtreg/compiler/vectorapi/"
  "test/jdk/jdk/incubator/vector/"
  "test/hotspot/jtreg/compiler/vectorization/"

4) Performance evaluation: we chose vector micro-benchmarks from
panama-vector:vectorIntrinsics [2] to evaluate the performance of this
patch. We've tested the *MaxVectorTests.java cases on one 128-bit SVE
platform and one NEON platform, and didn't see any visible regression
with NEON or SVE. We will continue to verify more cases on other
platforms with NEON and different SVE vector sizes.

**BENEFITS**

The number of matching rules is reduced to ~ **42%**.
  before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
  after:  313 (aarch64_vector.ad)

Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
  before: 25246528 B (commit 7905788e969)
  after:  24208776 B (**nearly 1 MB reduction**)

[1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
[2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation

Co-Developed-by: Ningsheng Jian <ningsheng.j...@arm.com>
Co-Developed-by: Eric Liu <eric.c....@arm.com>

-------------

Commit messages:
 - 8285790: AArch64: Merge C2 NEON and SVE matching rules

Changes: https://git.openjdk.org/jdk/pull/9346/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8285790
  Stats: 7151 lines in 12 files changed: 6454 ins; 576 del; 121 mod
  Patch: https://git.openjdk.org/jdk/pull/9346.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346

PR: https://git.openjdk.org/jdk/pull/9346
