**MOTIVATION** This is a big refactoring patch of merging rules in aarch64_sve.ad and aarch64_neon.ad. The motivation can also be found at [1].
Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE and NEON codegen respectively. 1) For SVE rules we use vReg operand to match VecA for an arbitrary length of vector type, when SVE is enabled; 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for 128-bit/64-bit vectors, when SVE is not enabled. This separation looked clean at the time of introducing SVE support. However, there are two main drawbacks now. **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and SVE vector registers share the lower 128 bits with NEON registers. For some cases, even when SVE is enabled, we still prefer to match NEON rules and emit NEON instructions. **Drawback-2**: With more and more vector rules added to support VectorAPI, there are lots of rules in both two ad files with different predication conditions, e.g., different values of UseSVE or vector type/size. Examples can be found in [1]. These two drawbacks make the code less maintainable and increase the libjvm.so code size. **KEY UPDATES** In this patch, we mainly do two things, using generic vReg to match all NEON/SVE vector registers and merging NEON/SVE matching rules. - Update-1: Use generic vReg to match all NEON/SVE vector registers Two different approaches were considered, and we prefer to use generic vector solution but keep VecA operand for all >128-bit vectors. See the last slide in [1]. All the changes lie in the AArch64 backend. 1) Some helpers are updated in aarch64.ad to enable generic vector on AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), is_reg2reg_move() and is_generic_vector(). 2) Operand vecA is created to match VecA register, and vReg is updated to match VecA/D/X registers dynamically. With the introduction of generic vReg, difference in register types between NEON rules and SVE rules can be eliminated, which makes it easy to merge these rules. - Update-2: Try to merge existing rules As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is introduced to hold the grouped and merged matching rules. 1) Similar rules with difference in vector type/size can be merged into new rules, where different types and vector sizes are handled in the codegen part, e.g., vadd(). This resolves **Drawback-2**. 2) In most cases, we tend to emit NEON instructions for 128-bit vector operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. It's important to note that there are some exceptions. Exception-1: For some rules, there are no direct NEON instructions, but exists simple SVE implementation due to newly added SVE ISA. Such rules include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. Exception-2: Vector mask generation and operation rules are different because vector mask is stored in different types of registers between NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. Exception-3: Shift right related rules are different because vector shift right instructions differ a bit between NEON and SVE. For these exceptions, we emit NEON or SVE code simply based on UseSVE options. **MINOR UPDATES and CODE REFACTORING** Since we've touched all lines of code during merging rules, we further do more minor updates and refactoring. - Reduce regmask bits Stack slot alignment is handled specially for scalable vector, which will firstly align to SlotsPerVecA, and then align to the real vector length. We should guarantee SlotsPerVecA is no bigger than the real vector length. Otherwise, unused stack space would be allocated. In AArch64 SVE, the vector length can be 128 to 2048 bits. However, SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, on a 128-bit SVE platform, the stack slot is aligned to 256 bits, leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA from 8 to 4. See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad (chunk1 and vectora_reg). - Refactor NEON/SVE vector op support check. Merge NEON and SVE vector supported check into one single function. To be consistent, SVE default size supported check now is relaxed from no less than 64 bits to the same condition as NEON's min_vector_size(), i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, as we assume at least we will emit NEON code for those small vectors, with unified rules. - Some notes for new rules 1) Since new rules are unique and it makes no sense to set different "ins_cost", we turn to use the default cost. 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad now. Hence, many SIMD pipeline classes at aarch64.ad become unused and can be removed. 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the matching rule names if needed. a) 'le128b' means the vector length is less than or equal to 128 bits. This rule can be matched on both NEON and 128-bit SVE. b) 'gt128b' means the vector length is greater than 128 bits. This rule can only be matched on SVE. c) 'neon' means this rule can only be matched on NEON, i.e. the generated instruction is not better than those in 128-bit SVE. d) 'sve' means this rule is only matched on SVE for all possible vector length, i.e. not limited to gt128b. Note-1: m4 file is not introduced because many duplications are highly reduced now. Note-2: We guess the code review for this big patch would probably take some time and we may need to merge latest code from master branch from time to time. We prefer to keep aarch64_neon/sve.ad and the corresponding m4 files for easy comparison and review. Of course, they will be finally removed after some solid reviews before integration. Note-3: Several other minor refactorings are done in this patch, but we cannot list all of them in the commit message. We have reviewed and tested the rules carefully to guarantee the quality. **TESTING** 1) Cross compilations on arm32/s390/pps/riscv passed. 2) tier1~3 jtreg passed on both x64 and aarch64 machines. 3) vector tests: all the test cases under the following directories can pass on both NEON and SVE systems with max vector length 16/32/64 bytes. "test/hotspot/jtreg/compiler/vectorapi/" "test/jdk/jdk/incubator/vector/" "test/hotspot/jtreg/compiler/vectorization/" 4) Performance evaluation: we choose vector micro-benchmarks from panama-vector:vectorIntrinsics [2] to evaluate the performance of this patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE platform and one NEON platform, and didn't see any visiable regression with NEON and SVE. We will continue to verify more cases on other platforms with NEON and different SVE vector sizes. **BENEFITS** The number of matching rules is reduced to ~ **42%**. before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 after : 313 (aarch64_vector.ad) Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. before: 25246528 B (commit 7905788e969) after : 24208776 B (**nearly 1 MB reduction**) [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation Co-Developed-by: Ningsheng Jian <ningsheng.j...@arm.com> Co-Developed-by: Eric Liu <eric.c....@arm.com> ------------- Commit messages: - 8285790: AArch64: Merge C2 NEON and SVE matching rules Changes: https://git.openjdk.org/jdk/pull/9346/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8285790 Stats: 7151 lines in 12 files changed: 6454 ins; 576 del; 121 mod Patch: https://git.openjdk.org/jdk/pull/9346.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346 PR: https://git.openjdk.org/jdk/pull/9346