On Fri, 1 Jul 2022 10:36:36 GMT, Hao Sun <hao...@openjdk.org> wrote: > **MOTIVATION** > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > **Drawback-2**: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > **KEY UPDATES** > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > - Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. > > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > - Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves **Drawback-2**. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > **MINOR UPDATES and CODE REFACTORING** > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > - Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > - Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > - Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. > a) 'le128b' means the vector length is less than or equal to 128 bits. > This rule can be matched on both NEON and 128-bit SVE. > b) 'gt128b' means the vector length is greater than 128 bits. This rule > can only be matched on SVE. > c) 'neon' means this rule can only be matched on NEON, i.e. the > generated instruction is not better than those in 128-bit SVE. > d) 'sve' means this rule is only matched on SVE for all possible vector > length, i.e. not limited to gt128b. > > Note-1: m4 file is not introduced because many duplications are highly > reduced now. > Note-2: We guess the code review for this big patch would probably take > some time and we may need to merge latest code from master branch from > time to time. We prefer to keep aarch64_neon/sve.ad and the > corresponding m4 files for easy comparison and review. Of course, they > will be finally removed after some solid reviews before integration. > Note-3: Several other minor refactorings are done in this patch, but we > cannot list all of them in the commit message. We have reviewed and > tested the rules carefully to guarantee the quality. > > **TESTING** > > 1) Cross compilations on arm32/s390/pps/riscv passed. > 2) tier1~3 jtreg passed on both x64 and aarch64 machines. > 3) vector tests: all the test cases under the following directories can > pass on both NEON and SVE systems with max vector length 16/32/64 bytes. > > "test/hotspot/jtreg/compiler/vectorapi/" > "test/jdk/jdk/incubator/vector/" > "test/hotspot/jtreg/compiler/vectorization/" > > 4) Performance evaluation: we choose vector micro-benchmarks from > panama-vector:vectorIntrinsics [2] to evaluate the performance of this > patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE > platform and one NEON platform, and didn't see any visiable regression > with NEON and SVE. We will continue to verify more cases on other > platforms with NEON and different SVE vector sizes. > > **BENEFITS** > > The number of matching rules is reduced to ~ **42%**. > before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 > after : 313 (aarch64_vector.ad) > > Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. > before: 25246528 B (commit 7905788e969) > after : 24208776 B (**nearly 1 MB reduction**) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] > https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian <ningsheng.j...@arm.com> > Co-Developed-by: Eric Liu <eric.c....@arm.com>
Aha! I was looking forward to this. On 7/1/22 11:46, Hao Sun wrote: > Note-1: m4 file is not introduced because many duplications are highly > reduced now. Yes, but there's still a lot of duplications. I'll make a few examples of where you should make simple changes that will usefully increase the level of abstraction. That will be a start. ------------- PR: https://git.openjdk.org/jdk/pull/9346