On Fri, 1 Jul 2022 11:25:36 GMT, Andrew Haley <a...@openjdk.org> wrote:
>> **MOTIVATION** >> >> This is a big refactoring patch of merging rules in aarch64_sve.ad and >> aarch64_neon.ad. The motivation can also be found at [1]. >> >> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE >> and NEON codegen respectively. 1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions. >> >> **Drawback-2**: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. >> >> **KEY UPDATES** >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> - Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> - Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves **Drawback-2**. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. >> >> It's important to note that there are some exceptions. >> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. >> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> **MINOR UPDATES and CODE REFACTORING** >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> - Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> - Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> - Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. >> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> **TESTING** >> >> 1) Cross compilations on arm32/s390/pps/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. >> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. >> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visiable regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. >> >> **BENEFITS** >> >> The number of matching rules is reduced to ~ **42%**. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313 (aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (**nearly 1 MB reduction**) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] >> https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian <ningsheng.j...@arm.com> >> Co-Developed-by: Eric Liu <eric.c....@arm.com> > > Aha! I was looking forward to this. > > On 7/1/22 11:46, Hao Sun wrote: > > Note-1: m4 file is not introduced because many duplications are highly > > reduced now. > > Yes, but there's still a lot of duplications. I'll make a few examples > of where you should make simple changes that will usefully increase the > level of abstraction. That will be a start. > @theRealAph Thanks for your comment. Yes. There are still duplicate code. I > can easily list several ones, such as the reduce-and/or/xor, vector shift ops > and several reg with imm rules. We're open to keep m4 file. > > But I would suggest that we may put our attention firstly on 1) our > implementation on generic vector registers and 2) the merged rules (in > particular those we share the codegen for NEON only platform and 128-bit > vector ops on SVE platform). After that we may discuss whether to use m4 file > and how to implement it if needed. We can do both: there's no sense in which one excludes the other, and we have time. However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this: if (length_in_bytes <= 16) { // ... Neon } else { assert(UseSVE > 0, "must be sve"); // ... SVE } which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption? ------------- PR: https://git.openjdk.org/jdk/pull/9346