On Mon, 15 May 2023 02:58:46 GMT, Chang Peng <d...@openjdk.org> wrote:
> At the Java level of the Vector API, a vector mask is represented as a
> boolean array with 0x00/0x01 values (8 bits per element), aka the in-memory
> format. When it is loaded into a vector register, e.g. on Neon, the
> in-memory format is converted to the in-register format, with a 0/-1 value
> in each lane (lane width aligned to its type), by the VectorLoadMask [1]
> operation, and it is converted back to the in-memory format by
> VectorStoreMask [2]. On Neon, a typical VectorStoreMask operation first
> narrows the given vector registers to the byte element type with xtn
> insns [3], and then does a vector negate to convert each element to a
> 0x00/0x01 value.
>
> For most vector mask operations, the input mask is in the in-register
> format, and a vector mask also stays in the in-register format all through
> the compilation. But for some operations, like VectorMask.trueCount() [4],
> which counts the elements with a true value, the expected input mask is in
> the in-memory format. So a VectorStoreMask is generated to convert the mask
> from the in-register format to the in-memory format before those operations.
>
> However, for trueCount() the xtn instructions in VectorStoreMask can be
> saved, since the narrowing operations do not influence the number of active
> lanes (value of 0x01) of the input.
>
> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask
> mask)` to save the unnecessary narrowing operations.
>
> For example,
>
>     var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
>     m.not().trueCount();
>
> will produce the following assembly on a Neon machine before this patch:
>
>     ...
>     mvn   v16.16b, v16.16b   // VectorMask.not()
>     xtn   v16.4h, v16.4s
>     xtn   v16.8b, v16.8h
>     neg   v16.8b, v16.8b     // VectorStoreMask
>     addv  b17, v16.8b
>     umov  w0, v17.b[0]       // VectorMask.trueCount()
>     ...
>
> After this patch:
>
>     ...
>     mvn   v16.16b, v16.16b   // VectorMask.not()
>     addv  s17, v16.4s
>     smov  x0, v17.b[0]
>     neg   x0, x0             // Optimized VectorMask.trueCount()
>     ...
>
> In this case, we can save two xtn insns.
>
> Performance:
>
>     Benchmark    Before             After                Unit
>     testInt      723.822 ± 1.029    1182.375 ± 12.363    ops/ms
>     testLong     632.154 ± 0.197    1382.74 ± 2.188      ops/ms
>     testShort    788.665 ± 1.852    1152.38 ± 3.77       ops/ms
>
> [1]:
> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740
> [2]: https://github.com/openjdk/jdk/blo...

This pull request has now been integrated.

Changeset: f600d036
Author:    changpeng1997 <chang.p...@arm.com>
Committer: Eric Liu <e...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/f600d0369a1f9ac78e62a328be4bbb598ffef62b
Stats:     235 lines in 5 files changed: 235 ins; 0 del; 0 mod

8307795: AArch64: Optimize VectorMask.truecount() on Neon

Reviewed-by: aph, eliu

-------------

PR: https://git.openjdk.org/jdk/pull/13974
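
For readers following the quoted description, here is a minimal plain-Java sketch of why the narrowing steps do not change the result of trueCount(). It only models the lane arithmetic of the two quoted instruction sequences on scalar values. It does not use the Vector API and is not the C2 matching rule from the patch. The class and method names are hypothetical, and the sketch assumes four int lanes (a 128-bit Neon register) holding in-register mask values of 0 or -1.

```java
class TrueCountSketch {
    // Models the sequence before the patch: narrow each 0/-1 lane to a byte (xtn),
    // negate it to 0x00/0x01 (neg), then sum the byte lanes (addv).
    static int countViaStoreMask(int[] lanes) {
        int sum = 0;
        for (int lane : lanes) {
            byte narrowed = (byte) lane;     // 0 stays 0, -1 stays -1 after narrowing
            byte stored = (byte) -narrowed;  // in-memory format: 0x00 or 0x01
            sum += stored;
        }
        return sum;
    }

    // Models the sequence after the patch: sum the 0/-1 lanes at full width (addv),
    // sign-extend the low byte of the sum (smov), then negate the scalar (neg).
    static int countDirect(int[] lanes) {
        int sum = 0;
        for (int lane : lanes) {
            sum += lane;                     // each true lane contributes -1
        }
        long low = (byte) sum;               // sign-extended low byte of the sum
        return (int) -low;
    }

    public static void main(String[] args) {
        // Check every possible 4-lane mask.
        for (int bits = 0; bits < 16; bits++) {
            int[] lanes = new int[4];
            for (int i = 0; i < 4; i++) {
                lanes[i] = ((bits >> i) & 1) == 1 ? -1 : 0;
            }
            if (countViaStoreMask(lanes) != countDirect(lanes)) {
                throw new AssertionError("mismatch for mask bits " + bits);
            }
        }
        System.out.println("both paths agree for all 4-lane masks");
    }
}
```

Because each true lane contributes exactly -1 regardless of lane width, the negated sum equals the number of true lanes in both paths, which is the property the quoted optimization relies on.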