On Mon, 15 May 2023 02:58:46 GMT, Chang Peng <d...@openjdk.org> wrote:
> At the Java level of the Vector API, a vector mask is represented as a
> boolean array with 0x00/0x01 values (8 bits per element), a.k.a. the
> in-memory format. When it is loaded into a vector register, e.g. on Neon,
> the in-memory format is converted to the in-register format, with a 0/-1
> value for each lane (lane width aligned to its type), by the
> VectorLoadMask [1] operation, and converted back to the in-memory format by
> VectorStoreMask [2]. On Neon, a typical VectorStoreMask operation first
> narrows the given vector registers to byte element type with xtn insns [3],
> and then does a vector negate to convert each element to 0x00/0x01.
>
> For most vector mask operations, the input mask is in the in-register
> format, and a vector mask stays in the in-register format all through the
> compilation. But for some operations, like VectorMask.trueCount() [4], which
> counts the elements with a true value, the expected input mask is in the
> in-memory format. So a VectorStoreMask is generated to convert the mask from
> the in-register format to the in-memory format before those operations.
>
> However, for trueCount() these xtn instructions in VectorStoreMask can be
> saved, since the narrowing operations do not change the number of active
> lanes (value of 0x01) of the input.
>
> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask
> mask)` to save the unnecessary narrowing operations.
>
> For example,
>
>     var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
>     m.not().trueCount();
>
> will produce the following assembly on a Neon machine before this patch:
>
>     ...
>     mvn   v16.16b, v16.16b    // VectorMask.not()
>     xtn   v16.4h, v16.4s
>     xtn   v16.8b, v16.8h
>     neg   v16.8b, v16.8b      // VectorStoreMask
>     addv  b17, v16.8b
>     umov  w0, v17.b[0]        // VectorMask.trueCount()
>     ...
>
> After this patch:
>
>     ...
>     mvn   v16.16b, v16.16b    // VectorMask.not()
>     addv  s17, v16.4s
>     smov  x0, v17.b[0]
>     neg   x0, x0              // Optimized VectorMask.trueCount()
>     ...
>
> In this case, we can save two xtn insns.
>
> Performance:
>
>     Benchmark    Before             After                Unit
>     testInt      723.822 ± 1.029    1182.375 ± 12.363    ops/ms
>     testLong     632.154 ± 0.197    1382.74 ± 2.188      ops/ms
>     testShort    788.665 ± 1.852    1152.38 ± 3.77       ops/ms
>
> [1]:
> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740
> [2]: https://github.com/openjdk/jdk/b...

That makes sense. Is it likely that there are more of these combined
operations on vector masks that could be matched? If so, it might make sense
to do the job earlier, in the C2 optimizer.

test/micro/org/openjdk/bench/jdk/incubator/vector/StoreMaskTrueCount.java line 80:

> 78:     m = m.not();
> 79:     res += m.trueCount();
> 80: }

This looks like it might be removed by loop opts. I think you might need a
blackhole somewhere.

-------------

Marked as reviewed by aph (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13974#pullrequestreview-1426088017
PR Review Comment: https://git.openjdk.org/jdk/pull/13974#discussion_r1193540231
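
For readers following the arithmetic in the quoted description, here is a scalar sketch of why the narrowing step can be skipped. It is only a model under stated assumptions: each in-register lane is an int holding 0 or -1, and the class and method names are hypothetical and do not come from the patch or from HotSpot.

```java
// Scalar model of the two instruction sequences; not HotSpot code.
class MaskTrueCountSketch {

    // Hypothetical stand-in for VectorLoadMask: true becomes -1, false becomes 0.
    static int[] loadMask(boolean[] inMemory) {
        int[] lanes = new int[inMemory.length];
        for (int i = 0; i < inMemory.length; i++) {
            lanes[i] = inMemory[i] ? -1 : 0;
        }
        return lanes;
    }

    // Models the pre-patch path: narrow each lane to a byte (xtn),
    // negate it to get 0x00/0x01 (neg), then sum the bytes (addv).
    static int trueCountViaStoreMask(int[] lanes) {
        int sum = 0;
        for (int lane : lanes) {
            byte narrowed = (byte) lane;      // still 0 or -1
            byte stored = (byte) -narrowed;   // 0x00 or 0x01
            sum += stored;
        }
        return sum;
    }

    // Models the post-patch path: sum the 0/-1 lanes directly (addv),
    // then negate the scalar result once (neg).
    static int trueCountDirect(int[] lanes) {
        int sum = 0;
        for (int lane : lanes) {
            sum += lane;
        }
        return -sum;
    }

    public static void main(String[] args) {
        boolean[] ba = {true, false, true, true};
        int[] lanes = loadMask(ba);
        System.out.println("via store mask: " + trueCountViaStoreMask(lanes));
        System.out.println("direct:         " + trueCountDirect(lanes));
    }
}
```

Both methods return the same count for any 0/-1 input, because negating each lane and then summing gives the same result as summing and then negating once. The model ignores lane-count limits and register widths, which the real match rule has to respect.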