Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]
On Thu, 18 May 2023 09:50:13 GMT, Chang Peng wrote: >> In Vector API Java level, vector mask is represented as a boolean array with >> 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it >> is loaded into vector register, e.g. Neon, the in-memory format will be >> converted to in-register format with 0/-1 value for each lane (lane width >> aligned to its type) by VectorLoadMask [1] operation, and convert back to >> in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask >> operation will first narrow given vector registers by xtn insn [3] into byte >> element type, and then do a vector negate to convert to 0x00/0x01 value for >> each element. >> >> For most of the vector mask operations, the input mask is in-register >> format. And a vector mask also works in-register format all through the >> compilation. But for some operations like VectorMask.trueCount()[4] which >> counts the elements of true value, the expected input mask is in-memory >> format. So a VectorStoreMask is generated to convert the mask from >> in-register format to in-memory format before those operations. >> >> However, for trueCount() these xtn instructions in VectorStoreMask can be >> saved, since the narrowing operations will not influence the number of >> active lane (value of 0x01) of its input. >> >> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask >> mask)` to save the unnecessary narrowing operations. >> >> For example, >> >> >> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0); >> m.not().trueCount(); >> >> >> will produce following assembly on a Neon machine before this patch: >> >> >> ... >> mvn v16.16b, v16.16b // VectorMask.not() >> xtn v16.4h, v16.4s >> xtn v16.8b, v16.8h >> neg v16.8b, v16.8b // VectorStoreMask >> addvb17, v16.8b >> umovw0, v17.b[0] // VectorMask.trueCount() >> ... >> >> >> After this patch: >> >> >> ... >> mvn v16.16b, v16.16b // VectorMask.not() >> addvs17, v16.4s >> smovx0, v17.b[0] >> neg x0, x0 // Optimized VectorMask.trueCount() >> ... >> >> >> In this case, we can save two xtn insns. >> >> Performance: >> >> Benchmark Before After Unit >> testInt 723.822 ± 1.029 1182.375 ± 12.363ops/ms >> testLong 632.154 ± 0.197 1382.74 ± 2.188ops/ms >> testShort 788.665 ± 1.852 1152.38 ± 3.77 ops/ms >> >> [1]: >> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vect... > > Chang Peng has updated the pull request incrementally with one additional > commit since the last revision: > > Update benchmark to avoid potential optimization Marked as reviewed by aph (Reviewer). src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3825: > 3823: %} > 3824: > 3825: // Combined rule for VectorStoreMask + VectorMaskTrueCount when the > vector element type is not T_BYTE. Suggestion: // Combined rule for VectorMaskTrueCount (VectorStoreMask) when the vector element type is not T_BYTE. Using `+` is unnecessarily confusing, as is swapping the operands. - PR Review: https://git.openjdk.org/jdk/pull/13974#pullrequestreview-1443414797 PR Review Comment: https://git.openjdk.org/jdk/pull/13974#discussion_r1205210387
Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]
On Thu, 18 May 2023 09:50:13 GMT, Chang Peng wrote: >> In Vector API Java level, vector mask is represented as a boolean array with >> 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it >> is loaded into vector register, e.g. Neon, the in-memory format will be >> converted to in-register format with 0/-1 value for each lane (lane width >> aligned to its type) by VectorLoadMask [1] operation, and convert back to >> in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask >> operation will first narrow given vector registers by xtn insn [3] into byte >> element type, and then do a vector negate to convert to 0x00/0x01 value for >> each element. >> >> For most of the vector mask operations, the input mask is in-register >> format. And a vector mask also works in-register format all through the >> compilation. But for some operations like VectorMask.trueCount()[4] which >> counts the elements of true value, the expected input mask is in-memory >> format. So a VectorStoreMask is generated to convert the mask from >> in-register format to in-memory format before those operations. >> >> However, for trueCount() these xtn instructions in VectorStoreMask can be >> saved, since the narrowing operations will not influence the number of >> active lane (value of 0x01) of its input. >> >> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask >> mask)` to save the unnecessary narrowing operations. >> >> For example, >> >> >> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0); >> m.not().trueCount(); >> >> >> will produce following assembly on a Neon machine before this patch: >> >> >> ... >> mvn v16.16b, v16.16b // VectorMask.not() >> xtn v16.4h, v16.4s >> xtn v16.8b, v16.8h >> neg v16.8b, v16.8b // VectorStoreMask >> addvb17, v16.8b >> umovw0, v17.b[0] // VectorMask.trueCount() >> ... >> >> >> After this patch: >> >> >> ... >> mvn v16.16b, v16.16b // VectorMask.not() >> addvs17, v16.4s >> smovx0, v17.b[0] >> neg x0, x0 // Optimized VectorMask.trueCount() >> ... >> >> >> In this case, we can save two xtn insns. >> >> Performance: >> >> Benchmark Before After Unit >> testInt 723.822 ± 1.029 1182.375 ± 12.363ops/ms >> testLong 632.154 ± 0.197 1382.74 ± 2.188ops/ms >> testShort 788.665 ± 1.852 1152.38 ± 3.77 ops/ms >> >> [1]: >> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vect... > > Chang Peng has updated the pull request incrementally with one additional > commit since the last revision: > > Update benchmark to avoid potential optimization LGTM. - Marked as reviewed by eliu (Committer). PR Review: https://git.openjdk.org/jdk/pull/13974#pullrequestreview-1436444533
Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]
On Mon, 15 May 2023 10:59:11 GMT, Andrew Haley wrote: > > > This looks like it might be removed by loop opts. I think you might need > > > a blackhole somewhere. > > > > > > `m` will be updated in every iteration of this loop, so `m` is not a > > loop-invariants actually. I can see the assembly code of this loop by using > > JMH perfasm. > > Isn't it? Looks to me like all it does is flip `m` each time. Whether or not > this code is optimized today isn't relevant. > > So it's the same as > > ``` > for (int i = 0; i < LENGTH/2; i++) { > res += m.trueCount(); > } > m = m.not(); > for (int i = 0; i < LENGTH/2; i++) { > res += m.trueCount(); > } > ``` > > ... which is trivially optimizable, no? Sorry for the delay. Yes, actually they do the same thing, though current C2 compiler cannot do such optimization so far. Anyway, I have updated this benchmark to avoid potential optimization and ensure that we can measure performance effectively. - PR Review Comment: https://git.openjdk.org/jdk/pull/13974#discussion_r1197626911
Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]
> In Vector API Java level, vector mask is represented as a boolean array with > 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it > is loaded into vector register, e.g. Neon, the in-memory format will be > converted to in-register format with 0/-1 value for each lane (lane width > aligned to its type) by VectorLoadMask [1] operation, and convert back to > in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask > operation will first narrow given vector registers by xtn insn [3] into byte > element type, and then do a vector negate to convert to 0x00/0x01 value for > each element. > > For most of the vector mask operations, the input mask is in-register format. > And a vector mask also works in-register format all through the compilation. > But for some operations like VectorMask.trueCount()[4] which counts the > elements of true value, the expected input mask is in-memory format. So a > VectorStoreMask is generated to convert the mask from in-register format to > in-memory format before those operations. > > However, for trueCount() these xtn instructions in VectorStoreMask can be > saved, since the narrowing operations will not influence the number of active > lane (value of 0x01) of its input. > > This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask > mask)` to save the unnecessary narrowing operations. > > For example, > > > var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0); > m.not().trueCount(); > > > will produce following assembly on a Neon machine before this patch: > > > ... > mvn v16.16b, v16.16b // VectorMask.not() > xtn v16.4h, v16.4s > xtn v16.8b, v16.8h > neg v16.8b, v16.8b // VectorStoreMask > addvb17, v16.8b > umovw0, v17.b[0] // VectorMask.trueCount() > ... > > > After this patch: > > > ... > mvn v16.16b, v16.16b // VectorMask.not() > addvs17, v16.4s > smovx0, v17.b[0] > neg x0, x0 // Optimized VectorMask.trueCount() > ... > > > In this case, we can save two xtn insns. > > Performance: > > Benchmark Before After Unit > testInt 723.822 ± 1.029 1182.375 ± 12.363ops/ms > testLong 632.154 ± 0.197 1382.74 ± 2.188ops/ms > testShort 788.665 ± 1.852 1152.38 ± 3.77 ops/ms > > [1]: > https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740 > [2]: https://github.com/openjdk/jdk/b... Chang Peng has updated the pull request incrementally with one additional commit since the last revision: Update benchmark to avoid potential optimization - Changes: - all: https://git.openjdk.org/jdk/pull/13974/files - new: https://git.openjdk.org/jdk/pull/13974/files/49e35b63..567f69a2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=01-02 Stats: 20 lines in 4 files changed: 0 ins; 5 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/13974.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13974/head:pull/13974 PR: https://git.openjdk.org/jdk/pull/13974