Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]

2023-05-25 Thread Andrew Haley
On Thu, 18 May 2023 09:50:13 GMT, Chang Peng  wrote:

>> In Vector API Java level, vector mask is represented as a boolean array with 
>> 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it 
>> is loaded into vector register, e.g. Neon, the in-memory format will be 
>> converted to in-register format with 0/-1 value for each lane (lane width 
>> aligned to its type) by VectorLoadMask [1] operation, and convert back to 
>> in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask 
>> operation will first narrow given vector registers by xtn insn [3] into byte 
>> element type, and then do a vector negate to convert to 0x00/0x01 value for 
>> each element.
>> 
>> For most of the vector mask operations, the input mask is in-register 
>> format. And a vector mask also works in-register format all through the 
>> compilation. But for some operations like VectorMask.trueCount()[4] which 
>> counts the elements of true value, the expected input mask is in-memory 
>> format. So a VectorStoreMask is generated to convert the mask from 
>> in-register format to in-memory format before those operations.
>> 
>> However, for trueCount() these xtn instructions in VectorStoreMask can be 
>> saved, since the narrowing operations will not influence the number of 
>> active lane (value of 0x01) of its input.
>> 
>> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask 
>> mask)` to save the unnecessary narrowing operations.
>> 
>> For example,
>> 
>> 
>> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
>> m.not().trueCount();
>> 
>> 
>> will produce following assembly on a Neon machine before this patch:
>> 
>> 
>> ...
>> mvn v16.16b, v16.16b   // VectorMask.not()
>> xtn v16.4h, v16.4s
>> xtn v16.8b, v16.8h
>> neg v16.8b, v16.8b // VectorStoreMask
>> addvb17, v16.8b
>> umovw0, v17.b[0]   // VectorMask.trueCount()
>> ...
>> 
>> 
>> After this patch:
>> 
>> 
>> ...
>> mvn v16.16b, v16.16b   // VectorMask.not()
>> addvs17, v16.4s
>> smovx0, v17.b[0]
>> neg x0, x0 // Optimized VectorMask.trueCount()
>> ...
>> 
>> 
>> In this case, we can save two xtn insns.
>> 
>> Performance:
>> 
>> Benchmark Before   After  Unit
>> testInt   723.822 ± 1.029  1182.375 ± 12.363ops/ms
>> testLong   632.154 ± 0.197  1382.74 ± 2.188ops/ms
>> testShort  788.665 ± 1.852   1152.38 ± 3.77 ops/ms
>> 
>> [1]: 
>> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vect...
>
> Chang Peng has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update benchmark to avoid potential optimization

Marked as reviewed by aph (Reviewer).

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3825:

> 3823: %}
> 3824: 
> 3825: // Combined rule for VectorStoreMask + VectorMaskTrueCount when the 
> vector element type is not T_BYTE.

Suggestion:

// Combined rule for VectorMaskTrueCount (VectorStoreMask) when the vector 
element type is not T_BYTE.

Using `+` is unnecessarily confusing, as is swapping the operands.

-

PR Review: https://git.openjdk.org/jdk/pull/13974#pullrequestreview-1443414797
PR Review Comment: https://git.openjdk.org/jdk/pull/13974#discussion_r1205210387


Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]

2023-05-22 Thread Eric Liu
On Thu, 18 May 2023 09:50:13 GMT, Chang Peng  wrote:

>> In Vector API Java level, vector mask is represented as a boolean array with 
>> 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it 
>> is loaded into vector register, e.g. Neon, the in-memory format will be 
>> converted to in-register format with 0/-1 value for each lane (lane width 
>> aligned to its type) by VectorLoadMask [1] operation, and convert back to 
>> in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask 
>> operation will first narrow given vector registers by xtn insn [3] into byte 
>> element type, and then do a vector negate to convert to 0x00/0x01 value for 
>> each element.
>> 
>> For most of the vector mask operations, the input mask is in-register 
>> format. And a vector mask also works in-register format all through the 
>> compilation. But for some operations like VectorMask.trueCount()[4] which 
>> counts the elements of true value, the expected input mask is in-memory 
>> format. So a VectorStoreMask is generated to convert the mask from 
>> in-register format to in-memory format before those operations.
>> 
>> However, for trueCount() these xtn instructions in VectorStoreMask can be 
>> saved, since the narrowing operations will not influence the number of 
>> active lane (value of 0x01) of its input.
>> 
>> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask 
>> mask)` to save the unnecessary narrowing operations.
>> 
>> For example,
>> 
>> 
>> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
>> m.not().trueCount();
>> 
>> 
>> will produce following assembly on a Neon machine before this patch:
>> 
>> 
>> ...
>> mvn v16.16b, v16.16b   // VectorMask.not()
>> xtn v16.4h, v16.4s
>> xtn v16.8b, v16.8h
>> neg v16.8b, v16.8b // VectorStoreMask
>> addvb17, v16.8b
>> umovw0, v17.b[0]   // VectorMask.trueCount()
>> ...
>> 
>> 
>> After this patch:
>> 
>> 
>> ...
>> mvn v16.16b, v16.16b   // VectorMask.not()
>> addvs17, v16.4s
>> smovx0, v17.b[0]
>> neg x0, x0 // Optimized VectorMask.trueCount()
>> ...
>> 
>> 
>> In this case, we can save two xtn insns.
>> 
>> Performance:
>> 
>> Benchmark Before   After  Unit
>> testInt   723.822 ± 1.029  1182.375 ± 12.363ops/ms
>> testLong   632.154 ± 0.197  1382.74 ± 2.188ops/ms
>> testShort  788.665 ± 1.852   1152.38 ± 3.77 ops/ms
>> 
>> [1]: 
>> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vect...
>
> Chang Peng has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update benchmark to avoid potential optimization

LGTM.

-

Marked as reviewed by eliu (Committer).

PR Review: https://git.openjdk.org/jdk/pull/13974#pullrequestreview-1436444533


Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]

2023-05-18 Thread Chang Peng
On Mon, 15 May 2023 10:59:11 GMT, Andrew Haley  wrote:

> > > This looks like it might be removed by loop opts. I think you might need 
> > > a blackhole somewhere.
> > 
> > 
> > `m` will be updated in every iteration of this loop, so `m` is not a 
> > loop-invariants actually. I can see the assembly code of this loop by using 
> > JMH perfasm.
> 
> Isn't it? Looks to me like all it does is flip `m` each time. Whether or not 
> this code is optimized today isn't relevant.
> 
> So it's the same as
> 
> ```
> for (int i = 0; i < LENGTH/2; i++) {
> res += m.trueCount();
> }
> m = m.not();
> for (int i = 0; i < LENGTH/2; i++) {
> res += m.trueCount();
> } 
> ```
> 
> ... which is trivially optimizable, no?

Sorry for the delay.

Yes, actually they do the same thing, though current C2 compiler cannot do such 
optimization so far. Anyway, I have updated this benchmark to avoid potential 
optimization and ensure that we can measure performance effectively.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/13974#discussion_r1197626911


Re: RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]

2023-05-18 Thread Chang Peng
> In Vector API Java level, vector mask is represented as a boolean array with 
> 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it 
> is loaded into vector register, e.g. Neon, the in-memory format will be 
> converted to in-register format with 0/-1 value for each lane (lane width 
> aligned to its type) by VectorLoadMask [1] operation, and convert back to 
> in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask 
> operation will first narrow given vector registers by xtn insn [3] into byte 
> element type, and then do a vector negate to convert to 0x00/0x01 value for 
> each element.
> 
> For most of the vector mask operations, the input mask is in-register format. 
> And a vector mask also works in-register format all through the compilation. 
> But for some operations like VectorMask.trueCount()[4] which counts the 
> elements of true value, the expected input mask is in-memory format. So a 
> VectorStoreMask is generated to convert the mask from in-register format to 
> in-memory format before those operations.
> 
> However, for trueCount() these xtn instructions in VectorStoreMask can be 
> saved, since the narrowing operations will not influence the number of active 
> lane (value of 0x01) of its input.
> 
> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask 
> mask)` to save the unnecessary narrowing operations.
> 
> For example,
> 
> 
> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
> m.not().trueCount();
> 
> 
> will produce following assembly on a Neon machine before this patch:
> 
> 
> ...
> mvn v16.16b, v16.16b   // VectorMask.not()
> xtn v16.4h, v16.4s
> xtn v16.8b, v16.8h
> neg v16.8b, v16.8b // VectorStoreMask
> addvb17, v16.8b
> umovw0, v17.b[0]   // VectorMask.trueCount()
> ...
> 
> 
> After this patch:
> 
> 
> ...
> mvn v16.16b, v16.16b   // VectorMask.not()
> addvs17, v16.4s
> smovx0, v17.b[0]
> neg x0, x0 // Optimized VectorMask.trueCount()
> ...
> 
> 
> In this case, we can save two xtn insns.
> 
> Performance:
> 
> Benchmark Before   After  Unit
> testInt   723.822 ± 1.029  1182.375 ± 12.363ops/ms
> testLong   632.154 ± 0.197  1382.74 ± 2.188ops/ms
> testShort  788.665 ± 1.852   1152.38 ± 3.77 ops/ms
> 
> [1]: 
> https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740
> [2]: https://github.com/openjdk/jdk/b...

Chang Peng has updated the pull request incrementally with one additional 
commit since the last revision:

  Update benchmark to avoid potential optimization

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/13974/files
  - new: https://git.openjdk.org/jdk/pull/13974/files/49e35b63..567f69a2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=01-02

  Stats: 20 lines in 4 files changed: 0 ins; 5 del; 15 mod
  Patch: https://git.openjdk.org/jdk/pull/13974.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13974/head:pull/13974

PR: https://git.openjdk.org/jdk/pull/13974