Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]
On Wed, 22 Nov 2023 02:18:32 GMT, Eric Liu wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1412: >> >>> 1410: _sve_xunpk(is_unsigned, /* is_high */ false, dst, S, dst); >>> 1411: _sve_xunpk(is_unsigned, /* is_high */ false, dst, D, dst); >>> 1412: break; >> >> Why is this change here? It doesn't do anything. Does it? > > `is_unsigned` is also used in this function. It is used in VectorUCastNode > for zero extending. Ah, I see. - PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1401721522
Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]
On Tue, 21 Nov 2023 15:07:48 GMT, Andrew Haley wrote: >> Eric Liu has updated the pull request incrementally with one additional >> commit since the last revision: >> >> add _sve_xunpk & remove dead code >> >> Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1412: > >> 1410: _sve_xunpk(is_unsigned, /* is_high */ false, dst, S, dst); >> 1411: _sve_xunpk(is_unsigned, /* is_high */ false, dst, D, dst); >> 1412: break; > > Why is this change here? It doesn't do anything. Does it? `is_unsigned` is also used in this function. It is used in VectorUCastNode for zero extending. - PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1401423194
Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]
On Tue, 21 Nov 2023 13:24:34 GMT, Eric Liu wrote: >> Vector API defines zero-extend operations [1], which are going to be >> intrinsified and generated to `VectorUCastNode` by C2. This patch adds >> backend implementation for `VectorUCastNode` on AArch64. >> >> The micro benchmark shows significant performance improvement. In my test >> machine (SVE, 256-bit), the result is shown as below: >> >> >> >> Benchmark Before After Units Gain >> VectorZeroExtend.byte2Int 3168.251 243012.399 ops/ms 75.70 >> VectorZeroExtend.byte2Long3212.201 216291.588 ops/ms 66.33 >> VectorZeroExtend.byte2Short 3391.968 182655.365 ops/ms 52.85 >> VectorZeroExtend.int2Long 1012.19780448.553 ops/ms 78.48 >> VectorZeroExtend.short2Int1812.471 153416.828 ops/ms 83.65 >> VectorZeroExtend.short2Long 1788.382 129794.814 ops/ms 71.58 >> >> >> On other Neon systems, we can get similar performance boost as a result of >> intrinsification success. >> >> Since `VectorUCastNode` only used in Vector API's zero extension currently, >> this patch also adds assertion on nodes' definitions to clarify their usages. >> >> [TEST] >> compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines. >> >> [1] >> https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726 > > Eric Liu has updated the pull request incrementally with one additional > commit since the last revision: > > add _sve_xunpk & remove dead code > > Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1412: > 1410: _sve_xunpk(is_unsigned, /* is_high */ false, dst, S, dst); > 1411: _sve_xunpk(is_unsigned, /* is_high */ false, dst, D, dst); > 1412: break; Why is this change here? It doesn't do anything. Does it? - PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1400743499
Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]
On Tue, 21 Nov 2023 13:24:34 GMT, Eric Liu wrote: >> Vector API defines zero-extend operations [1], which are going to be >> intrinsified and generated to `VectorUCastNode` by C2. This patch adds >> backend implementation for `VectorUCastNode` on AArch64. >> >> The micro benchmark shows significant performance improvement. In my test >> machine (SVE, 256-bit), the result is shown as below: >> >> >> >> Benchmark Before After Units Gain >> VectorZeroExtend.byte2Int 3168.251 243012.399 ops/ms 75.70 >> VectorZeroExtend.byte2Long3212.201 216291.588 ops/ms 66.33 >> VectorZeroExtend.byte2Short 3391.968 182655.365 ops/ms 52.85 >> VectorZeroExtend.int2Long 1012.19780448.553 ops/ms 78.48 >> VectorZeroExtend.short2Int1812.471 153416.828 ops/ms 83.65 >> VectorZeroExtend.short2Long 1788.382 129794.814 ops/ms 71.58 >> >> >> On other Neon systems, we can get similar performance boost as a result of >> intrinsification success. >> >> Since `VectorUCastNode` only used in Vector API's zero extension currently, >> this patch also adds assertion on nodes' definitions to clarify their usages. >> >> [TEST] >> compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines. >> >> [1] >> https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726 > > Eric Liu has updated the pull request incrementally with one additional > commit since the last revision: > > add _sve_xunpk & remove dead code > > Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3872: > 3870: void sve_sunpklo(FloatRegister Zd, SIMD_RegVariant T, FloatRegister > Zn) { > 3871: _sve_xunpk(/* is_unsigned */ false, /* is_high */ false, Zd, T, Zn); > 3872: } This code expansion does not look right. You should be able to make this change without so much code expansion. #define INSN(NAME, unsigned, high) void name(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn) { \ _sve_xunpk(unsigned, high, T, Zn); \ } INSN(sve_uunpkhi, true, true) ... - PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1400716627
Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]
> Vector API defines zero-extend operations [1], which are going to be > intrinsified and generated to `VectorUCastNode` by C2. This patch adds > backend implementation for `VectorUCastNode` on AArch64. > > The micro benchmark shows significant performance improvement. In my test > machine (SVE, 256-bit), the result is shown as below: > > > > Benchmark Before After Units Gain > VectorZeroExtend.byte2Int 3168.251 243012.399 ops/ms 75.70 > VectorZeroExtend.byte2Long3212.201 216291.588 ops/ms 66.33 > VectorZeroExtend.byte2Short 3391.968 182655.365 ops/ms 52.85 > VectorZeroExtend.int2Long 1012.19780448.553 ops/ms 78.48 > VectorZeroExtend.short2Int1812.471 153416.828 ops/ms 83.65 > VectorZeroExtend.short2Long 1788.382 129794.814 ops/ms 71.58 > > > On other Neon systems, we can get similar performance boost as a result of > intrinsification success. > > Since `VectorUCastNode` only used in Vector API's zero extension currently, > this patch also adds assertion on nodes' definitions to clarify their usages. > > [TEST] > compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines. > > [1] > https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726 Eric Liu has updated the pull request incrementally with one additional commit since the last revision: add _sve_xunpk & remove dead code Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f - Changes: - all: https://git.openjdk.org/jdk/pull/16670/files - new: https://git.openjdk.org/jdk/pull/16670/files/4fb3128f..68748e7f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16670&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16670&range=01-02 Stats: 85 lines in 2 files changed: 17 ins; 32 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/16670.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16670/head:pull/16670 PR: https://git.openjdk.org/jdk/pull/16670