Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]

2023-11-22 Thread Andrew Haley
On Wed, 22 Nov 2023 02:18:32 GMT, Eric Liu  wrote:

>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1412:
>> 
>>> 1410:   _sve_xunpk(is_unsigned, /* is_high */ false, dst, S, dst);
>>> 1411:   _sve_xunpk(is_unsigned, /* is_high */ false, dst, D, dst);
>>> 1412:   break;
>> 
>> Why is this change here? It doesn't do anything. Does it?
>
> `is_unsigned` is also used in this function. It is used in VectorUCastNode 
> for zero extending.

Ah, I see.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1401721522


Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]

2023-11-21 Thread Eric Liu
On Tue, 21 Nov 2023 15:07:48 GMT, Andrew Haley  wrote:

>> Eric Liu has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   add _sve_xunpk & remove dead code
>>   
>>   Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1412:
> 
>> 1410:   _sve_xunpk(is_unsigned, /* is_high */ false, dst, S, dst);
>> 1411:   _sve_xunpk(is_unsigned, /* is_high */ false, dst, D, dst);
>> 1412:   break;
> 
> Why is this change here? It doesn't do anything. Does it?

`is_unsigned` is also used in this function. It is used in VectorUCastNode for 
zero extending.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1401423194


Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]

2023-11-21 Thread Andrew Haley
On Tue, 21 Nov 2023 13:24:34 GMT, Eric Liu  wrote:

>> Vector API defines zero-extend operations [1], which are going to be 
>> intrinsified and generated to `VectorUCastNode` by C2. This patch adds 
>> backend implementation for `VectorUCastNode` on AArch64.
>> 
>> The micro benchmark shows significant performance improvement. In my test 
>> machine (SVE, 256-bit), the result is shown as below:
>> 
>> 
>> 
>>   Benchmark Before After   Units   Gain
>>   VectorZeroExtend.byte2Int 3168.251   243012.399  ops/ms  75.70
>>   VectorZeroExtend.byte2Long3212.201   216291.588  ops/ms  66.33
>>   VectorZeroExtend.byte2Short   3391.968   182655.365  ops/ms  52.85
>>   VectorZeroExtend.int2Long 1012.19780448.553  ops/ms  78.48
>>   VectorZeroExtend.short2Int1812.471   153416.828  ops/ms  83.65
>>   VectorZeroExtend.short2Long   1788.382   129794.814  ops/ms  71.58
>> 
>> 
>> On other Neon systems, we can get similar performance boost as a result of 
>> intrinsification success.
>> 
>> Since `VectorUCastNode` only used in Vector API's zero extension currently, 
>> this patch also adds assertion on nodes' definitions to clarify their usages.
>> 
>> [TEST]
>> compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines.
>> 
>> [1] 
>> https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726
>
> Eric Liu has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add _sve_xunpk & remove dead code
>   
>   Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1412:

> 1410:   _sve_xunpk(is_unsigned, /* is_high */ false, dst, S, dst);
> 1411:   _sve_xunpk(is_unsigned, /* is_high */ false, dst, D, dst);
> 1412:   break;

Why is this change here? It doesn't do anything. Does it?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1400743499


Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]

2023-11-21 Thread Andrew Haley
On Tue, 21 Nov 2023 13:24:34 GMT, Eric Liu  wrote:

>> Vector API defines zero-extend operations [1], which are going to be 
>> intrinsified and generated to `VectorUCastNode` by C2. This patch adds 
>> backend implementation for `VectorUCastNode` on AArch64.
>> 
>> The micro benchmark shows significant performance improvement. In my test 
>> machine (SVE, 256-bit), the result is shown as below:
>> 
>> 
>> 
>>   Benchmark Before After   Units   Gain
>>   VectorZeroExtend.byte2Int 3168.251   243012.399  ops/ms  75.70
>>   VectorZeroExtend.byte2Long3212.201   216291.588  ops/ms  66.33
>>   VectorZeroExtend.byte2Short   3391.968   182655.365  ops/ms  52.85
>>   VectorZeroExtend.int2Long 1012.19780448.553  ops/ms  78.48
>>   VectorZeroExtend.short2Int1812.471   153416.828  ops/ms  83.65
>>   VectorZeroExtend.short2Long   1788.382   129794.814  ops/ms  71.58
>> 
>> 
>> On other Neon systems, we can get similar performance boost as a result of 
>> intrinsification success.
>> 
>> Since `VectorUCastNode` only used in Vector API's zero extension currently, 
>> this patch also adds assertion on nodes' definitions to clarify their usages.
>> 
>> [TEST]
>> compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines.
>> 
>> [1] 
>> https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726
>
> Eric Liu has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add _sve_xunpk & remove dead code
>   
>   Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3872:

> 3870:   void sve_sunpklo(FloatRegister Zd, SIMD_RegVariant T, FloatRegister 
> Zn) {
> 3871: _sve_xunpk(/* is_unsigned */ false, /* is_high */ false, Zd, T, Zn);
> 3872:   }

This code expansion does not look right. You should be able to make this change 
without so much code expansion.


#define INSN(NAME, unsigned, high)
void name(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn) { \
  _sve_xunpk(unsigned, high, T, Zn);
\
}
  INSN(sve_uunpkhi, true, true) ...

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16670#discussion_r1400716627


Re: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v3]

2023-11-21 Thread Eric Liu
> Vector API defines zero-extend operations [1], which are going to be 
> intrinsified and generated to `VectorUCastNode` by C2. This patch adds 
> backend implementation for `VectorUCastNode` on AArch64.
> 
> The micro benchmark shows significant performance improvement. In my test 
> machine (SVE, 256-bit), the result is shown as below:
> 
> 
> 
>   Benchmark Before After   Units   Gain
>   VectorZeroExtend.byte2Int 3168.251   243012.399  ops/ms  75.70
>   VectorZeroExtend.byte2Long3212.201   216291.588  ops/ms  66.33
>   VectorZeroExtend.byte2Short   3391.968   182655.365  ops/ms  52.85
>   VectorZeroExtend.int2Long 1012.19780448.553  ops/ms  78.48
>   VectorZeroExtend.short2Int1812.471   153416.828  ops/ms  83.65
>   VectorZeroExtend.short2Long   1788.382   129794.814  ops/ms  71.58
> 
> 
> On other Neon systems, we can get similar performance boost as a result of 
> intrinsification success.
> 
> Since `VectorUCastNode` only used in Vector API's zero extension currently, 
> this patch also adds assertion on nodes' definitions to clarify their usages.
> 
> [TEST]
> compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines.
> 
> [1] 
> https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726

Eric Liu has updated the pull request incrementally with one additional commit 
since the last revision:

  add _sve_xunpk & remove dead code
  
  Change-Id: Ic19836feb8a73ea7e65443794f2a0eb1363f6e2f

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/16670/files
  - new: https://git.openjdk.org/jdk/pull/16670/files/4fb3128f..68748e7f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16670&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16670&range=01-02

  Stats: 85 lines in 2 files changed: 17 ins; 32 del; 36 mod
  Patch: https://git.openjdk.org/jdk/pull/16670.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16670/head:pull/16670

PR: https://git.openjdk.org/jdk/pull/16670