Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot()

2021-03-30 Thread Xiaohong Gong
On Fri, 26 Mar 2021 01:50:33 GMT, Xiaohong Gong  wrote:

> Currently `VectorMask.andNot()` is not vectorized:
> public VectorMask andNot(VectorMask m) {
> // FIXME: Generate good code here.
> return bOp(m, (i, a, b) -> a && !b);
> }
> This can be implemented with `and(m.not())` directly. Since 
> `VectorMask.and()/not()` have been vectorized in HotSpot, `andNot`
> can be vectorized as well by calling them.
> 
> The performance gain is more than 100% for a simple JMH benchmark such as:
>   @Benchmark
>   public Object andNot(Blackhole bh) {
> boolean[] mask = fm.apply(SPECIES.length());
> boolean[] r = fmt.apply(SPECIES.length());
> VectorMask rm = VectorMask.fromArray(SPECIES, r, 0);
> 
> for (int ic = 0; ic < INVOC_COUNT; ic++) {
>   for (int i = 0; i < mask.length; i += SPECIES.length()) {
> VectorMask vmask = VectorMask.fromArray(SPECIES, mask, i);
>   rm = rm.andNot(vmask);
> }
> }
> return rm;
>   }
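The rewrite described above rests on a simple per-lane boolean identity. A minimal scalar sketch of it in plain Java (hypothetical class name, no Vector API required, not part of the patch):

```java
// Per-lane model: the old bOp lambda computes a && !b, which for booleans
// is exactly AND with the negated operand -- i.e. what and(m.not()) computes.
public class AndNotIdentity {
    static boolean andNotOld(boolean a, boolean b) { return a && !b; } // old bOp lambda
    static boolean andNotNew(boolean a, boolean b) { return a & !b;  } // and(not(b))

    public static void main(String[] args) {
        boolean[] vals = {false, true};
        for (boolean a : vals) {
            for (boolean b : vals) {
                if (andNotOld(a, b) != andNotNew(a, b)) {
                    throw new AssertionError("mismatch: a=" + a + " b=" + b);
                }
            }
        }
        System.out.println("a && !b == a & !b for all boolean lane values");
    }
}
```

Because the identity holds lane-wise, the whole-mask operation can be delegated to the already-intrinsified `and` and `not`.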

Hi there, could anyone please take a look at this PR? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/3211


Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]

2021-03-31 Thread Xiaohong Gong
> Currently `VectorMask.andNot()` is not vectorized:
> public VectorMask andNot(VectorMask m) {
> // FIXME: Generate good code here.
> return bOp(m, (i, a, b) -> a && !b);
> }
> This can be implemented with `and(m.not())` directly. Since 
> `VectorMask.and()/not()` have been vectorized in HotSpot, `andNot`
> can be vectorized as well by calling them.
> 
> The performance gain is more than 100% for a simple JMH benchmark such as:
>   @Benchmark
>   public Object andNot(Blackhole bh) {
> boolean[] mask = fm.apply(SPECIES.length());
> boolean[] r = fmt.apply(SPECIES.length());
> VectorMask rm = VectorMask.fromArray(SPECIES, r, 0);
> 
> for (int ic = 0; ic < INVOC_COUNT; ic++) {
>   for (int i = 0; i < mask.length; i += SPECIES.length()) {
> VectorMask vmask = VectorMask.fromArray(SPECIES, mask, i);
>   rm = rm.andNot(vmask);
> }
> }
> return rm;
>   }

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Move the changing to AbstractMask.andNot and revert changes in VectorMask
  
  Change-Id: Ie3ec8f53cb24526c8e1ccc68038355d024910818
  CustomizedGitHooks: yes

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3211/files
  - new: https://git.openjdk.java.net/jdk/pull/3211/files/ccb73d92..33ac041e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3211&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3211&range=00-01

  Stats: 9 lines in 2 files changed: 5 ins; 2 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3211.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3211/head:pull/3211

PR: https://git.openjdk.java.net/jdk/pull/3211


Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]

2021-03-31 Thread Xiaohong Gong
On Wed, 31 Mar 2021 16:42:09 GMT, Paul Sandoz  wrote:

> Would you mind updating the existing `AbstractMask.andNot` implementation, 
> rather than changing `VectorMask.andNot`? That fits in with the current 
> implementation pattern.

Hi @PaulSandoz , thanks for looking at this PR. I've updated the patch according 
to your comments. Would you mind having a look at it again? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/3211


Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]

2021-04-01 Thread Xiaohong Gong
On Thu, 1 Apr 2021 15:13:31 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Move the changing to AbstractMask.andNot and revert changes in VectorMask
>>   
>>   Change-Id: Ie3ec8f53cb24526c8e1ccc68038355d024910818
>>   CustomizedGitHooks: yes
>
> Looks good.

Thanks for your review @PaulSandoz !

-

PR: https://git.openjdk.java.net/jdk/pull/3211


Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]

2021-04-02 Thread Xiaohong Gong
On Fri, 2 Apr 2021 09:59:31 GMT, Ningsheng Jian  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Move the changing to AbstractMask.andNot and revert changes in VectorMask
>>   
>>   Change-Id: Ie3ec8f53cb24526c8e1ccc68038355d024910818
>>   CustomizedGitHooks: yes
>
> LGTM

Thanks for the review @nsjian !

-

PR: https://git.openjdk.java.net/jdk/pull/3211


Integrated: 8264109: Add vectorized implementation for VectorMask.andNot()

2021-04-02 Thread Xiaohong Gong
On Fri, 26 Mar 2021 01:50:33 GMT, Xiaohong Gong  wrote:

> Currently `VectorMask.andNot()` is not vectorized:
> public VectorMask andNot(VectorMask m) {
> // FIXME: Generate good code here.
> return bOp(m, (i, a, b) -> a && !b);
> }
> This can be implemented with `and(m.not())` directly. Since 
> `VectorMask.and()/not()` have been vectorized in HotSpot, `andNot`
> can be vectorized as well by calling them.
> 
> The performance gain is more than 100% for a simple JMH benchmark such as:
>   @Benchmark
>   public Object andNot(Blackhole bh) {
> boolean[] mask = fm.apply(SPECIES.length());
> boolean[] r = fmt.apply(SPECIES.length());
> VectorMask rm = VectorMask.fromArray(SPECIES, r, 0);
> 
> for (int ic = 0; ic < INVOC_COUNT; ic++) {
>   for (int i = 0; i < mask.length; i += SPECIES.length()) {
> VectorMask vmask = VectorMask.fromArray(SPECIES, mask, i);
>   rm = rm.andNot(vmask);
>     }
> }
> return rm;
>   }

This pull request has now been integrated.

Changeset: 7d0a0bad
Author:    Xiaohong Gong
Committer: Ningsheng Jian 
URL:   https://git.openjdk.java.net/jdk/commit/7d0a0bad
Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod

8264109: Add vectorized implementation for VectorMask.andNot()

Reviewed-by: psandoz, njian

-

PR: https://git.openjdk.java.net/jdk/pull/3211


RFR: 8267969: Add vectorized implementation for VectorMask.eq()

2021-05-31 Thread Xiaohong Gong
Currently `VectorMask.eq()` is not vectorized:

public VectorMask eq(VectorMask m) {
// FIXME: Generate good code here.
return bOp(m, (i, a, b) -> a == b);
}

This can be implemented by calling `xor(m.not())` directly.

The performance improved by about 1.4x to 1.9x for the following benchmark with 
different basic types:

@Benchmark
public Object eq() {
boolean[] ma = fm.apply(size);
boolean[] mb = fmb.apply(size);
boolean[] mt = fmt.apply(size);
VectorMask m = VectorMask.fromArray(SPECIES, mt, 0);

for (int ic = 0; ic < INVOC_COUNT; ic++) {
for (int i = 0; i < ma.length; i += SPECIES.length()) {
var av = SPECIES.loadMask(ma, i);
var bv = SPECIES.loadMask(mb, i);

// accumulate results, so JIT can't eliminate relevant computations
m = m.and(av.eq(bv));
}
}

return m;
}
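The `xor(m.not())` rewrite works because boolean equality is XNOR: XOR with the negated operand. A minimal scalar sketch (hypothetical class name, illustration only, not code from the patch):

```java
// Per-lane model: the old bOp lambda computes a == b; for booleans that is
// XNOR, which can be expressed as a ^ !b -- i.e. what xor(m.not()) computes.
public class EqIdentity {
    static boolean eqOld(boolean a, boolean b) { return a == b; } // old bOp lambda
    static boolean eqNew(boolean a, boolean b) { return a ^ !b; } // xor(not(b))

    public static void main(String[] args) {
        boolean[] vals = {false, true};
        for (boolean a : vals) {
            for (boolean b : vals) {
                if (eqOld(a, b) != eqNew(a, b)) {
                    throw new AssertionError("mismatch: a=" + a + " b=" + b);
                }
            }
        }
        System.out.println("a == b equals a ^ !b for all boolean lane values");
    }
}
```

Since `xor` and `not` are already vectorized, `eq` inherits their vectorized code paths for free.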

-

Commit messages:
 - 8267969: Add vectorized implementation for VectorMask.eq()

Changes: https://git.openjdk.java.net/jdk/pull/4272/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=4272&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8267969
  Stats: 254 lines in 32 files changed: 248 ins; 6 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4272.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4272/head:pull/4272

PR: https://git.openjdk.java.net/jdk/pull/4272


Re: RFR: 8267969: Add vectorized implementation for VectorMask.eq()

2021-06-01 Thread Xiaohong Gong
On Tue, 1 Jun 2021 16:29:58 GMT, Paul Sandoz  wrote:

> Looks good. Later we may want to consider pushing this down as an intrinsic, 
> perhaps reusing `VectorSupport.compare`.

Thanks for your review @PaulSandoz ! Yes, reusing `VectorSupport.compare` is an 
alternative way to do the vectorization.

-

PR: https://git.openjdk.java.net/jdk/pull/4272


Re: RFR: 8267969: Add vectorized implementation for VectorMask.eq() [v2]

2021-06-01 Thread Xiaohong Gong
> Currently `VectorMask.eq()` is not vectorized:
> 
> public VectorMask eq(VectorMask m) {
> // FIXME: Generate good code here.
> return bOp(m, (i, a, b) -> a == b);
> }
> 
> This can be implemented by calling `xor(m.not())` directly.
> 
> The performance improved by about 1.4x to 1.9x for the following benchmark with 
> different basic types:
> 
> @Benchmark
> public Object eq() {
> boolean[] ma = fm.apply(size);
> boolean[] mb = fmb.apply(size);
> boolean[] mt = fmt.apply(size);
> VectorMask m = VectorMask.fromArray(SPECIES, mt, 0);
> 
> for (int ic = 0; ic < INVOC_COUNT; ic++) {
> for (int i = 0; i < ma.length; i += SPECIES.length()) {
> var av = SPECIES.loadMask(ma, i);
> var bv = SPECIES.loadMask(mb, i);
> 
> // accumulate results, so JIT can't eliminate relevant 
> computations
> m = m.and(av.eq(bv));
> }
> }
> 
> return m;
> }

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains two additional 
commits since the last revision:

 - Merge branch 'jdk:master' into JDK-8267969
 - 8267969: Add vectorized implementation for VectorMask.eq()

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4272/files
  - new: https://git.openjdk.java.net/jdk/pull/4272/files/0e8e0e84..f3a48026

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4272&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4272&range=00-01

  Stats: 22530 lines in 577 files changed: 2836 ins; 18744 del; 950 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4272.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4272/head:pull/4272

PR: https://git.openjdk.java.net/jdk/pull/4272


Integrated: 8267969: Add vectorized implementation for VectorMask.eq()

2021-06-02 Thread Xiaohong Gong
On Mon, 31 May 2021 10:25:26 GMT, Xiaohong Gong  wrote:

> Currently `VectorMask.eq()` is not vectorized:
> 
> public VectorMask eq(VectorMask m) {
> // FIXME: Generate good code here.
> return bOp(m, (i, a, b) -> a == b);
> }
> 
> This can be implemented by calling `xor(m.not())` directly.
> 
> The performance improved by about 1.4x to 1.9x for the following benchmark with 
> different basic types:
> 
> @Benchmark
> public Object eq() {
> boolean[] ma = fm.apply(size);
> boolean[] mb = fmb.apply(size);
> boolean[] mt = fmt.apply(size);
> VectorMask m = VectorMask.fromArray(SPECIES, mt, 0);
> 
> for (int ic = 0; ic < INVOC_COUNT; ic++) {
> for (int i = 0; i < ma.length; i += SPECIES.length()) {
> var av = SPECIES.loadMask(ma, i);
> var bv = SPECIES.loadMask(mb, i);
> 
> // accumulate results, so JIT can't eliminate relevant 
> computations
> m = m.and(av.eq(bv));
> }
> }
> 
> return m;
> }

This pull request has now been integrated.

Changeset: 496fb90b
Author:    Xiaohong Gong
Committer: Nils Eliasson 
URL:   https://git.openjdk.java.net/jdk/commit/496fb90b3a038f5fe76acc8a98bcd0dac4071cf9
Stats: 254 lines in 32 files changed: 248 ins; 6 del; 0 mod

8267969: Add vectorized implementation for VectorMask.eq()

Reviewed-by: psandoz, neliasso

-

PR: https://git.openjdk.java.net/jdk/pull/4272


Re: RFR: 8267969: Add vectorized implementation for VectorMask.eq() [v2]

2021-06-02 Thread Xiaohong Gong
On Wed, 2 Jun 2021 07:48:47 GMT, Nils Eliasson  wrote:

> Please wait until you have two reviewers before integrating.

Sure! Thanks so much for looking at this PR!

-

PR: https://git.openjdk.java.net/jdk/pull/4272


Re: RFR: 8268151: Vector API toShuffle optimization

2021-06-02 Thread Xiaohong Gong
On Thu, 3 Jun 2021 00:29:00 GMT, Sandhya Viswanathan  
wrote:

> The Vector API toShuffle method can be optimized  using existing vector 
> conversion intrinsic.
> 
> The following changes are made:
> 1) vector.toShuffle java implementation is changed to call 
> VectorSupport.convert.
> 2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp 
> is changed to allow shuffle as a destination type.
> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
> vectorIntrinsics.cpp now explicitly generates conversion node instead of 
> performing conversion during unbox. This is to remove unnecessary boxing 
> during back to back vector.toShuffle and shuffle.toVector calls. 
> 
> Best Regards,
> Sandhya

src/hotspot/share/opto/vectornode.cpp line 1246:

> 1244:   return new VectorLoadMaskNode(value, out_vt);
> 1245: } else if (is_vector_shuffle) {
> 1246:   if (!is_shuffle_to_vector()) {

Hi @sviswa7 , thanks for this change! I'm just curious whether 
`is_shuffle_to_vector()` is still needed for `VectorUnboxNode` with this 
change? It seems this flag can be removed, can't it?

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization

2021-06-02 Thread Xiaohong Gong
On Thu, 3 Jun 2021 00:29:00 GMT, Sandhya Viswanathan  
wrote:

> The Vector API toShuffle method can be optimized  using existing vector 
> conversion intrinsic.
> 
> The following changes are made:
> 1) vector.toShuffle java implementation is changed to call 
> VectorSupport.convert.
> 2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp 
> is changed to allow shuffle as a destination type.
> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
> vectorIntrinsics.cpp now explicitly generates conversion node instead of 
> performing conversion during unbox. This is to remove unnecessary boxing 
> during back to back vector.toShuffle and shuffle.toVector calls. 
> 
> Best Regards,
> Sandhya

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 335:

> 333: @ForceInline
> 334: private final
> 335: VectorShuffle toShuffleTemplate(AbstractSpecies dsp) {

Is it better to move this template method to the superclass, as with the other APIs?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 350:

> 348:  Byte128Shuffle.class, byte.class, 
> VLENGTH,
> 349:  this, VSPECIES,
> 350:  Byte128Vector::toShuffleTemplate);

ditto

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization [v2]

2021-06-03 Thread Xiaohong Gong
On Thu, 3 Jun 2021 21:43:19 GMT, Sandhya Viswanathan  
wrote:

>> The Vector API toShuffle method can be optimized  using existing vector 
>> conversion intrinsic.
>> 
>> The following changes are made:
>> 1) vector.toShuffle java implementation is changed to call 
>> VectorSupport.convert.
>> 2) The conversion intrinsic (inline_vector_convert()) in 
>> vectorIntrinsics.cpp is changed to allow shuffle as a destination type.
>> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
>> vectorIntrinsics.cpp now explicitly generates conversion node instead of 
>> performing conversion during unbox. This is to remove unnecessary boxing 
>> during back to back vector.toShuffle and shuffle.toVector calls. 
>> 
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Implement review comments

Looks good to me. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization [v2]

2021-06-03 Thread Xiaohong Gong
On Thu, 3 Jun 2021 18:40:09 GMT, Sandhya Viswanathan  
wrote:

>> src/hotspot/share/opto/vectornode.cpp line 1246:
>> 
>>> 1244:   return new VectorLoadMaskNode(value, out_vt);
>>> 1245: } else if (is_vector_shuffle) {
>>> 1246:   if (!is_shuffle_to_vector()) {
>> 
>> Hi @sviswa7 , thanks for this change! I'm just curious whether 
>> `is_shuffle_to_vector()` is still needed for `VectorUnboxNode` with this 
>> change? It seems this flag can be removed, doesn't it?
>
> @XiaohongGong is_shuffle_to_vector is still needed as we shouldn't generate 
> VectorLoadShuffleNode for shuffle.toVector.

OK, got it. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/4326


RFR: 8282432: Optimize masked "test" Vector API with predicate feature

2022-03-01 Thread Xiaohong Gong
The vector `test` API is implemented with vector `compare`, and the masked 
`test` is implemented as `test(op).and(m)`, which ultimately means 
`compare().and(m)`. Since the masked vector `compare` has been 
optimized with predicated instructions for architectures that support the 
feature, the masked `test` can also be optimized by implementing it with the 
predicated compare. This saves the additional `and` on architectures 
that support the predicate feature.

This patch optimizes the masked `test` by implementing it with the masked 
`compare`. With this patch, the following code for the `IS_NEGATIVE` op 
for a DoubleVector with SVE:

  mov z19.d, #0
  cmpgt   p1.d, p7/z, z19.d, z17.d
  and p0.b, p7/z, p1.b, p0.b

are optimized to:

  mov z19.d, #0
  cmpgt   p0.d, p0/z, z19.d, z17.d

Also update the jtreg tests for the masked `test` to make sure they are hot 
enough to be compiled by C2.
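The equivalence the patch relies on can be modelled per lane in plain Java. This is an illustrative sketch with hypothetical names, not the actual HotSpot transformation: computing the full test and AND-ing with the mask afterwards yields the same lanes as comparing only the active lanes, since inactive lanes are false either way.

```java
// Per-lane model of masked "test" using IS_NEGATIVE (a < 0) as the op:
// old form: test(op).and(m); new form: mask-predicated compare.
public class MaskedTestModel {
    static boolean testThenAnd(double a, boolean m) { return (a < 0) && m; } // test(op).and(m)
    static boolean predicated(double a, boolean m)  { return m && (a < 0); } // predicated compare

    public static void main(String[] args) {
        double[] lanes = {-1.5, 0.0, -0.0, 2.5, Double.NaN};
        boolean[] mask = {true, false, true, true, false};
        for (int i = 0; i < lanes.length; i++) {
            if (testThenAnd(lanes[i], mask[i]) != predicated(lanes[i], mask[i])) {
                throw new AssertionError("lane " + i);
            }
        }
        System.out.println("test(op).and(m) matches the predicated compare on every lane");
    }
}
```

In hardware the predicated form folds the mask into the compare itself (the `cmpgt p0.d, p0/z, ...` above), so the separate `and` instruction disappears.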

-

Commit messages:
 - 8282432: Optimize masked "test" Vector API with predicate feature

Changes: https://git.openjdk.java.net/jdk/pull/7654/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7654&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8282432
  Stats: 1233 lines in 70 files changed: 590 ins; 7 del; 636 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7654.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7654/head:pull/7654

PR: https://git.openjdk.java.net/jdk/pull/7654


Re: RFR: 8282432: Optimize masked "test" Vector API with predicate feature [v2]

2022-03-03 Thread Xiaohong Gong
> The vector `test` API is implemented with vector `compare`, and the masked 
> `test` is implemented as `test(op).and(m)`, which ultimately means 
> `compare().and(m)`. Since the masked vector `compare` has been 
> optimized with predicated instructions for architectures that support the 
> feature, the masked `test` can also be optimized by implementing it with 
> the predicated compare. This saves the additional `and` on architectures 
> that support the predicate feature.
> 
> This patch optimizes the masked `test` by implementing it with the masked 
> `compare`. With this patch, the following code for the `IS_NEGATIVE` op 
> for a DoubleVector with SVE:
> 
>   mov z19.d, #0
>   cmpgt   p1.d, p7/z, z19.d, z17.d
>   and p0.b, p7/z, p1.b, p0.b
> 
> are optimized to:
> 
>   mov z19.d, #0
>   cmpgt   p0.d, p0/z, z19.d, z17.d
> 
> Also update the jtreg tests for the masked `test` to make sure they are hot 
> enough to be compiled by C2.

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Simply the testTemplate

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7654/files
  - new: https://git.openjdk.java.net/jdk/pull/7654/files/41a9d56f..c90d2f49

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7654&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7654&range=00-01

  Stats: 197 lines in 49 files changed: 13 ins; 33 del; 151 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7654.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7654/head:pull/7654

PR: https://git.openjdk.java.net/jdk/pull/7654


Re: RFR: 8282432: Optimize masked "test" Vector API with predicate feature

2022-03-03 Thread Xiaohong Gong
On Wed, 2 Mar 2022 02:50:27 GMT, Xiaohong Gong  wrote:

> The vector `test` API is implemented with vector `compare`, and the masked 
> `test` is implemented as `test(op).and(m)`, which ultimately means 
> `compare().and(m)`. Since the masked vector `compare` has been 
> optimized with predicated instructions for architectures that support the 
> feature, the masked `test` can also be optimized by implementing it with 
> the predicated compare. This saves the additional `and` on architectures 
> that support the predicate feature.
> 
> This patch optimizes the masked `test` by implementing it with the masked 
> `compare`. With this patch, the following code for the `IS_NEGATIVE` op 
> for a DoubleVector with SVE:
> 
>   mov z19.d, #0
>   cmpgt   p1.d, p7/z, z19.d, z17.d
>   and p0.b, p7/z, p1.b, p0.b
> 
> are optimized to:
> 
>   mov z19.d, #0
>   cmpgt   p0.d, p0/z, z19.d, z17.d
> 
> Also update the jtreg tests for the masked `test` to make sure they are hot 
> enough to be compiled by C2.

Hi @PaulSandoz , thanks for your review! All the comments have been addressed. 
Could you please take a look at it again? Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/7654


Re: RFR: 8282432: Optimize masked "test" Vector API with predicate feature [v2]

2022-03-06 Thread Xiaohong Gong
On Thu, 3 Mar 2022 17:40:13 GMT, Paul Sandoz  wrote:

> I guess the following: `mask.cast(IntVector.species(shape()))` is more 
> efficient than: `m.cast(vspecies().asIntegral())` ?

Yeah, that's one point. Another main reason is that 
`m.cast(vspecies().asIntegral())` cannot be handled well in the superclass 
due to Java generics issues.

Thanks for the review @PaulSandoz !

-

PR: https://git.openjdk.java.net/jdk/pull/7654


Integrated: 8282432: Optimize masked "test" Vector API with predicate feature

2022-03-08 Thread Xiaohong Gong
On Wed, 2 Mar 2022 02:50:27 GMT, Xiaohong Gong  wrote:

> The vector `test` API is implemented with vector `compare`, and the masked 
> `test` is implemented as `test(op).and(m)`, which ultimately means 
> `compare().and(m)`. Since the masked vector `compare` has been 
> optimized with predicated instructions for architectures that support the 
> feature, the masked `test` can also be optimized by implementing it with 
> the predicated compare. This saves the additional `and` on architectures 
> that support the predicate feature.
> 
> This patch optimizes the masked `test` by implementing it with the masked 
> `compare`. With this patch, the following code for the `IS_NEGATIVE` op 
> for a DoubleVector with SVE:
> 
>   mov z19.d, #0
>   cmpgt   p1.d, p7/z, z19.d, z17.d
>   and p0.b, p7/z, p1.b, p0.b
> 
> are optimized to:
> 
>   mov z19.d, #0
>   cmpgt   p0.d, p0/z, z19.d, z17.d
> 
> Also update the jtreg tests for the masked `test` to make sure they are hot 
> enough to be compiled by C2.

This pull request has now been integrated.

Changeset: 12693a6c
Author:    Xiaohong Gong
Committer: Ningsheng Jian 
URL:   https://git.openjdk.java.net/jdk/commit/12693a6cf3e00c38a635d888419406f689657813
Stats: 1325 lines in 70 files changed: 574 ins; 11 del; 740 mod

8282432: Optimize masked "test" Vector API with predicate feature

Reviewed-by: psandoz

-

PR: https://git.openjdk.java.net/jdk/pull/7654


Re: RFR: 8282162: [vector] Optimize vector negation API

2022-03-14 Thread Xiaohong Gong
On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong  wrote:

> The current vector `NEG` is implemented by subtracting the vector from zero 
> in case the architecture does not support a negation instruction. And to 
> fit the predicate feature for architectures that support it, the masked 
> vector `NEG` is implemented with the pattern `v.not(m).add(1, m)`. Both 
> can be optimized to a single negation instruction for ARM SVE, and so can 
> the non-masked `NEG` for NEON. Besides, implementing the masked 
> `NEG` with subtraction for architectures that support neither a negation 
> instruction nor the predicate feature can also save several instructions over 
> the current pattern.
> 
> To optimize the Vector API negation, this patch moves the implementation from 
> the Java side to HotSpot. The compiler will generate different nodes according 
> to the architecture:
>   - Generate the (predicated) negation node if the architecture supports it; 
> otherwise, generate the `zero.sub(v)` pattern for the non-masked operation.
>   - Generate `zero.sub(v, m)` for the masked operation if the architecture does 
> not have the predicate feature; otherwise generate the original pattern 
> `v.xor(-1, m).add(1, m)`.
> 
> So with this patch, the following transformations are applied:
> 
> For non-masked negation with NEON:
> 
>   movi    v16.4s, #0x0
>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
> 
> and with SVE:
> 
>   mov z16.s, #0
>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
> 
> For masked negation with NEON:
> 
>   movi    v17.4s, #0x1
>   mvn v19.16b, v18.16b
>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>   add v19.4s, v20.4s, v17.4s
>   mov v18.16b, v16.16b
>   bsl v18.16b, v19.16b, v20.16b
> 
> and with SVE:
> 
>   mov z16.s, #-1
>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>   eor z18.s, p0/m, z18.s, z16.s
>   add z18.s, p0/m, z18.s, z17.s
> 
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
> machines (note that the non-masked negation benchmarks do not show any 
> improvement on x86 since no instructions are changed):
> 
> NEON:
> BenchmarkGain
> Byte128Vector.NEG1.029
> Byte128Vector.NEGMasked  1.757
> Short128Vector.NEG   1.041
> Short128Vector.NEGMasked 1.659
> Int128Vector.NEG 1.005
> Int128Vector.NEGMasked   1.513
> Long128Vector.NEG1.003
> Long128Vector.NEGMasked  1.878
> 
> SVE with 512-bits:
> BenchmarkGain
> ByteMaxVector.NEG1.10
> ByteMaxVector.NEGMasked  1.165
> ShortMaxVector.NEG   1.056
> ShortMaxVector.NEGMasked 1.195
> IntMaxVector.NEG 1.002
> IntMaxVector.NEGMasked   1.239
> LongMaxVector.NEG1.031
> LongMaxVector.NEGMasked  1.191
> 
> X86 (non AVX-512):
> BenchmarkGain
> ByteMaxVector.NEGMasked  1.254
> ShortMaxVector.NEGMasked 1.359
> IntMaxVector.NEGMasked   1.431
> LongMaxVector.NEGMasked  1.989
> 
> [1] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
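Both lowering patterns quoted above agree with two's-complement integer negation, including the wrap-around case at `Integer.MIN_VALUE`. A small scalar check (hypothetical class name, not code from the patch):

```java
// Scalar check of the two integer patterns the compiler chooses between:
// zero.sub(v) and v.xor(-1).add(1) both equal two's-complement negation.
public class NegPatterns {
    static int negBySub(int v)    { return 0 - v; }        // zero.sub(v)
    static int negByNotAdd(int v) { return (v ^ -1) + 1; } // v.xor(-1).add(1)

    public static void main(String[] args) {
        int[] vals = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int v : vals) {
            if (negBySub(v) != -v || negByNotAdd(v) != -v) {
                throw new AssertionError("mismatch at " + v);
            }
        }
        System.out.println("both patterns equal -v for all test values");
    }
}
```

This is why the compiler is free to pick whichever pattern the target architecture executes most cheaply, or a single `neg` instruction where one exists.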

Hi, could anyone please take a look at this PR? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API

2022-03-14 Thread Xiaohong Gong
On Tue, 15 Mar 2022 02:39:42 GMT, Joe Darcy  wrote:

> Note that in terms of Java semantics, negation of floating point values needs 
> to be implemented as subtraction from negative zero rather than positive zero:
> 
> double negate(double arg) {return -0.0 - arg; }
> 
> This is to handle signed zeros correctly.

Hi @jddarcy , thanks for looking at this PR and thanks for the notes on 
floating point negation! Yeah, this really makes sense to me. Kindly note that 
this patch didn't touch the negation of floating point values. For the Vector 
API, vector floating point negation has been intrinsified to a `NegVF/D` node 
by the compiler, so we directly generate the negation instructions for them. 
Thanks!
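The signed-zero point raised above can be checked with a few lines of plain Java (hypothetical class name, illustration only): subtracting from `-0.0` negates the sign of zero correctly, while subtracting from `+0.0` loses it, because `0.0 - 0.0` is `+0.0` under IEEE 754.

```java
// Demonstrates why floating point negation must be -0.0 - arg, not 0.0 - arg.
public class SignedZeroNegate {
    static double negate(double arg)      { return -0.0 - arg; } // handles signed zeros
    static double negateNaive(double arg) { return  0.0 - arg; } // wrong for arg == +0.0

    public static void main(String[] args) {
        long minusZeroBits = Double.doubleToRawLongBits(-0.0);
        long plusZeroBits  = Double.doubleToRawLongBits(0.0);
        // negate(0.0) must produce -0.0 ...
        if (Double.doubleToRawLongBits(negate(0.0)) != minusZeroBits) {
            throw new AssertionError("negate(0.0) should be -0.0");
        }
        // ... but 0.0 - 0.0 is +0.0, so the naive form loses the sign.
        if (Double.doubleToRawLongBits(negateNaive(0.0)) != plusZeroBits) {
            throw new AssertionError("0.0 - 0.0 is +0.0, not -0.0");
        }
        System.out.println("negation via -0.0 - arg preserves signed zeros");
    }
}
```

The raw-bits comparison is needed because `0.0 == -0.0` is true for `double` comparison, so only the bit patterns distinguish the two zeros.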

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API

2022-03-20 Thread Xiaohong Gong
On Sat, 19 Mar 2022 02:34:55 GMT, Jie Fu  wrote:

>> The current vector `NEG` is implemented by subtracting the vector from zero 
>> in case the architecture does not support a negation instruction. And to 
>> fit the predicate feature for architectures that support it, the masked 
>> vector `NEG` is implemented with the pattern `v.not(m).add(1, m)`. Both 
>> can be optimized to a single negation instruction for ARM SVE, and so can 
>> the non-masked `NEG` for NEON. Besides, implementing the masked 
>> `NEG` with subtraction for architectures that support neither a negation 
>> instruction nor the predicate feature can also save several instructions 
>> over the current pattern.
>> 
>> To optimize the Vector API negation, this patch moves the implementation from 
>> the Java side to HotSpot. The compiler will generate different nodes according 
>> to the architecture:
>>   - Generate the (predicated) negation node if the architecture supports it; 
>> otherwise, generate the `zero.sub(v)` pattern for the non-masked operation.
>>   - Generate `zero.sub(v, m)` for the masked operation if the architecture 
>> does not have the predicate feature; otherwise generate the original pattern 
>> `v.xor(-1, m).add(1, m)`.
>> 
>> So with this patch, the following transformations are applied:
>> 
>> For non-masked negation with NEON:
>> 
>>   movi    v16.4s, #0x0
>>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
>> 
>> and with SVE:
>> 
>>   mov z16.s, #0
>>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
>> 
>> For masked negation with NEON:
>> 
>>   movi    v17.4s, #0x1
>>   mvn v19.16b, v18.16b
>>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>>   add v19.4s, v20.4s, v17.4s
>>   mov v18.16b, v16.16b
>>   bsl v18.16b, v19.16b, v20.16b
>> 
>> and with SVE:
>> 
>>   mov z16.s, #-1
>>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>>   eor z18.s, p0/m, z18.s, z16.s
>>   add z18.s, p0/m, z18.s, z17.s
>> 
>> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
>> machines (note that the non-masked negation benchmarks do not show any 
>> improvement on x86 since no instructions are changed):
>> 
>> NEON:
>> BenchmarkGain
>> Byte128Vector.NEG1.029
>> Byte128Vector.NEGMasked  1.757
>> Short128Vector.NEG   1.041
>> Short128Vector.NEGMasked 1.659
>> Int128Vector.NEG 1.005
>> Int128Vector.NEGMasked   1.513
>> Long128Vector.NEG1.003
>> Long128Vector.NEGMasked  1.878
>> 
>> SVE with 512-bits:
>> BenchmarkGain
>> ByteMaxVector.NEG1.10
>> ByteMaxVector.NEGMasked  1.165
>> ShortMaxVector.NEG   1.056
>> ShortMaxVector.NEGMasked 1.195
>> IntMaxVector.NEG 1.002
>> IntMaxVector.NEGMasked   1.239
>> LongMaxVector.NEG1.031
>> LongMaxVector.NEGMasked  1.191
>> 
>> X86 (non AVX-512):
>> BenchmarkGain
>> ByteMaxVector.NEGMasked  1.254
>> ShortMaxVector.NEGMasked 1.359
>> IntMaxVector.NEGMasked   1.431
>> LongMaxVector.NEGMasked  1.989
>> 
>> [1] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
>> [2] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
>
> src/hotspot/share/opto/vectornode.cpp line 141:
> 
>> 139:   case T_BYTE:
>> 140:   case T_SHORT:
>> 141:   case T_INT: return Op_NegVI;
> 
> Why not add `Op_NegVB` for `BYTE` and `Op_NegVS` for `SHORT`?
> Is there any performance drop for byte/short negation operation if both of 
> them are handled as a NegVI vector?

The compiler can get the real type info from `Op_NegVI`, which can also handle 
the `BYTE` and `SHORT` basic types. I just don't want to add more new IR nodes, 
which would also need more match rules in the ad files.

> Is there any performance drop for byte/short negation operation if both of 
> them are handled as a NegVI vector?

From the benchmark results I showed in the commit message, I didn't see any 
performance drop for byte/short. Thanks!

> src/hotspot/share/opto/vectornode.cpp line 1635:
> 
>> 1633: }
>> 1634: 
>> 1635: Node* NegVINode::Ideal(PhaseGVN* phase, bool can_reshape) {
> 
> Much duplication in `NegVINode::Ideal` and `NegVLNode::Ideal`.
> Is it possible to refactor the implementation?

Yeah, maybe we need a superclass for `NegVINode` and `NegVLNode`. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API

2022-03-22 Thread Xiaohong Gong
On Sat, 19 Mar 2022 03:11:12 GMT, Jie Fu  wrote:

>>> Note that in terms of Java semantics, negation of floating point values 
>>> needs to be implemented as subtraction from negative zero rather than 
>>> positive zero:
>>> 
>>> double negate(double arg) {return -0.0 - arg; }
>>> 
>>> This is to handle signed zeros correctly.
>> 
>> Hi @jddarcy ,thanks for looking at this PR and thanks for the notes on the 
>> floating point negation! Yeah, this really makes sense to me. Kindly note 
>> that this patch didn't touch the negation of the floating point values. For 
>> Vector API, the vector floating point negation has been intrinsified to 
>> `NegVF/D` node by compiler that we directly generate the negation 
>> instructions for them. Thanks!
>
>> Note that in terms of Java semantics, negation of floating point values 
>> needs to be implemented as subtraction from negative zero rather than 
>> positive zero:
>> 
>> double negate(double arg) {return -0.0 - arg; }
>> 
>> This is to handle signed zeros correctly.
> 
> This seems easy to be broken by an opt enhancement.
> Just wondering do we have a jtreg test for this point? @jddarcy 
> Thanks.

Hi @DamonFool , thanks for your review! All the comments have been addressed. 
Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-22 Thread Xiaohong Gong
> The current vector `"NEG"` is implemented by subtracting the vector from zero 
> in case the architecture does not support a negation instruction. And to fit 
> the predicate feature for architectures that support it, the masked vector 
> `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be 
> optimized to a single negation instruction for ARM SVE, and so can the 
> non-masked "NEG" for NEON. Besides, implementing the masked "NEG" with 
> subtraction for architectures that support neither a negation instruction 
> nor the predicate feature also saves several instructions compared with the 
> current pattern.
> 
> To optimize the VectorAPI negation, this patch moves the implementation from 
> Java side to hotspot. The compiler will generate different nodes according to 
> the architecture:
>   - Generate the (predicated) negation node if architecture supports it, 
> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture does 
> not have predicate feature, otherwise generate the original pattern 
> `"v.xor(-1, m).add(1, m)"`.
> 
> So with this patch, the following transformations are applied:
> 
> For non-masked negation with NEON:
> 
>   movi    v16.4s, #0x0
>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
> 
> and with SVE:
> 
>   mov z16.s, #0
>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
> 
> For masked negation with NEON:
> 
>   movi    v17.4s, #0x1
>   mvn v19.16b, v18.16b
>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>   add v19.4s, v20.4s, v17.4s
>   mov v18.16b, v16.16b
>   bsl v18.16b, v19.16b, v20.16b
> 
> and with SVE:
> 
>   mov z16.s, #-1
>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>   eor z18.s, p0/m, z18.s, z16.s
>   add z18.s, p0/m, z18.s, z17.s
> 
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
> machines(note that the non-masked negation benchmarks do not have any 
> improvement on X86 since no instructions are changed):
> 
> NEON:
> Benchmark                   Gain
> Byte128Vector.NEG           1.029
> Byte128Vector.NEGMasked     1.757
> Short128Vector.NEG          1.041
> Short128Vector.NEGMasked    1.659
> Int128Vector.NEG            1.005
> Int128Vector.NEGMasked      1.513
> Long128Vector.NEG           1.003
> Long128Vector.NEGMasked     1.878
> 
> SVE with 512-bits:
> Benchmark                   Gain
> ByteMaxVector.NEG           1.10
> ByteMaxVector.NEGMasked     1.165
> ShortMaxVector.NEG          1.056
> ShortMaxVector.NEGMasked    1.195
> IntMaxVector.NEG            1.002
> IntMaxVector.NEGMasked      1.239
> LongMaxVector.NEG           1.031
> LongMaxVector.NEGMasked     1.191
> 
> X86 (non AVX-512):
> Benchmark                   Gain
> ByteMaxVector.NEGMasked     1.254
> ShortMaxVector.NEGMasked    1.359
> IntMaxVector.NEGMasked      1.431
> LongMaxVector.NEGMasked     1.989
> 
> [1] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Add a superclass for vector negation

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7782/files
  - new: https://git.openjdk.java.net/jdk/pull/7782/files/828866f8..97c8119a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=00-01

  Stats: 64 lines in 4 files changed: 16 ins; 13 del; 35 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7782.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-23 Thread Xiaohong Gong
On Tue, 22 Mar 2022 09:58:23 GMT, Xiaohong Gong  wrote:

>> The current vector `"NEG"` is implemented by subtracting the vector from zero 
>> in case the architecture does not support a negation instruction. And to fit 
>> the predicate feature for architectures that support it, the masked vector 
>> `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be 
>> optimized to a single negation instruction for ARM SVE, and so can the 
>> non-masked "NEG" for NEON. Besides, implementing the masked "NEG" with 
>> subtraction for architectures that support neither a negation instruction 
>> nor the predicate feature also saves several instructions compared with the 
>> current pattern.
>> 
>> To optimize the VectorAPI negation, this patch moves the implementation from 
>> Java side to hotspot. The compiler will generate different nodes according 
>> to the architecture:
>>   - Generate the (predicated) negation node if architecture supports it, 
>> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture 
>> does not have predicate feature, otherwise generate the original pattern 
>> `"v.xor(-1, m).add(1, m)"`.
>> 
>> So with this patch, the following transformations are applied:
>> 
>> For non-masked negation with NEON:
>> 
>>   movi    v16.4s, #0x0
>>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
>> 
>> and with SVE:
>> 
>>   mov z16.s, #0
>>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
>> 
>> For masked negation with NEON:
>> 
>>   movi    v17.4s, #0x1
>>   mvn v19.16b, v18.16b
>>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>>   add v19.4s, v20.4s, v17.4s
>>   mov v18.16b, v16.16b
>>   bsl v18.16b, v19.16b, v20.16b
>> 
>> and with SVE:
>> 
>>   mov z16.s, #-1
>>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>>   eor z18.s, p0/m, z18.s, z16.s
>>   add z18.s, p0/m, z18.s, z17.s
>> 
>> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
>> machines(note that the non-masked negation benchmarks do not have any 
>> improvement on X86 since no instructions are changed):
>> 
>> NEON:
>> Benchmark                   Gain
>> Byte128Vector.NEG           1.029
>> Byte128Vector.NEGMasked     1.757
>> Short128Vector.NEG          1.041
>> Short128Vector.NEGMasked    1.659
>> Int128Vector.NEG            1.005
>> Int128Vector.NEGMasked      1.513
>> Long128Vector.NEG           1.003
>> Long128Vector.NEGMasked     1.878
>> 
>> SVE with 512-bits:
>> Benchmark                   Gain
>> ByteMaxVector.NEG           1.10
>> ByteMaxVector.NEGMasked     1.165
>> ShortMaxVector.NEG          1.056
>> ShortMaxVector.NEGMasked    1.195
>> IntMaxVector.NEG            1.002
>> IntMaxVector.NEGMasked      1.239
>> LongMaxVector.NEG           1.031
>> LongMaxVector.NEGMasked     1.191
>> 
>> X86 (non AVX-512):
>> Benchmark                   Gain
>> ByteMaxVector.NEGMasked     1.254
>> ShortMaxVector.NEGMasked    1.359
>> IntMaxVector.NEGMasked      1.431
>> LongMaxVector.NEGMasked     1.989
>> 
>> [1] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
>> [2] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Add a superclass for vector negation

Hi @PaulSandoz @jatin-bhateja, could you please help to take a look at this PR? 
Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-28 Thread Xiaohong Gong
On Mon, 28 Mar 2022 07:40:48 GMT, Jie Fu  wrote:

>> The compiler can still get the real element type from the vector type of an 
>> `Op_NegVI` node, so it can also handle the `BYTE` and `SHORT` basic types. I 
>> just don't want to add more new IR nodes, which would also need more match 
>> rules in the ad files.
>> 
>>> Is there any performance drop for byte/short negation operation if both of 
>>> them are handled as a NegVI vector?
>> 
>> From the benchmark results I showed in the commit message, I didn't see any 
>> performance drop for byte/short. Thanks!
>
>> The compiler can still get the real element type from the vector type of an 
>> `Op_NegVI` node, so it can also handle the `BYTE` and `SHORT` basic types. I 
>> just don't want to add more new IR nodes, which would also need more match 
>> rules in the ad files.
>> 
>> > Is there any performance drop for byte/short negation operation if both of 
>> > them are handled as a NegVI vector?
>> 
>> From the benchmark results I showed in the commit message, I didn't see any 
>> performance drop for byte/short. Thanks!
> 
> There seems no vectorized negation instructions for {byte, short, int, long} 
> on x86, so this should be fine on x86.
> I tested the patch on x86 and the performance number looks good.

Thanks for doing this! Yeah, I think the performance for masked negation 
operations might improve on non-AVX-512 systems.
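As a side note on why these rewrites are interchangeable for integral types: 
`v.xor(-1, m).add(1, m)`, `zero.sub(v, m)`, and a dedicated negation all 
compute the same two's-complement lane values, including at 
`Integer.MIN_VALUE`. A scalar sketch of the identity in plain Java (the class 
name is made up for illustration):

```java
public class NegPatternDemo {
    public static void main(String[] args) {
        int[] lanes = {0, 1, -1, 7, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : lanes) {
            int xorAdd = (x ^ -1) + 1;  // the masked pattern: xor(-1) then add(1)
            int zeroSub = 0 - x;        // the fallback pattern: subtract from zero
            int neg = -x;               // what a dedicated neg instruction computes
            if (xorAdd != neg || zeroSub != neg) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println("all three lowerings agree");
    }
}
```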

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-28 Thread Xiaohong Gong
On Mon, 28 Mar 2022 07:43:29 GMT, Jie Fu  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Add a superclass for vector negation
>
> src/hotspot/share/opto/vectornode.cpp line 1592:
> 
>> 1590: 
>> 1591: // Generate other vector nodes to implement the masked/non-masked 
>> vector negation.
>> 1592: Node* VectorNode::degenerate_vector_integral_negate(Node* n, int vlen, 
>> BasicType bt, PhaseGVN* phase, bool is_predicated) {
> 
> Shall we move this declaration in `class  NegVNode` since it is only used by  
> NegVNode::Ideal ?

I think it can be. Thanks for pointing this out. I will change this later.

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize integral vector negation API [v3]

2022-03-28 Thread Xiaohong Gong
> The current vector `"NEG"` is implemented by subtracting the vector from zero 
> in case the architecture does not support a negation instruction. And to fit 
> the predicate feature for architectures that support it, the masked vector 
> `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be 
> optimized to a single negation instruction for ARM SVE, and so can the 
> non-masked "NEG" for NEON. Besides, implementing the masked "NEG" with 
> subtraction for architectures that support neither a negation instruction 
> nor the predicate feature also saves several instructions compared with the 
> current pattern.
> 
> To optimize the VectorAPI negation, this patch moves the implementation from 
> Java side to hotspot. The compiler will generate different nodes according to 
> the architecture:
>   - Generate the (predicated) negation node if architecture supports it, 
> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture does 
> not have predicate feature, otherwise generate the original pattern 
> `"v.xor(-1, m).add(1, m)"`.
> 
> So with this patch, the following transformations are applied:
> 
> For non-masked negation with NEON:
> 
>   movi    v16.4s, #0x0
>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
> 
> and with SVE:
> 
>   mov z16.s, #0
>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
> 
> For masked negation with NEON:
> 
>   movi    v17.4s, #0x1
>   mvn v19.16b, v18.16b
>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>   add v19.4s, v20.4s, v17.4s
>   mov v18.16b, v16.16b
>   bsl v18.16b, v19.16b, v20.16b
> 
> and with SVE:
> 
>   mov z16.s, #-1
>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>   eor z18.s, p0/m, z18.s, z16.s
>   add z18.s, p0/m, z18.s, z17.s
> 
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
> machines(note that the non-masked negation benchmarks do not have any 
> improvement on X86 since no instructions are changed):
> 
> NEON:
> Benchmark                   Gain
> Byte128Vector.NEG           1.029
> Byte128Vector.NEGMasked     1.757
> Short128Vector.NEG          1.041
> Short128Vector.NEGMasked    1.659
> Int128Vector.NEG            1.005
> Int128Vector.NEGMasked      1.513
> Long128Vector.NEG           1.003
> Long128Vector.NEGMasked     1.878
> 
> SVE with 512-bits:
> Benchmark                   Gain
> ByteMaxVector.NEG           1.10
> ByteMaxVector.NEGMasked     1.165
> ShortMaxVector.NEG          1.056
> ShortMaxVector.NEGMasked    1.195
> IntMaxVector.NEG            1.002
> IntMaxVector.NEGMasked      1.239
> LongMaxVector.NEG           1.031
> LongMaxVector.NEGMasked     1.191
> 
> X86 (non AVX-512):
> Benchmark                   Gain
> ByteMaxVector.NEGMasked     1.254
> ShortMaxVector.NEGMasked    1.359
> IntMaxVector.NEGMasked      1.431
> LongMaxVector.NEGMasked     1.989
> 
> [1] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Make "degenerate_vector_integral_negate" to be "NegVI" private

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7782/files
  - new: https://git.openjdk.java.net/jdk/pull/7782/files/97c8119a..48f4d6be

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=01-02

  Stats: 15 lines in 2 files changed: 6 ins; 1 del; 8 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7782.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize integral vector negation API [v3]

2022-03-28 Thread Xiaohong Gong
On Tue, 29 Mar 2022 05:05:43 GMT, Jie Fu  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Make "degenerate_vector_integral_negate" to be "NegVI" private
>
> Note: I didn't check the aarch64 code change.

Thanks for the review @DamonFool !

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize integral vector negation API [v3]

2022-03-29 Thread Xiaohong Gong
On Tue, 29 Mar 2022 18:05:56 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Make "degenerate_vector_integral_negate" to be "NegVI" private
>
> Java changes are good.

Thanks for the review @PaulSandoz @nsjian !

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Integrated: 8282162: [vector] Optimize integral vector negation API

2022-03-29 Thread Xiaohong Gong
On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong  wrote:

> The current vector `"NEG"` is implemented by subtracting the vector from zero 
> in case the architecture does not support a negation instruction. And to fit 
> the predicate feature for architectures that support it, the masked vector 
> `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be 
> optimized to a single negation instruction for ARM SVE, and so can the 
> non-masked "NEG" for NEON. Besides, implementing the masked "NEG" with 
> subtraction for architectures that support neither a negation instruction 
> nor the predicate feature also saves several instructions compared with the 
> current pattern.
> 
> To optimize the VectorAPI negation, this patch moves the implementation from 
> Java side to hotspot. The compiler will generate different nodes according to 
> the architecture:
>   - Generate the (predicated) negation node if architecture supports it, 
> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture does 
> not have predicate feature, otherwise generate the original pattern 
> `"v.xor(-1, m).add(1, m)"`.
> 
> So with this patch, the following transformations are applied:
> 
> For non-masked negation with NEON:
> 
>   movi    v16.4s, #0x0
>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
> 
> and with SVE:
> 
>   mov z16.s, #0
>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
> 
> For masked negation with NEON:
> 
>   movi    v17.4s, #0x1
>   mvn v19.16b, v18.16b
>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>   add v19.4s, v20.4s, v17.4s
>   mov v18.16b, v16.16b
>   bsl v18.16b, v19.16b, v20.16b
> 
> and with SVE:
> 
>   mov z16.s, #-1
>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>   eor z18.s, p0/m, z18.s, z16.s
>   add z18.s, p0/m, z18.s, z17.s
> 
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
> machines(note that the non-masked negation benchmarks do not have any 
> improvement on X86 since no instructions are changed):
> 
> NEON:
> Benchmark                   Gain
> Byte128Vector.NEG           1.029
> Byte128Vector.NEGMasked     1.757
> Short128Vector.NEG          1.041
> Short128Vector.NEGMasked    1.659
> Int128Vector.NEG            1.005
> Int128Vector.NEGMasked      1.513
> Long128Vector.NEG           1.003
> Long128Vector.NEGMasked     1.878
> 
> SVE with 512-bits:
> Benchmark                   Gain
> ByteMaxVector.NEG           1.10
> ByteMaxVector.NEGMasked     1.165
> ShortMaxVector.NEG           1.056
> ShortMaxVector.NEGMasked    1.195
> IntMaxVector.NEG            1.002
> IntMaxVector.NEGMasked      1.239
> LongMaxVector.NEG           1.031
> LongMaxVector.NEGMasked     1.191
> 
> X86 (non AVX-512):
> Benchmark                   Gain
> ByteMaxVector.NEGMasked     1.254
> ShortMaxVector.NEGMasked    1.359
> IntMaxVector.NEGMasked      1.431
> LongMaxVector.NEGMasked     1.989
> 
> [1] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896

This pull request has now been integrated.

Changeset: d0668568
Author:Xiaohong Gong 
Committer: Jie Fu 
URL:   
https://git.openjdk.java.net/jdk/commit/d06685680c17583d56dc3d788d9a2ecea8812bc8
Stats: 325 lines in 15 files changed: 275 ins; 25 del; 25 mod

8282162: [vector] Optimize integral vector negation API

Reviewed-by: jiefu, psandoz, njian

-

PR: https://git.openjdk.java.net/jdk/pull/7782


RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-03-30 Thread Xiaohong Gong
Currently the masked vector load, when the given index is out of the array 
boundary, is implemented with pure Java scalar code to avoid an IOOBE 
(IndexOutOfBoundsException). This is necessary for architectures that do not 
support the predicate feature, because there the masked load is implemented 
with a full vector load followed by a vector blend, and the full vector load 
would raise the IOOBE, which is not valid. However, for architectures that 
support the predicate feature, like SVE/AVX-512/RVV, it can be vectorized with 
a predicated load instruction as long as the indexes of the masked lanes are 
within the bounds of the array. For these architectures, loading the unmasked 
lanes does not raise an exception.

This patch adds the vectorization support for the masked load with IOOBE part. 
Please see the original java implementation (FIXME: optimize):


    @ForceInline
    public static
    ByteVector fromArray(VectorSpecies<Byte> species,
                         byte[] a, int offset,
                         VectorMask<Byte> m) {
        ByteSpecies vsp = (ByteSpecies) species;
        if (offset >= 0 && offset <= (a.length - species.length())) {
            return vsp.dummyVector().fromArray0(a, offset, m);
        }

        // FIXME: optimize
        checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
        return vsp.vOp(m, i -> a[offset + i]);
    }

Since it can only be vectorized with the predicated load, HotSpot must check 
whether the current backend supports it and fall back to the Java scalar 
version if not. This is different from the normal masked vector load, for 
which the compiler will generate a full vector load and a vector blend if the 
predicated load is not supported. So to let the compiler take the expected 
action, an additional flag (i.e. `usePred`) is added to the existing 
"loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" 
for the normal load. The compiler will fail to intrinsify if the flag is 
"true" and the predicated load is not supported by the backend, which means 
the normal Java path will be executed.
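The dispatch above can be modeled with a scalar sketch (plain Java, not the 
actual Vector API implementation; `VLEN`, the class name, and the 
zero-filling of unmasked lanes are illustrative assumptions): the in-range 
check guarantees every lane of a full-width access is legal, while the 
out-of-range path may only touch, and must range-check, the masked lanes.

```java
import java.util.Arrays;

public class MaskedLoadDemo {
    static final int VLEN = 4; // hypothetical species length

    // Scalar model of the two paths the intrinsic chooses between.
    static int[] fromArray(int[] a, int offset, boolean[] m) {
        int[] r = new int[VLEN];
        if (offset >= 0 && offset <= a.length - VLEN) {
            // In-range path: every lane is within bounds, so a full-width
            // load plus blend (or a predicated load) is safe.
            for (int i = 0; i < VLEN; i++) {
                r[i] = m[i] ? a[offset + i] : 0;
            }
        } else {
            // IOOBE path: only the masked lanes may be touched, and each one
            // must be range-checked (mirroring checkMaskFromIndexSize).
            for (int i = 0; i < VLEN; i++) {
                if (!m[i]) continue;
                int idx = offset + i;
                if (idx < 0 || idx >= a.length) {
                    throw new IndexOutOfBoundsException("lane " + i);
                }
                r[i] = a[idx];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        boolean[] m = {true, true, false, false};
        System.out.println(Arrays.toString(fromArray(a, 1, m))); // prints [2, 3, 0, 0]
        System.out.println(Arrays.toString(fromArray(a, 2, m))); // prints [3, 4, 0, 0]
    }
}
```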

Also adds the same vectorization support for masked:
 - fromByteArray/fromByteBuffer
 - fromBooleanArray
 - fromCharArray

The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
on an x86 AVX-512 system:

Benchmark  before   After  Units
LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms

Similar performance gain can also be observed on 512-bit SVE system.

-

Commit messages:
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8283667
  Stats: 821 lines in 43 files changed: 314 ins; 117 del; 390 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-05 Thread Xiaohong Gong
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundsException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
> on an x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Hi @PaulSandoz @jatin-bhateja @sviswa7, could you please help to check this PR? 
Any feedback is welcome! Thanks a lot!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-19 Thread Xiaohong Gong
On Mon, 11 Apr 2022 09:04:36 GMT, Jatin Bhateja  wrote:

>> The optimization for masked store is recorded to: 
>> https://bugs.openjdk.java.net/browse/JDK-8284050
>
>> The blend should be with the intended-to-store vector, so that masked lanes 
>> contain the need-to-store elements and unmasked lanes contain the loaded 
>> elements, which would be stored back, which results in unchanged values.
> 
> It may not work if the memory is beyond the legal accessible address space of 
> the process; a corner case could be a page boundary. Thus re-composing an 
> intermediate vector which partially contains the actual updates but 
> effectively performs a full vector write to the destination address may not 
> work in all scenarios.

Thanks for the comment! So how about adding a check for the valid array range, 
like the masked vector load does?
Code like:

public final
void intoArray(byte[] a, int offset,
               VectorMask<Byte> m) {
    if (m.allTrue()) {
        intoArray(a, offset);
    } else {
        ByteSpecies vsp = vspecies();
        if (offset >= 0 && offset <= (a.length - vsp.length())) { // a full range check
            intoArray0(a, offset, m, /* usePred */ false);  // can be vectorized by load + blend + store
        } else {
            checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
            intoArray0(a, offset, m, /* usePred */ true);   // can only be vectorized by the predicated store
        }
    }
}

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-19 Thread Xiaohong Gong
On Sat, 9 Apr 2022 00:10:40 GMT, Sandhya Viswanathan  
wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundsException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
>> on an x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
> line 2861:
> 
>> 2859: ByteSpecies vsp = (ByteSpecies) species;
>> 2860: if (offset >= 0 && offset <= (a.length - 
>> species.vectorByteSize())) {
>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* 
>> usePred */ false).maybeSwap(bo);
> 
> Instead of usePred, a term like inRange or offsetInRange or 
> offsetInVectorRange would be easier to follow.

Thanks for the review. I will change it later.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-22 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundsException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>byte[] a, int offset,
>VectorMask<Byte> m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.
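
The scalar fallback quoted above can be modeled in plain Java. This is a hedged sketch of the semantics only (the method name `maskedLoad` and the `int[]` element type are illustrative, not the Vector API): set mask lanes are read and may throw `IndexOutOfBoundsException`, while unset lanes are skipped, which is exactly why a predicated hardware load is legal as long as all set lanes are in bounds.

```java
import java.util.Arrays;

public class MaskedLoadSketch {
    // Scalar model of a masked load: only lanes with a set mask bit are
    // read, so out-of-bounds indexes are legal as long as they are unmasked.
    static int[] maskedLoad(int[] a, int offset, boolean[] m) {
        int[] r = new int[m.length];           // unset lanes default to 0
        for (int i = 0; i < m.length; i++) {
            if (m[i]) {
                r[i] = a[offset + i];          // throws IOOBE if out of range
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        // Tail-loop case: offset 2 with a 4-lane "species"; the last two
        // lanes are masked off, so no exception is thrown.
        System.out.println(Arrays.toString(
                maskedLoad(a, 2, new boolean[]{true, true, false, false})));
        // prints [3, 4, 0, 0]
    }
}
```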

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Rename the "usePred" to "offsetInRange"

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/8f9e8a3c..9b2d2f19

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=00-01

  Stats: 393 lines in 41 files changed: 0 ins; 0 del; 393 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-22 Thread Xiaohong Gong
On Wed, 20 Apr 2022 02:46:09 GMT, Xiaohong Gong  wrote:

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
>> line 2861:
>> 
>>> 2859: ByteSpecies vsp = (ByteSpecies) species;
>>> 2860: if (offset >= 0 && offset <= (a.length - 
>>> species.vectorByteSize())) {
>>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* 
>>> usePred */ false).maybeSwap(bo);
>> 
>> Instead of usePred a term like inRange or offsetInRange or 
>> offsetInVectorRange would be easier to follow.
>
> Thanks for the review. I will change it later.

The name is updated to `offsetInRange`. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Fri, 29 Apr 2022 21:34:13 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename the "usePred" to "offsetInRange"
>
> IIUC when the hardware does not support predicated loads then any false 
> `offsetInRange` value causes the load intrinsic to fail resulting in the 
> fallback, so it would not be materially any different to the current 
> behavior, just more uniformly implemented.
> 
> Why can't the intrinsic support passing a boolean directly? Is it 
> something to do with constants? If that is not possible I recommend creating 
> named constant values and pass those all the way through rather than 
> converting a boolean to an integer value. Then there is no need for a branch 
> checking `offsetInRange`.
> 
> Might be better to hold off until the JEP is integrated and then update, 
> since this will conflict (`byte[]` and `ByteBuffer` load methods are removed 
> and `MemorySegment` load methods are added). You could prepare for that now 
> by branching off `vectorIntrinsics`.

Thanks for your comments @PaulSandoz !

> IIUC when the hardware does not support predicated loads then any false 
> `offsetInRange` value causes the load intrinsic to fail resulting in the 
> fallback, so it would not be materially any different to the current 
> behavior, just more uniformly implemented.

Yes, it's true that this patch makes no difference on hardware that does not 
support predicated loads. It only benefits systems that support the predicate 
feature, like Arm SVE and x86 AVX-512.

> Why can't the intrinsic support passing a boolean directly? Is it 
> something to do with constants? If that is not possible I recommend creating 
> named constant values and pass those all the way through rather than 
> converting a boolean to an integer value. Then there is no need for a branch 
> checking offsetInRange.

Yeah, I agree that adding a branch checking `offsetInRange` is not good. But 
actually I ran into a constant issue: passing the value all the way through 
cannot guarantee that the compiler sees the argument as a constant at compile 
time. Do you have any better idea for fixing this?

> Might be better to hold off until the JEP is integrated and then update, 
> since this will conflict (byte[] and ByteBuffer load methods are removed and 
> MemorySegment load methods are added). You could prepare for that now by 
> branching off vectorIntrinsics.

Agree. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 5 May 2022 01:21:40 GMT, Paul Sandoz  wrote:

> > Yeah, I agree that it's not good by adding a branch checking for 
> > `offsetInRange`. But actually I met the constant issue that passing the 
> > values all the way cannot guarantee the argument a constant in compiler at 
> > the compile time. Do you have any better idea to fixing this?
> 
> That's odd, `boolean` constants are passed that are then converted to `int` 
> constants. Did you try passing integer constants all the way through?

I will try again. I remember the main cause is the call to `fromArray0` from 
`fromArray`; it is not annotated with `@ForceInline`. The argument might not be 
compiled to a constant in cases where the offset is not in the array range, 
such as the tail loop.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 5 May 2022 01:42:48 GMT, Xiaohong Gong  wrote:

> > > Yeah, I agree that it's not good by adding a branch checking for 
> > > `offsetInRange`. But actually I met the constant issue that passing the 
> > > values all the way cannot guarantee the argument a constant in compiler 
> > > at the compile time. Do you have any better idea to fixing this?
> > 
> > 
> > That's odd, `boolean` constants are passed that are then converted to `int` 
> > constants. Did you try passing integer constants all the way through?
> 
> I will try again. I remember the main cause is the calling of `fromArray0` 
> from `fromArray`, it is not annotated with `ForceInline`. The arguments might 
> not be compiled to a constant for cases that the offset is not in the array 
> range like tail loop.

I tried passing the integer constant all the way through, and unfortunately 
`offsetInRange` is still not compiled to a constant. The following assertion 
added in `vectorIntrinsics.cpp` fails:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
     } else {
       // Masked vector load with IOOBE always uses the predicated load.
       const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+      assert(offset_in_range->is_con(), "not a constant");
       if (!offset_in_range->is_con()) {
         if (C->print_intrinsics()) {
           tty->print_cr("  ** missing constant: offsetInRange=%s",

-

PR: https://git.openjdk.java.net/jdk/pull/8035


RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures

2022-05-04 Thread Xiaohong Gong
Currently the vectorization of masked vector store is implemented with the 
masked store instruction only on architectures that support the predicate 
feature. The compiler will fall back to the Java scalar code on architectures 
without predicate support, like Arm NEON. However, for these systems the masked 
store can be vectorized with a non-masked vector `"load + blend + store"` 
sequence. For example, storing a vector `"v"` controlled by a mask `"m"` into 
memory at address `"addr"` (i.e. `"store(addr, v, m)"`) can be implemented 
with:


 1) mem_v = load(addr) ; non-masked load from the same memory
 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
 3) store(addr, v) ; non-masked store into the memory
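
As a single-threaded illustration, the three steps above can be sketched in scalar Java (the name `maskedStore` and the `int[]` element type are illustrative, not the real implementation):

```java
import java.util.Arrays;

public class BlendStoreSketch {
    // "load + blend + store" emulation of store(addr, v, m):
    static void maskedStore(int[] a, int offset, int[] v, boolean[] m) {
        // 1) full non-masked load from the same memory
        int[] memV = Arrays.copyOfRange(a, offset, offset + v.length);
        // 2) blend: masked lanes take the source vector, unmasked lanes
        //    keep the values just loaded from memory
        for (int i = 0; i < v.length; i++) {
            if (m[i]) memV[i] = v[i];
        }
        // 3) full non-masked store back to the same memory
        System.arraycopy(memV, 0, a, offset, v.length);
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40};
        maskedStore(a, 1, new int[]{7, 8}, new boolean[]{true, false});
        System.out.println(Arrays.toString(a)); // prints [10, 7, 30, 40]
    }
}
```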


Since the initial full-vector load requires the array offset to be inside the 
valid array bounds, the compiler performs this vectorization only when the 
offset is within the array boundary, and still falls back to the Java scalar 
code when not all offsets are valid. Besides, the original per-lane offset 
check for masked lanes is now applied only when the offset is not guaranteed to 
be inside the array range. This also improves the performance of masked store 
when the offset is always valid. The whole process is similar to the masked 
load API.

Here is the performance data for the masked vector store benchmarks on an x86 
non-AVX-512 system, which improves by about `20x ~ 50x`:

Benchmark  beforeafter   Units
StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
StoreMaskedBenchmark.intStoreArrayMask 75.028   2027.557 ops/ms
StoreMaskedBenchmark.longStoreArrayMask40.929   1032.928 ops/ms
StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms

Similar performance gain can also be observed on ARM NEON system.

And here is the performance data on an x86 AVX-512 system, which improves by 
about `1.88x - 2.81x`:

Benchmark  before after   Units
StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
StoreMaskedBenchmark.intStoreArrayMask 4191.904  9300.428 ops/ms
StoreMaskedBenchmark.longStoreArrayMask2025.031  4604.504 ops/ms
StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms

Similar performance gain can also be observed on ARM SVE system.

-

Depends on: https://git.openjdk.java.net/jdk/pull/8035

Commit messages:
 - 8284050: [vectorapi] Optimize masked store for non-predicated architectures
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

Changes: https://git.openjdk.java.net/jdk/pull/8544/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8544&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8284050
  Stats: 1708 lines in 44 files changed: 710 ins; 188 del; 810 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8544.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8544/head:pull/8544

PR: https://git.openjdk.java.net/jdk/pull/8544


Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]

2022-05-04 Thread Xiaohong Gong
> Currently the vectorization of masked vector store is implemented by the 
> masked store instruction only on architectures that support the predicate 
> feature. The compiler will fall back to the java scalar code for 
> non-predicate supported architectures like ARM NEON. However, for these 
> systems, the masked store can be vectorized with the non-masked vector `"load 
> + blend + store"`. For example, storing a vector` "v"` controlled by a mask` 
> "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be 
> implemented with:
> 
> 
>  1) mem_v = load(addr) ; non-masked load from the same memory
>  2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
>  3) store(addr, v) ; non-masked store into the memory
> 
> 
> Since the first full loading needs the array offset must be inside of the 
> valid array bounds, we make the compiler do the vectorization only when the 
> offset is in range of the array boundary. And the compiler will still fall 
> back to the java scalar code if not all offsets are valid. Besides, the 
> original offset check for masked lanes are only applied when the offset is 
> not always inside of the array range. This also improves the performance for 
> masked store when the offset is always valid. The whole process is similar to 
> the masked load API.
> 
> Here is the performance data for the masked vector store benchmarks on a X86 
> non avx-512 system, which improves about `20x ~ 50x`:
> 
> Benchmark  beforeafter   Units
> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 75.028   2027.557 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask40.929   1032.928 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
> 
> Similar performance gain can also be observed on ARM NEON system.
> 
> And here is the performance data on X86 avx-512 system, which improves about 
> `1.88x - 2.81x`:
> 
> Benchmark  before after   Units
> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 4191.904  9300.428 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask2025.031  4604.504 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
> 
> Similar performance gain can also be observed on ARM SVE system.

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains one commit:

  8284050: [vectorapi] Optimize masked store for non-predicated architectures

-

Changes: https://git.openjdk.java.net/jdk/pull/8544/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8544&range=01
  Stats: 915 lines in 43 files changed: 405 ins; 80 del; 430 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8544.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8544/head:pull/8544

PR: https://git.openjdk.java.net/jdk/pull/8544


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 28 Apr 2022 00:13:49 GMT, Sandhya Viswanathan 
 wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename the "usePred" to "offsetInRange"
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:
> 
>> 1230:   // out when current case uses the predicate feature.
>> 1231:   if (!supports_predicate) {
>> 1232: bool use_predicate = false;
> 
> If we rename this to needs_predicate it will be easier to understand.

Thanks for the comment! This local variable will be removed after the similar 
intrinsification is added for the masked store. Please take a look at the PR 
https://github.com/openjdk/jdk/pull/8544. Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 31 Mar 2022 02:15:26 GMT, Quan Anh Mai  wrote:

>> I'm afraid not. "Load + Blend" makes the elements of unmasked lanes to be 
>> `0`. Then a full store may change the values in the unmasked memory to be 0, 
>> which is different with the mask store API definition.
>
> The blend should be with the intended-to-store vector, so that masked lanes 
> contain the need-to-store elements and unmasked lanes contain the loaded 
> elements, which would be stored back, which results in unchanged values.

Hi @merykitty @jatin-bhateja , could you please help review the similar masked 
store PR https://github.com/openjdk/jdk/pull/8544 ? Any feedback is welcome! 
Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 5 May 2022 02:27:03 GMT, John R Rose  wrote:

>> Xiaohong Gong has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains one commit:
>> 
>>   8284050: [vectorapi] Optimize masked store for non-predicated architectures
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1363:
> 
>> 1361:   // Use the vector blend to implement the masked store. The 
>> biased elements are the original
>> 1362:   // values in the memory.
>> 1363:   Node* mem_val = gvn().transform(LoadVectorNode::make(0, 
>> control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt));
> 
> I'm sorry to say it, but I am pretty sure this is an invalid optimization.
> See top-level comment for more details.

Thanks for your comments! Yeah, this actually breaks the Java Memory Model 
rules, which I failed to consider fully. I will try the scatter approach 
instead. Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8544


Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong  wrote:

>> Currently the vectorization of masked vector store is implemented by the 
>> masked store instruction only on architectures that support the predicate 
>> feature. The compiler will fall back to the java scalar code for 
>> non-predicate supported architectures like ARM NEON. However, for these 
>> systems, the masked store can be vectorized with the non-masked vector 
>> `"load + blend + store"`. For example, storing a vector` "v"` controlled by 
>> a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` 
>> can be implemented with:
>> 
>> 
>>  1) mem_v = load(addr) ; non-masked load from the same memory
>>  2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
>>  3) store(addr, v) ; non-masked store into the memory
>> 
>> 
>> Since the first full loading needs the array offset must be inside of the 
>> valid array bounds, we make the compiler do the vectorization only when the 
>> offset is in range of the array boundary. And the compiler will still fall 
>> back to the java scalar code if not all offsets are valid. Besides, the 
>> original offset check for masked lanes are only applied when the offset is 
>> not always inside of the array range. This also improves the performance for 
>> masked store when the offset is always valid. The whole process is similar 
>> to the masked load API.
>> 
>> Here is the performance data for the masked vector store benchmarks on a X86 
>> non avx-512 system, which improves about `20x ~ 50x`:
>> 
>> Benchmark  beforeafter   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 75.028   2027.557 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask40.929   1032.928 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
>> 
>> Similar performance gain can also be observed on ARM NEON system.
>> 
>> And here is the performance data on X86 avx-512 system, which improves about 
>> `1.88x - 2.81x`:
>> 
>> Benchmark  before after   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 4191.904  9300.428 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask2025.031  4604.504 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
>> 
>> Similar performance gain can also be observed on ARM SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains one commit:
> 
>   8284050: [vectorapi] Optimize masked store for non-predicated architectures

> _Mailing list message from [Hans Boehm](mailto:hbo...@google.com) on 
> [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_
> 
> Naive question: What happens if one of the vector elements that should not 
> have been updated is concurrently being written by another thread? Aren't you 
> generating writes to vector elements that should not have been written?
> 
> Hans
> 
> On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong  
> wrote:

Yeah, this is similar to the concern @rose00 mentioned above. The current 
solution cannot work correctly for multi-threaded programs. I will consider 
other, better solutions. Thanks for the comments!

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 3483:

> 3481: ByteSpecies vsp = vspecies();
> 3482: if (offset >= 0 && offset <= (a.length - vsp.length())) {
> 3483: intoBooleanArray0(a, offset, m, /* offsetInRange */ 
> true);

The offset check could skip the `checkMaskFromIndexSize` for cases where the 
offset is within the valid array bounds, which also improves the performance. 
@rose00 , do you think this part of the change is OK at least? Thanks!
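
The fast path being discussed can be sketched in scalar Java (illustrative names and element type; the real slow path calls `checkMaskFromIndexSize` before the per-lane writes):

```java
import java.util.Arrays;
import java.util.Objects;

public class StoreDispatchSketch {
    static void intoArray(int[] a, int offset, int[] v, boolean[] m) {
        if (offset >= 0 && offset <= a.length - v.length) {
            // fast path: the whole vector is in bounds, so no per-lane
            // range check is needed; only the mask gates the writes
            for (int i = 0; i < v.length; i++) {
                if (m[i]) a[offset + i] = v[i];
            }
        } else {
            // slow path: every set lane must be range-checked individually
            for (int i = 0; i < v.length; i++) {
                if (m[i]) a[Objects.checkIndex(offset + i, a.length)] = v[i];
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        intoArray(a, 2, new int[]{9, 9}, new boolean[]{true, true});
        System.out.println(Arrays.toString(a)); // prints [1, 2, 9, 9]
    }
}
```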

-

PR: https://git.openjdk.java.net/jdk/pull/8544


Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]

2022-05-05 Thread Xiaohong Gong
On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong  wrote:

>> Currently the vectorization of masked vector store is implemented by the 
>> masked store instruction only on architectures that support the predicate 
>> feature. The compiler will fall back to the java scalar code for 
>> non-predicate supported architectures like ARM NEON. However, for these 
>> systems, the masked store can be vectorized with the non-masked vector 
>> `"load + blend + store"`. For example, storing a vector` "v"` controlled by 
>> a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` 
>> can be implemented with:
>> 
>> 
>>  1) mem_v = load(addr) ; non-masked load from the same memory
>>  2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
>>  3) store(addr, v) ; non-masked store into the memory
>> 
>> 
>> Since the first full loading needs the array offset must be inside of the 
>> valid array bounds, we make the compiler do the vectorization only when the 
>> offset is in range of the array boundary. And the compiler will still fall 
>> back to the java scalar code if not all offsets are valid. Besides, the 
>> original offset check for masked lanes are only applied when the offset is 
>> not always inside of the array range. This also improves the performance for 
>> masked store when the offset is always valid. The whole process is similar 
>> to the masked load API.
>> 
>> Here is the performance data for the masked vector store benchmarks on a X86 
>> non avx-512 system, which improves about `20x ~ 50x`:
>> 
>> Benchmark  beforeafter   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 75.028   2027.557 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask40.929   1032.928 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
>> 
>> Similar performance gain can also be observed on ARM NEON system.
>> 
>> And here is the performance data on X86 avx-512 system, which improves about 
>> `1.88x - 2.81x`:
>> 
>> Benchmark  before after   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 4191.904  9300.428 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask2025.031  4604.504 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
>> 
>> Similar performance gain can also be observed on ARM SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains one commit:
> 
>   8284050: [vectorapi] Optimize masked store for non-predicated architectures

> _Mailing list message from [John Rose](mailto:john.r.r...@oracle.com) on 
> [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_
> 
> > On May 4, 2022, at 8:29 PM, Xiaohong Gong  wrote:
> > The offset check could save the `checkMaskFromIndexSize`  for cases that 
> > offset are in the valid array bounds, which also improves the performance. 
> > @rose00 , do you think this part of change is ok at least?
> 
> That part is ok, yes. I wish we could get the same effect with loop 
> optimizations but I don't know an easy way. The explicit check in the source 
> code gives the JIT a crutch but I hope we can figure out a way in the future 
> to integrate mask logic into range check elimination logic, making the 
> crutches unnecessary. For now it's fine.

Thanks! So I will separate this part out and fix it in another PR first. As for 
vectorizing the masked store with scatter or other ideas, I'm not quite sure 
whether they always pay off across architectures; they need more investigation. 
I prefer to close this PR for now. Thanks for all your comments!

-

PR: https://git.openjdk.java.net/jdk/pull/8544


Withdrawn: 8284050: [vectorapi] Optimize masked store for non-predicated architectures

2022-05-05 Thread Xiaohong Gong
On Thu, 5 May 2022 02:00:04 GMT, Xiaohong Gong  wrote:

> Currently the vectorization of masked vector store is implemented by the 
> masked store instruction only on architectures that support the predicate 
> feature. The compiler will fall back to the java scalar code for 
> non-predicate supported architectures like ARM NEON. However, for these 
> systems, the masked store can be vectorized with the non-masked vector `"load 
> + blend + store"`. For example, storing a vector` "v"` controlled by a mask` 
> "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be 
> implemented with:
> 
> 
>  1) mem_v = load(addr) ; non-masked load from the same memory
>  2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
>  3) store(addr, v) ; non-masked store into the memory
> 
> 
> Since the first full loading needs the array offset must be inside of the 
> valid array bounds, we make the compiler do the vectorization only when the 
> offset is in range of the array boundary. And the compiler will still fall 
> back to the java scalar code if not all offsets are valid. Besides, the 
> original offset check for masked lanes are only applied when the offset is 
> not always inside of the array range. This also improves the performance for 
> masked store when the offset is always valid. The whole process is similar to 
> the masked load API.
> 
> Here is the performance data for the masked vector store benchmarks on a X86 
> non avx-512 system, which improves about `20x ~ 50x`:
> 
> Benchmark  beforeafter   Units
> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 75.028   2027.557 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask40.929   1032.928 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
> 
> Similar performance gain can also be observed on ARM NEON system.
> 
> And here is the performance data on X86 avx-512 system, which improves about 
> `1.88x - 2.81x`:
> 
> Benchmark  before after   Units
> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 4191.904  9300.428 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask2025.031  4604.504 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
> 
> Similar performance gain can also be observed on ARM SVE system.

This pull request has been closed without being integrated.

-

PR: https://git.openjdk.java.net/jdk/pull/8544


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-05 Thread Xiaohong Gong
On Thu, 5 May 2022 02:14:08 GMT, Xiaohong Gong  wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:
>> 
>>> 1230:   // out when current case uses the predicate feature.
>>> 1231:   if (!supports_predicate) {
>>> 1232: bool use_predicate = false;
>> 
>> If we rename this to needs_predicate it will be easier to understand.
>
> Thanks for the comment! This local variable will be removed after adding the 
> similar intrinsify for store masked. Please help to see the PR 
> https://github.com/openjdk/jdk/pull/8544. Thanks so much!

Renamed to "needs_predicate". Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens to be out 
> of the array boundary is implemented with pure Java scalar code to avoid the 
> IOOBE (IndexOutOfBoundsException). This is necessary for architectures that 
> do not support the predicate feature, because the masked load is implemented 
> with a full vector load and a vector blend applied on it, and a full vector 
> load will definitely cause the IOOBE, which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise an exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>byte[] a, int offset,
>VectorMask<Byte> m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicated load, HotSpot must 
> check whether the current backend supports it and fall back to the Java 
> scalar version if not. This is different from the normal masked vector load, 
> for which the compiler will generate a full vector load and a vector blend if 
> the predicated load is not supported. So to let the compiler take the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part and 
> "false" for the normal load. The compiler will fail to intrinsify if the 
> flag is "true" and the predicated load is not supported by the backend, which 
> means the normal Java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on a 512-bit SVE system.
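
The bounds-dependent dispatch described above can be modelled with plain Java (a minimal sketch; `VLEN`, `fromArray`, `loadMasked`, and the `usePred` flag are simplified stand-ins for the real Vector API machinery, not the actual JDK code):

```java
public class MaskedLoadSketch {
    static final int VLEN = 4; // stand-in for species.length()

    // Stand-in for the intrinsified entry point. In the real patch, usePred == true
    // tells the compiler to intrinsify only when the backend has predicated loads;
    // here we just model the semantics with a scalar loop.
    static int[] loadMasked(int[] a, int offset, boolean[] mask, boolean usePred) {
        int[] r = new int[VLEN];
        for (int i = 0; i < VLEN; i++) {
            if (mask[i]) r[i] = a[offset + i]; // only masked-in lanes are touched
        }
        return r;
    }

    static int[] fromArray(int[] a, int offset, boolean[] mask) {
        if (offset >= 0 && offset <= a.length - VLEN) {
            // Whole vector in bounds: a full load + blend is always legal.
            return loadMasked(a, offset, mask, /* usePred */ false);
        }
        // IOOBE possible on unmasked lanes: check only the masked lanes, then
        // a predicated load (or the scalar fallback) is the only legal path.
        for (int i = 0; i < VLEN; i++) {
            if (mask[i] && (offset + i < 0 || offset + i >= a.length))
                throw new IndexOutOfBoundsException();
        }
        return loadMasked(a, offset, mask, /* usePred */ true);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6};
        boolean[] m = {true, true, false, false};
        // offset 4: the full 4-lane vector would run past a.length == 6,
        // but both masked-in lanes (indexes 4 and 5) are in bounds.
        int[] r = fromArray(a, 4, m);
        System.out.println(r[0] + " " + r[1]); // 5 6
    }
}
```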

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Rename "use_predicate" to "needs_predicate"

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/9b2d2f19..9c69206e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=01-02

  Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Xiaohong Gong
On Thu, 5 May 2022 19:27:47 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1238:
> 
>> 1236: } else {
>> 1237:   // Masked vector load with IOOBE always uses the predicated load.
>> 1238:   const TypeInt* offset_in_range = 
>> gvn().type(argument(8))->isa_int();
> 
> Should it be `argument(7)`? (and adjustments later to access the container).

I'm afraid it's `argument(8)` for the load operation, since `argument(7)` is 
the mask input. It seems the argument numbering goes wrong beginning from the 
mask input, which I expected to be `6`, but it isn't. Actually I don't quite 
understand why.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Xiaohong Gong
On Fri, 6 May 2022 04:22:30 GMT, Sandhya Viswanathan  
wrote:

>> I'm afraid it's `argument(8)` for the load operation, since `argument(7)` 
>> is the mask input. It seems the argument numbering goes wrong beginning from 
>> the mask input, which I expected to be `6`, but it isn't. Actually I don't 
>> quite understand why.
>
> offset is a long, so it uses two argument slots (5 and 6). 
> mask is argument(7).
> offsetInRange is argument(8).
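
The two-slot convention can be demonstrated with a small slot counter (a sketch; the parameter shape below is assumed from the discussion, not copied from the actual `VectorSupport.loadMasked` signature):

```java
public class SlotDemo {
    // JVM argument slots: long and double each occupy two slots,
    // every other type occupies one.
    static int slotOf(Class<?>[] params, int paramIndex) {
        int slot = 0;
        for (int i = 0; i < paramIndex; i++) {
            slot += (params[i] == long.class || params[i] == double.class) ? 2 : 1;
        }
        return slot;
    }

    public static void main(String[] args) {
        // Assumed shape: (vClass, mClass, eClass, length, base, offset, mask, offsetInRange)
        Class<?>[] p = { Class.class, Class.class, Class.class, int.class,
                         Object.class, long.class, Object.class, int.class };
        System.out.println(slotOf(p, 5)); // offset starts at slot 5 (uses 5 and 6)
        System.out.println(slotOf(p, 6)); // mask lands at slot 7
        System.out.println(slotOf(p, 7)); // offsetInRange lands at slot 8
    }
}
```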

That makes sense! Thanks for the explanation!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-06 Thread Xiaohong Gong
On Fri, 6 May 2022 14:59:26 GMT, Paul Sandoz  wrote:

>> That makes sense! Thanks for the explanation!
>
> Doh! Of course. This is not the first and will not be the last time I get 
> caught out by the 2-slot requirement.
> It may be useful to do this:
> 
> Node* mask_arg = is_store ? argument(8) : argument(7);

Yes, the mask argument is obtained like:

 Node* mask = unbox_vector(is_store ? argument(8) : argument(7), mbox_type, 
elem_bt, num_elem);

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-09 Thread Xiaohong Gong
On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz  wrote:

> I modified the code of this PR to avoid the conversion of `boolean` to `int`, 
> so a constant integer value is passed all the way through, and the masked 
> load is made intrinsic from the method at which the constants are passed as 
> arguments i.e. the public `fromArray` mask accepting method.

Great and thanks! Could you please show me the changes or an example? I can 
push the changes to this PR. Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store

2022-05-09 Thread Xiaohong Gong
Checking whether the indexes of masked lanes are inside of the valid memory 
boundary is necessary for masked vector memory access. However, this check can 
be skipped when the given offset guarantees that the whole vector fits in the 
array, which makes sure no IOOBE (IndexOutOfBoundsException) happens. The 
masked load APIs already skip this kind of check for the common cases, and this 
patch does the same optimization for the masked vector store.

The performance for the new added store masked benchmarks improves about `1.83x 
~ 2.62x` on a x86 system:

Benchmark   BeforeAfter Gain Units
StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms

Similar performance gain can also be observed on ARM hardware.
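
The check being skipped can be sketched as a small helper (a hedged sketch; the helper name and the scalar store loop are illustrative, not the actual template code):

```java
public class StoreCheckSketch {
    // True when a full vector store starting at offset stays inside the array,
    // in which case no per-lane index check is needed at all.
    static boolean offsetInRange(int offset, int vlen, int length) {
        return offset >= 0 && offset <= length - vlen;
    }

    static void storeMasked(int[] a, int offset, int[] v, boolean[] mask) {
        if (!offsetInRange(offset, v.length, a.length)) {
            // Only on the slow path: validate each masked-in lane's index.
            for (int i = 0; i < v.length; i++) {
                if (mask[i] && (offset + i < 0 || offset + i >= a.length))
                    throw new IndexOutOfBoundsException();
            }
        }
        for (int i = 0; i < v.length; i++) {
            if (mask[i]) a[offset + i] = v[i]; // unmasked lanes are untouched
        }
    }
}
```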

-

Commit messages:
 - 8286279: [vectorapi] Only check index of masked lanes if offset is out of 
array boundary for masked store

Changes: https://git.openjdk.java.net/jdk/pull/8620/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8286279
  Stats: 213 lines in 8 files changed: 188 ins; 0 del; 25 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8620.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8620/head:pull/8620

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-10 Thread Xiaohong Gong
On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> I modified the code of this PR to avoid the conversion of `boolean` to `int`, 
> so a constant integer value is passed all the way through, and the masked 
> load is made intrinsic from the method at which the constants are passed as 
> arguments i.e. the public `fromArray` mask accepting method.

Hi @PaulSandoz , thanks for the patch for the constant int parameter. I think 
the main change is:

-ByteVector fromArray0Template(Class<M> maskClass, C base, long offset, int index, M m, boolean offsetInRange,
+ByteVector fromArray0Template(Class<M> maskClass, C base, long offset, int index, M m, int offsetInRange,
  VectorSupport.LoadVectorMaskedOperation defaultImpl) {
  m.check(species());
  ByteSpecies vsp = vspecies();
-if (offsetInRange) {
-return VectorSupport.loadMasked(
-vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
-base, offset, m, /* offsetInRange */ 1,
-base, index, vsp, defaultImpl);
-} else {
-return VectorSupport.loadMasked(
-vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
-base, offset, m, /* offsetInRange */ 0,
-base, index, vsp, defaultImpl);
-}
+return VectorSupport.loadMasked(
+vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
+base, offset, m, offsetInRange == 1 ? 1 : 0,
+base, index, vsp, defaultImpl);
 }

which uses `offsetInRange == 1 ? 1 : 0`. Unfortunately this cannot always make 
sure `offsetInRange` is a constant at compile time. Again, this change could 
also make the following assertion fail randomly:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool 
LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
 } else {
   // Masked vector load with IOOBE always uses the predicated load.
   const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+  assert(offset_in_range->is_con(), "must be a constant");
   if (!offset_in_range->is_con()) {
 if (C->print_intrinsics()) {
   tty->print_cr("  ** missing constant: offsetInRange=%s",

Sometimes the compiler can prove it is a constant; I think this depends on the 
compiler's OSR and speculative optimizations. Did you try an example with IOOBE 
on non-predicated hardware?

Here is the main code of my unittest to reproduce the issue:

static final VectorSpecies<Integer> I_SPECIES = IntVector.SPECIES_128;
static final int LENGTH = 1026;
static final int vl = I_SPECIES.length();
public static int[] ia = new int[LENGTH];
public static int[] ib = new int[LENGTH];
public static int[] ic = new int[LENGTH];
public static boolean[] m = new boolean[vl];

private static void init() {
    for (int i = 0; i < LENGTH; i++) {
        ia[i] = i;
        ib[i] = 0;
    }

    for (int i = 0; i < 2; i++) {
        m[i] = i % 2 == 0;
    }
}

private static void func() {
    VectorMask<Integer> mask = VectorMask.fromArray(I_SPECIES, m, 0);
    for (int i = 0; i < LENGTH; i += vl) {
        IntVector av = IntVector.fromArray(I_SPECIES, ia, i, mask);
        av.lanewise(VectorOperators.ABS).intoArray(ic, i, mask);
    }
}

public static void main(String[] args) {
    init();
    for (int i = 0; i < 1; i++) {
        func();
    }
}

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store

2022-05-11 Thread Xiaohong Gong
On Tue, 10 May 2022 01:23:55 GMT, Xiaohong Gong  wrote:

> Checking whether the indexes of masked lanes are inside of the valid memory 
> boundary is necessary for masked vector memory access. However, this could be 
> saved if the given offset is inside of the vector range that could make sure 
> no IOOBE (IndexOutOfBoundsException) happens. The masked load APIs have 
> saved this kind of check for common cases. And this patch did the similar 
> optimization for the masked vector store.
> 
> The performance for the new added store masked benchmarks improves about 
> `1.83x ~ 2.62x` on a x86 system:
> 
> Benchmark   BeforeAfter Gain Units
> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
> 
> Similar performance gain can also be observed on ARM hardware.

Hi @PaulSandoz @rose00, could you please help to take a look at this simple PR? 
Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store

2022-05-11 Thread Xiaohong Gong
On Wed, 11 May 2022 15:10:55 GMT, Paul Sandoz  wrote:

>> Checking whether the indexes of masked lanes are inside of the valid memory 
>> boundary is necessary for masked vector memory access. However, this could 
>> be saved if the given offset is inside of the vector range that could make 
>> sure no IOOBE (IndexOutOfBoundsException) happens. The masked load APIs 
>> have saved this kind of check for common cases. And this patch did the 
>> similar optimization for the masked vector store.
>> 
>> The performance for the new added store masked benchmarks improves about 
>> `1.83x ~ 2.62x` on a x86 system:
>> 
>> Benchmark   BeforeAfter Gain Units
>> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
>> 
>> Similar performance gain can also be observed on ARM hardware.
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-Vector.java.template
>  line 4086:
> 
>> 4084: } else {
>> 4085: $Type$Species vsp = vspecies();
>> 4086: if (offset < 0 || offset > (a.length - vsp.length())) {
> 
> Can we use `VectorIntrinsics.checkFromIndexSize`? e.g. 
> 
> if (!VectorIntrinsics.checkFromIndexSize(offset, vsp.length(), a.length)) { 
> ...

Thanks for the review @PaulSandoz! I'm afraid 
`VectorIntrinsics.checkFromIndexSize` is not suitable to be used here, because 
the out-of-bounds exception will be thrown if the offset is not inside of the 
valid array boundary. For the masked operations this is not needed, since we 
only need to check the masked lanes. Please correct me if I didn't understand 
correctly. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-11 Thread Xiaohong Gong
On Wed, 11 May 2022 19:45:55 GMT, Paul Sandoz  wrote:

> I tried your test code with the patch and logged compilation 
> (`-XX:-TieredCompilation -XX:+PrintCompilation -XX:+PrintInlining 
> -XX:+PrintIntrinsics -Xbatch`)
> 
> For `func` the first call to `VectorSupport::loadMasked` is intrinsic and 
> inlined:
> 
> ```
> @ 45   jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes)   
> (intrinsic)
> ```
> 
> But the second call (for the last loop iteration) fails to inline:
> 
> ```
> @ 45   jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes)   
> failed to inline (intrinsic)
> ```
> 
> Since I am running on a MacBook this looks right and aligns with the 
> `-XX:+PrintIntrinsics` output:
> 
> ```
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** not supported: op=loadMasked vlen=8 etype=int using_byte_array=0
> ```
> 
> ?
> 
> I have not looked at the code gen nor measured performance comparing the case 
> when never out of bounds and only out of bounds for the last loop iteration.

Yeah, it looks right from the log. Did you try to check whether there is the 
log `** missing constant: offsetInRange=Parm` with `-XX:+PrintIntrinsics`? Or 
insert an assertion in `vectorIntrinsics.cpp` like:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool 
LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
 } else {
   // Masked vector load with IOOBE always uses the predicated load.
   const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+  assert(offset_in_range->is_con(), "must be a constant");
   if (!offset_in_range->is_con()) {
 if (C->print_intrinsics()) {
   tty->print_cr("  ** missing constant: offsetInRange=%s",

And run the tests in debug mode. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store

2022-05-11 Thread Xiaohong Gong
On Thu, 12 May 2022 03:36:31 GMT, Paul Sandoz  wrote:

>> Thanks for the review @PaulSandoz ! For the 
>> `VectorIntrinsics.checkFromIndexSize`, I'm afraid it's not suitable to be 
>> used here because the `outOfBounds` exception will be thrown if the offset 
>> is not inside of the valid array boundary. And for the masked operations, 
>> this is not needed since we only need to check the masked lanes. Please 
>> correct me if I didn't understand correctly. Thanks!
>
> Silly me! I commented too quickly, `checkFromIndexSize` cannot be used. My 
> intent was that we could use a method rather than repeating the expression in 
> numerous places (including masked loads IIRC).

Good idea! I will try. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store

2022-05-12 Thread Xiaohong Gong
On Thu, 12 May 2022 09:49:17 GMT, Quan Anh Mai  wrote:

> Maybe we could use `a.length - vsp.length() > 0 && offset u< a.length - 
> vsp.length()` which would hoist the first check outside of the loop. Thanks.

Thanks for the review @merykitty ! We need the check `offset >= 0` which I 
think is different from `a.length - vsp.length()`.

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-05-12 Thread Xiaohong Gong
> Checking whether the indexes of masked lanes are inside of the valid memory 
> boundary is necessary for masked vector memory access. However, this could be 
> saved if the given offset is inside of the vector range that could make sure 
> no IOOBE (IndexOutOfBoundsException) happens. The masked load APIs have 
> saved this kind of check for common cases. And this patch did the similar 
> optimization for the masked vector store.
> 
> The performance for the new added store masked benchmarks improves about 
> `1.83x ~ 2.62x` on a x86 system:
> 
> Benchmark   BeforeAfter Gain Units
> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
> 
> Similar performance gain can also be observed on ARM hardware.

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Wrap the offset check into a static method

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8620/files
  - new: https://git.openjdk.java.net/jdk/pull/8620/files/eb67f681..2229dd24

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=00-01

  Stats: 56 lines in 8 files changed: 5 ins; 0 del; 51 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8620.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8620/head:pull/8620

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-05-12 Thread Xiaohong Gong
On Thu, 12 May 2022 03:36:31 GMT, Paul Sandoz  wrote:

>> Thanks for the review @PaulSandoz ! For the 
>> `VectorIntrinsics.checkFromIndexSize`, I'm afraid it's not suitable to be 
>> used here because the `outOfBounds` exception will be thrown if the offset 
>> is not inside of the valid array boundary. And for the masked operations, 
>> this is not needed since we only need to check the masked lanes. Please 
>> correct me if I didn't understand correctly. Thanks!
>
> Silly me! I commented too quickly, `checkFromIndexSize` cannot be used. My 
> intent was that we could use a method rather than repeating the expression in 
> numerous places (including masked loads IIRC).

Hi @PaulSandoz, I've updated the offset check for masked load/store. Could you 
please help to check whether it is ok? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-12 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> ```
> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> ```
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

Makes sense to me! I think I understand what you mean. I will run more tests 
with the integer constant change. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-05-13 Thread Xiaohong Gong
On Fri, 13 May 2022 01:35:40 GMT, Xiaohong Gong  wrote:

>> Checking whether the indexes of masked lanes are inside of the valid memory 
>> boundary is necessary for masked vector memory access. However, this could 
>> be saved if the given offset is inside of the vector range that could make 
>> sure no IOOBE (IndexOutOfBoundsException) happens. The masked load APIs 
>> have saved this kind of check for common cases. And this patch did the 
>> similar optimization for the masked vector store.
>> 
>> The performance for the new added store masked benchmarks improves about 
>> `1.83x ~ 2.62x` on a x86 system:
>> 
>> Benchmark   BeforeAfter Gain Units
>> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
>> 
>> Similar performance gain can also be observed on ARM hardware.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Wrap the offset check into a static method

Thanks for the explanation! Yeah, the main problem is that Java doesn't have a 
direct unsigned comparison, so we need a function call. Of the two ways you 
provided, I think the second, `Integer.lessThanUnsigned`, looks better. But I'm 
not sure whether this could improve the performance a lot, although the first 
check `a.length - vsp.length() > 0` can be hoisted outside of the loop. And 
this might make the code more complex for me. Maybe we can do some research 
first to find a better implementation of the unsigned comparison. Do you think 
so?
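
For illustration, the fused range check can be sketched with `Integer.compareUnsigned` (a hedged sketch; `lessThanUnsigned` as discussed above is not a public JDK method, so this uses the public `compareUnsigned`, and the helper name is hypothetical):

```java
public class UnsignedRangeCheck {
    // Folds "offset >= 0 && offset <= length - vlen" into one unsigned
    // comparison: a negative offset becomes a huge unsigned value, so it
    // fails the compare, provided the bound itself is non-negative.
    static boolean inRange(int offset, int vlen, int length) {
        int bound = length - vlen;
        return bound >= 0 && Integer.compareUnsigned(offset, bound) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange(4, 4, 8));  // true: last full vector fits
        System.out.println(inRange(-1, 4, 8)); // false: negative offset rejected
    }
}
```

The `bound >= 0` test is loop-invariant, which is what would let a compiler hoist it outside the loop as suggested above.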

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-13 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> 
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

You are right @PaulSandoz! I ran the tests and benchmarks with your patch, and 
no failures or performance regressions were found. I will update the patch 
soon. Thanks for the help!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v4]

2022-05-13 Thread Xiaohong Gong
> Currently the vector load with a mask, when the given index falls out of the 
> array boundary, is implemented with pure Java scalar code to avoid the IOOBE 
> (IndexOutOfBoundsException). This is necessary for architectures that do 
> not support the predicate feature, because the masked load is implemented 
> with a full vector load and a vector blend applied on it, and the full vector 
> load would itself cause the IOOBE, which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading the unmasked lanes does not raise an exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original Java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicated load, HotSpot must 
> check whether the current backend supports it and fall back to the Java 
> scalar version if not. This is different from the normal masked vector load, 
> for which the compiler will generate a full vector load and a vector blend if 
> the predicated load is not supported. So to let the compiler take the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part and 
> "false" for the normal load. The compiler will fail to intrinsify if the 
> flag is "true" and the predicated load is not supported by the backend, which 
> means the normal Java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on a 512-bit SVE system.

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Use integer constant for offsetInRange all the way through

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/9c69206e..07edfbd5

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02-03

  Stats: 438 lines in 39 files changed: 33 ins; 118 del; 287 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6]

2022-05-19 Thread Xiaohong Gong
On Thu, 19 May 2022 08:53:31 GMT, Ningsheng Jian  wrote:

>>> LUT should be generated only if UsePopCountInstruction is false 
>> 
>> Should there be a `!UsePopCountInstruction` check then?
>> 
>>> restrict the scope of flag to only scalar popcount operation
>> 
>> Interesting. But AArch64 code does cover vector cases which just adds 
>> confusion.
>
>> Interesting. But AArch64 code does cover vector cases which just adds 
>> confusion.
> 
> `UsePopCountInstruction` is always true on AArch64. @XiaohongGong removed the 
> `predicate` in the aarch64 rules, and I think we can even remove the option 
> check in match_rule_supported().

OK, I will remove the check for it. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-05-29 Thread Xiaohong Gong
On Fri, 13 May 2022 01:35:40 GMT, Xiaohong Gong  wrote:

>> Checking whether the indexes of masked lanes are inside of the valid memory 
>> boundary is necessary for masked vector memory access. However, this could 
>> be saved if the given offset is inside of the vector range that could make 
>> sure no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs 
>> have saved this kind of check for common cases. And this patch did the 
>> similar optimization for the masked vector store.
>> 
>> The performance for the new added store masked benchmarks improves about 
>> `1.83x ~ 2.62x` on a x86 system:
>> 
>> Benchmark   BeforeAfter Gain Units
>> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
>> 
>> Similar performane gain can also be observed on ARM hardware.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Wrap the offset check into a static method

@PaulSandoz, could you please help to check whether the current version is ok 
for you? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-05-31 Thread Xiaohong Gong
On Mon, 30 May 2022 01:17:00 GMT, Xiaohong Gong  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Wrap the offset check into a static method
>
> @PaulSandoz, could you please help to check whether the current version is ok 
> for you? Thanks so much!

> @XiaohongGong looks good. Now that the Vector API JEP has been integrated 
> you will get a merge conflict, but it should be easy to resolve.

Great! I will resolve the conflict and update the patch, thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v3]

2022-06-01 Thread Xiaohong Gong
> Checking whether the indexes of masked lanes are inside the valid memory 
> boundary is necessary for masked vector memory access. However, this check 
> can be skipped if the given offset guarantees that the whole vector access 
> stays inside the array, so that no IOOBE (IndexOutOfBoundsException) can 
> occur. The masked load APIs already skip this kind of check for the common 
> cases, and this patch applies a similar optimization to the masked vector 
> store.
> 
> The performance of the newly added masked store benchmarks improves by 
> about `1.83x ~ 2.62x` on an x86 system:
> 
> Benchmark   BeforeAfter Gain Units
> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
> 
> Similar performance gains can also be observed on ARM hardware.

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains three commits:

 - Merge branch 'jdk:master' to JDK-8286279
 - Wrap the offset check into a static method
 - 8286279: [vectorapi] Only check index of masked lanes if offset is out of 
array boundary for masked store

-

Changes: https://git.openjdk.java.net/jdk/pull/8620/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=02
  Stats: 216 lines in 9 files changed: 179 ins; 0 del; 37 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8620.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8620/head:pull/8620

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-06-01 Thread Xiaohong Gong
On Tue, 31 May 2022 16:48:27 GMT, Paul Sandoz  wrote:

>> @PaulSandoz, could you please help to check whether the current version is 
>> ok for you? Thanks so much!
>
> @XiaohongGong looks good. Now that the Vector API JEP has been integrated 
> you will get a merge conflict, but it should be easy to resolve.

Hi @PaulSandoz , I updated the patch and resolved the conflicts. Could you 
please take a look at it again? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-01 Thread Xiaohong Gong
On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong  wrote:

>> Yes, the tests were run in debug mode. The reporting of the missing constant 
>> occurs for the compiled method that is called from the method where the 
>> constants are declared e.g.:
>> 
>> 719  240  b  jdk.incubator.vector.Int256Vector::fromArray0 (15 
>> bytes)
>>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
>> not support it
>>   ** missing constant: offsetInRange=Parm
>> @ 11   
>> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
>> by annotation
>> 
>> 
>> So it appears to be working as expected. A similar pattern occurs at a 
>> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
>> passes a constant class to `IntVector::fromArray0Template` (the compilation 
>> of which bails out before checking that the `offsetInRange` is constant).
>
> You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, 
> and no failure and performance regression are found. I will update the patch 
> soon. Thanks for the help!

> @XiaohongGong Could you please rebase the branch and resolve conflicts?

Sure, I'm working on this now. The patch will be updated soon. Thanks.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-01 Thread Xiaohong Gong
> Currently the masked vector load, when the given index falls outside the 
> array boundary, is implemented with pure Java scalar code to avoid the 
> IOOBE (IndexOutOfBoundsException). This is necessary for architectures that 
> do not support the predicate feature, because the masked load is 
> implemented with a full vector load followed by a vector blend, and the 
> full vector load would cause the IOOBE, which is not valid. However, for 
> architectures that support the predicate feature, like SVE/AVX-512/RVV, it 
> can be vectorized with the predicated load instruction as long as the 
> indexes of the masked lanes are within the bounds of the array. For these 
> architectures, loading with unmasked lanes does not raise an exception.
> 
> This patch adds vectorization support for the IOOBE part of the masked 
> load. Please see the original Java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>byte[] a, int offset,
>VectorMask<Byte> m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicated load, HotSpot must 
> check whether the current backend supports it and fall back to the Java 
> scalar version if not. This is different from the normal masked vector 
> load, where the compiler will generate a full vector load and a vector 
> blend if the predicated load is not supported. So to let the compiler take 
> the expected action, an additional flag (i.e. `usePred`) is added to the 
> existing "loadMasked" intrinsic, with the value "true" for the IOOBE part 
> and "false" for the normal load. The compiler will then fail to intrinsify 
> if the flag is "true" and the predicated load is not supported by the 
> backend, which means the normal Java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance of the newly added benchmarks improves by about `1.88x ~ 
> 30.26x` on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gains can also be observed on a 512-bit SVE system.
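The two execution paths quoted above can be sketched in plain scalar Java. The names and shapes here are illustrative, not the actual JDK internals; in the real code the fast path is the intrinsified `fromArray0`, not a scalar loop:

```java
import java.util.Arrays;

public class MaskedLoadSketch {
    // Fast path when the whole vector fits in the array; otherwise only the
    // set mask lanes are read, so an IndexOutOfBoundsException is raised
    // only if a *set* lane is actually out of bounds.
    static int[] maskedLoad(int[] a, int offset, boolean[] mask) {
        int lanes = mask.length;
        int[] r = new int[lanes];
        if (offset >= 0 && offset <= a.length - lanes) {
            // fast path: in the real code this is a single (predicated)
            // vector load, possibly followed by a blend
            for (int i = 0; i < lanes; i++) {
                r[i] = mask[i] ? a[offset + i] : 0;
            }
        } else {
            // slow path: per-lane bounds checks via scalar array accesses
            for (int i = 0; i < lanes; i++) {
                if (mask[i]) {
                    r[i] = a[offset + i];
                }
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        // offset 3 with 4 lanes crosses the end of the array, but the unset
        // tail lanes are never read, so no exception is thrown
        boolean[] m = {true, true, false, false};
        System.out.println(Arrays.toString(maskedLoad(a, 3, m))); // [4, 5, 0, 0]
    }
}
```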

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains five commits:

 - Merge branch 'jdk:master' into JDK-8283667
 - Use integer constant for offsetInRange all the way through
 - Rename "use_predicate" to "needs_predicate"
 - Rename the "usePred" to "offsetInRange"
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

-

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=04
  Stats: 447 lines in 43 files changed: 168 ins; 21 del; 258 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-01 Thread Xiaohong Gong
On Thu, 2 Jun 2022 01:49:10 GMT, Xiaohong Gong  wrote:

> > @XiaohongGong Could you please rebase the branch and resolve conflicts?
> 
> Sure, I'm working on this now. The patch will be updated soon. Thanks.

Resolved the conflicts. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-05 Thread Xiaohong Gong
On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong  wrote:

>>> @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
>
>> > @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
> 
> Resolved the conflicts. Thanks!

> @XiaohongGong You need one more review approval.

Sure! Thanks a lot for the review!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-05 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> 719  240  b  jdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> 
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

Hi @PaulSandoz , could you please take a look at this PR again? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]

2022-06-06 Thread Xiaohong Gong
On Tue, 31 May 2022 16:48:27 GMT, Paul Sandoz  wrote:

>> @PaulSandoz, could you please help to check whether the current version is 
>> ok for you? Thanks so much!
>
> @XiaohongGong looks good. Now that the Vector API JEP has been integrated 
> you will get a merge conflict, but it should be easy to resolve.

Thanks for the review @PaulSandoz @merykitty !

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Integrated: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store

2022-06-06 Thread Xiaohong Gong
On Tue, 10 May 2022 01:23:55 GMT, Xiaohong Gong  wrote:

> Checking whether the indexes of masked lanes are inside the valid memory 
> boundary is necessary for masked vector memory access. However, this check 
> can be skipped if the given offset guarantees that the whole vector access 
> stays inside the array, so that no IOOBE (IndexOutOfBoundsException) can 
> occur. The masked load APIs already skip this kind of check for the common 
> cases, and this patch applies a similar optimization to the masked vector 
> store.
> 
> The performance of the newly added masked store benchmarks improves by 
> about `1.83x ~ 2.62x` on an x86 system:
> 
> Benchmark   BeforeAfter Gain Units
> StoreMaskedBenchmark.byteStoreArrayMask   12757.936 23291.118  1.826 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1520.932  3921.616  2.578 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2713.031  7122.535  2.625 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask 4113.772  8220.206  1.998 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask1993.986  4874.148  2.444 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8543.593 17821.086  2.086 ops/ms
> 
> Similar performance gains can also be observed on ARM hardware.

This pull request has now been integrated.

Changeset: ef7cc210
Author: Xiaohong Gong
URL:   
https://git.openjdk.java.net/jdk/commit/ef7cc2105c66de443d3a9af706220272018a0d8d
Stats: 216 lines in 9 files changed: 179 ins; 0 del; 37 mod

8286279: [vectorapi] Only check index of masked lanes if offset is out of array 
boundary for masked store

Reviewed-by: psandoz

-

PR: https://git.openjdk.java.net/jdk/pull/8620


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Mon, 6 Jun 2022 10:40:45 GMT, Jatin Bhateja  wrote:

>> Xiaohong Gong has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains five commits:
>> 
>>  - Merge branch 'jdk:master' into JDK-8283667
>>  - Use integer constant for offsetInRange all the way through
>>  - Rename "use_predicate" to "needs_predicate"
>>  - Rename the "usePred" to "offsetInRange"
>>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
>> predicate feature
>
> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java
>  line 97:
> 
>> 95: public void byteLoadArrayMaskIOOBE() {
>> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
>> 97: VectorMask<Byte> mask = VectorMask.fromArray(bspecies, m, i);
> 
> For the other case, `if (offset >= 0 && offset <= (a.length - 
> species.length()))`, we are intrinsifying anyway; should we limit this 
> micro to work only for the newly optimized case?

Yeah, thanks, it's really a good suggestion to limit this benchmark to only 
the IOOBE cases. I locally modified the tests to make sure only the IOOBE 
case happens, and the results look good as well. But do you think it's better 
to keep it as it is, since we can also see the performance of the common 
cases and make sure no regressions happen? The current benchmarks can also 
show the performance gain from this PR.
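To make the trade-off concrete, here is a small stand-alone sketch (the array size and species length are assumptions, not values taken from the actual benchmark) showing that only the final iteration of such a loop exercises the IOOBE fallback, while every earlier iteration measures the common in-range path:

```java
import java.util.Arrays;

public class IoobeIterationSplit {
    // Counts how many iterations of `for (i = 0; i < inSize; i += lanes)`
    // take the in-range fast path (i <= inSize - lanes) versus the IOOBE
    // fallback path.
    static int[] split(int inSize, int lanes) {
        int inRange = 0, outOfRange = 0;
        for (int i = 0; i < inSize; i += lanes) {
            if (i <= inSize - lanes) inRange++; else outOfRange++;
        }
        return new int[] {inRange, outOfRange};
    }

    public static void main(String[] args) {
        // Assumed sizes: 1027 elements, an 8-lane species. Only the last
        // iteration steps past the end of the array.
        System.out.println(Arrays.toString(split(1027, 8))); // [128, 1]
    }
}
```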

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]

2022-06-06 Thread Xiaohong Gong
> Currently the masked vector load, when the given index falls outside the 
> array boundary, is implemented with pure Java scalar code to avoid the 
> IOOBE (IndexOutOfBoundsException). This is necessary for architectures that 
> do not support the predicate feature, because the masked load is 
> implemented with a full vector load followed by a vector blend, and the 
> full vector load would cause the IOOBE, which is not valid. However, for 
> architectures that support the predicate feature, like SVE/AVX-512/RVV, it 
> can be vectorized with the predicated load instruction as long as the 
> indexes of the masked lanes are within the bounds of the array. For these 
> architectures, loading with unmasked lanes does not raise an exception.
> 
> This patch adds vectorization support for the IOOBE part of the masked 
> load. Please see the original Java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>byte[] a, int offset,
>VectorMask<Byte> m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicated load, HotSpot must 
> check whether the current backend supports it and fall back to the Java 
> scalar version if not. This is different from the normal masked vector 
> load, where the compiler will generate a full vector load and a vector 
> blend if the predicated load is not supported. So to let the compiler take 
> the expected action, an additional flag (i.e. `usePred`) is added to the 
> existing "loadMasked" intrinsic, with the value "true" for the IOOBE part 
> and "false" for the normal load. The compiler will then fail to intrinsify 
> if the flag is "true" and the predicated load is not supported by the 
> backend, which means the normal Java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance of the newly added benchmarks improves by about `1.88x ~ 
> 30.26x` on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gains can also be observed on a 512-bit SVE system.

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains seven commits:

 - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE
 - Merge branch 'jdk:master' into JDK-8283667
 - Merge branch 'jdk:master' into JDK-8283667
 - Use integer constant for offsetInRange all the way through
 - Rename "use_predicate" to "needs_predicate"
 - Rename the "usePred" to "offsetInRange"
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

-

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=05
  Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Mon, 6 Jun 2022 15:41:06 GMT, Paul Sandoz  wrote:

> Looks good. As a follow-on PR I think it would be useful to add constants 
> `OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`; then it becomes much clearer 
> in source and you can drop the `/* offsetInRange */` comment on the 
> argument.

Hi @PaulSandoz , thanks for the advice! I've rebased the code and added these 
two constants in the latest patch. Thanks again for the review!
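A minimal sketch of what such constants could look like at the call sites. The constant names come from the review comment above; the values and the helper method are assumptions, not the actual JDK code:

```java
public class OffsetInRangeConstantsSketch {
    // Assumed values: an int constant (rather than a boolean) is easy to
    // pass through the inlined template methods as a compile-time constant,
    // which the intrinsic requires in order to pick the right code path.
    static final int OFFSET_IN_RANGE = 1;
    static final int OFFSET_OUT_OF_RANGE = 0;

    // Illustrative stand-in for a fromArray0-style template taking the flag.
    static String load(int offsetInRange) {
        return offsetInRange == OFFSET_IN_RANGE
                ? "predicated load, no per-lane check"
                : "per-lane bounds check before access";
    }

    public static void main(String[] args) {
        // Self-documenting at the call site, replacing `/* offsetInRange */ 1`:
        System.out.println(load(OFFSET_IN_RANGE));
        System.out.println(load(OFFSET_OUT_OF_RANGE));
    }
}
```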

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Tue, 7 Jun 2022 06:41:36 GMT, Jatin Bhateja  wrote:

>> Yeah, thanks, it's really a good suggestion to limit this benchmark to 
>> only the IOOBE cases. I locally modified the tests to make sure only the 
>> IOOBE case happens, and the results look good as well. But do you think 
>> it's better to keep it as it is, since we can also see the performance of 
>> the common cases and make sure no regressions happen? The current 
>> benchmarks can also show the performance gain from this PR.
>
> It was just to remove the noise from a targeted micro benchmark. But we can 
> keep it as it is.

OK, thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-07 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> 719  240  b  jdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> 
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

Thanks for the review @PaulSandoz  @sviswa7 @jatin-bhateja @merykitty !

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Integrated: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-06-07 Thread Xiaohong Gong
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently the masked vector load, when the given index falls outside the 
> array boundary, is implemented with pure Java scalar code to avoid the 
> IOOBE (IndexOutOfBoundsException). This is necessary for architectures that 
> do not support the predicate feature, because the masked load is 
> implemented with a full vector load followed by a vector blend, and the 
> full vector load would cause the IOOBE, which is not valid. However, for 
> architectures that support the predicate feature, like SVE/AVX-512/RVV, it 
> can be vectorized with the predicated load instruction as long as the 
> indexes of the masked lanes are within the bounds of the array. For these 
> architectures, loading with unmasked lanes does not raise an exception.
> 
> This patch adds vectorization support for the IOOBE part of the masked 
> load. Please see the original Java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>byte[] a, int offset,
>VectorMask<Byte> m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicated load, HotSpot must 
> check whether the current backend supports it and fall back to the Java 
> scalar version if not. This is different from the normal masked vector 
> load, where the compiler will generate a full vector load and a vector 
> blend if the predicated load is not supported. So to let the compiler take 
> the expected action, an additional flag (i.e. `usePred`) is added to the 
> existing "loadMasked" intrinsic, with the value "true" for the IOOBE part 
> and "false" for the normal load. The compiler will then fail to intrinsify 
> if the flag is "true" and the predicated load is not supported by the 
> backend, which means the normal Java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance of the newly added benchmarks improves by about `1.88x ~ 
> 30.26x` on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gains can also be observed on a 512-bit SVE system.

This pull request has now been integrated.

Changeset: 39fa52b5
Author: Xiaohong Gong
URL:   
https://git.openjdk.java.net/jdk/commit/39fa52b5f7504eca7399b863b0fb934bdce37f7e
Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod

8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

Reviewed-by: sviswanathan, psandoz

-

PR: https://git.openjdk.java.net/jdk/pull/8035