Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot()
On Fri, 26 Mar 2021 01:50:33 GMT, Xiaohong Gong wrote: > Currently "VectorMask.andNot()" is not vectorized: > public VectorMask andNot(VectorMask m) { > // FIXME: Generate good code here. > return bOp(m, (i, a, b) -> a && !b); > } > This can be implemented with` "and(m.not())" `directly. Since > `"VectorMask.and()/not()" `have been vectorized in hotspot, `"andNot"` > can be vectorized as well by calling them. > > The performance gain is >100% for such a simple JMH: > @Benchmark > public Object andNot(Blackhole bh) { > boolean[] mask = fm.apply(SPECIES.length()); > boolean[] r = fmt.apply(SPECIES.length()); > VectorMask rm = VectorMask.fromArray(SPECIES, r, 0); > > for (int ic = 0; ic < INVOC_COUNT; ic++) { > for (int i = 0; i < mask.length; i += SPECIES.length()) { > VectorMask vmask = VectorMask.fromArray(SPECIES, mask, i); > rm = rm.andNot(vmask); > } > } > return rm; > } Hi there, could anyone please take a look at this PR? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/3211
Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]
> Currently "VectorMask.andNot()" is not vectorized: > public VectorMask andNot(VectorMask m) { > // FIXME: Generate good code here. > return bOp(m, (i, a, b) -> a && !b); > } > This can be implemented with` "and(m.not())" `directly. Since > `"VectorMask.and()/not()" `have been vectorized in hotspot, `"andNot"` > can be vectorized as well by calling them. > > The performance gain is >100% for such a simple JMH: > @Benchmark > public Object andNot(Blackhole bh) { > boolean[] mask = fm.apply(SPECIES.length()); > boolean[] r = fmt.apply(SPECIES.length()); > VectorMask rm = VectorMask.fromArray(SPECIES, r, 0); > > for (int ic = 0; ic < INVOC_COUNT; ic++) { > for (int i = 0; i < mask.length; i += SPECIES.length()) { > VectorMask vmask = VectorMask.fromArray(SPECIES, mask, i); > rm = rm.andNot(vmask); > } > } > return rm; > } Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Move the changing to AbstractMask.andNot and revert changes in VectorMask Change-Id: Ie3ec8f53cb24526c8e1ccc68038355d024910818 CustomizedGitHooks: yes - Changes: - all: https://git.openjdk.java.net/jdk/pull/3211/files - new: https://git.openjdk.java.net/jdk/pull/3211/files/ccb73d92..33ac041e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3211&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3211&range=00-01 Stats: 9 lines in 2 files changed: 5 ins; 2 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/3211.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/3211/head:pull/3211 PR: https://git.openjdk.java.net/jdk/pull/3211
Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]
On Wed, 31 Mar 2021 16:42:09 GMT, Paul Sandoz wrote: > Would you mind updating the existing `AbstractMask.andNot` implementation? > rather than changing `VectorMask.andNot`. That fits in with the current > implementation pattern. Hi @PaulSandoz , thanks for looking at this PR. I'v updated the patch according to your comments. Would you mind having look at it again? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/3211
Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]
On Thu, 1 Apr 2021 15:13:31 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Move the changing to AbstractMask.andNot and revert changes in VectorMask >> >> Change-Id: Ie3ec8f53cb24526c8e1ccc68038355d024910818 >> CustomizedGitHooks: yes > > Looks good. Thanks for your review @PaulSandoz ! - PR: https://git.openjdk.java.net/jdk/pull/3211
Re: RFR: 8264109: Add vectorized implementation for VectorMask.andNot() [v2]
On Fri, 2 Apr 2021 09:59:31 GMT, Ningsheng Jian wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Move the changing to AbstractMask.andNot and revert changes in VectorMask >> >> Change-Id: Ie3ec8f53cb24526c8e1ccc68038355d024910818 >> CustomizedGitHooks: yes > > LGTM Thanks for the review @nsjian ! - PR: https://git.openjdk.java.net/jdk/pull/3211
Integrated: 8264109: Add vectorized implementation for VectorMask.andNot()
On Fri, 26 Mar 2021 01:50:33 GMT, Xiaohong Gong wrote: > Currently "VectorMask.andNot()" is not vectorized: > public VectorMask andNot(VectorMask m) { > // FIXME: Generate good code here. > return bOp(m, (i, a, b) -> a && !b); > } > This can be implemented with` "and(m.not())" `directly. Since > `"VectorMask.and()/not()" `have been vectorized in hotspot, `"andNot"` > can be vectorized as well by calling them. > > The performance gain is >100% for such a simple JMH: > @Benchmark > public Object andNot(Blackhole bh) { > boolean[] mask = fm.apply(SPECIES.length()); > boolean[] r = fmt.apply(SPECIES.length()); > VectorMask rm = VectorMask.fromArray(SPECIES, r, 0); > > for (int ic = 0; ic < INVOC_COUNT; ic++) { > for (int i = 0; i < mask.length; i += SPECIES.length()) { > VectorMask vmask = VectorMask.fromArray(SPECIES, mask, i); > rm = rm.andNot(vmask); > } > } > return rm; > } This pull request has now been integrated. Changeset: 7d0a0bad Author:Xiaohong Gong Committer: Ningsheng Jian URL: https://git.openjdk.java.net/jdk/commit/7d0a0bad Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod 8264109: Add vectorized implementation for VectorMask.andNot() Reviewed-by: psandoz, njian - PR: https://git.openjdk.java.net/jdk/pull/3211
RFR: 8267969: Add vectorized implementation for VectorMask.eq()
Currently `"VectorMask.eq()" `is not vectorized: public VectorMask eq(VectorMask m) { // FIXME: Generate good code here. return bOp(m, (i, a, b) -> a == b); } This can be implemented by calling `"xor(m.not())"` directly. The performance improved about 1.4x ~ 1.9x for the following benchmark with different basic types: @Benchmark public Object eq() { boolean[] ma = fm.apply(size); boolean[] mb = fmb.apply(size); boolean[] mt = fmt.apply(size); VectorMask m = VectorMask.fromArray(SPECIES, mt, 0); for (int ic = 0; ic < INVOC_COUNT; ic++) { for (int i = 0; i < ma.length; i += SPECIES.length()) { var av = SPECIES.loadMask(ma, i); var bv = SPECIES.loadMask(mb, i); // accumulate results, so JIT can't eliminate relevant computations m = m.and(av.eq(bv)); } } return m; } - Commit messages: - 8267969: Add vectorized implementation for VectorMask.eq() Changes: https://git.openjdk.java.net/jdk/pull/4272/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=4272&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8267969 Stats: 254 lines in 32 files changed: 248 ins; 6 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/4272.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/4272/head:pull/4272 PR: https://git.openjdk.java.net/jdk/pull/4272
Re: RFR: 8267969: Add vectorized implementation for VectorMask.eq()
On Tue, 1 Jun 2021 16:29:58 GMT, Paul Sandoz wrote: > Looks. Later we may want to consider pushing this down as an intrinsic, > perhaps reusing `VectorSupport.compare`. Thanks for your review @PaulSandoz ! Yes, reusing `VectorSupport.compare` is an alternative way to do the vectorization. - PR: https://git.openjdk.java.net/jdk/pull/4272
Re: RFR: 8267969: Add vectorized implementation for VectorMask.eq() [v2]
> Currently `"VectorMask.eq()" `is not vectorized: > > public VectorMask eq(VectorMask m) { > // FIXME: Generate good code here. > return bOp(m, (i, a, b) -> a == b); > } > > This can be implemented by calling `"xor(m.not())"` directly. > > The performance improved about 1.4x ~ 1.9x for the following benchmark with > different basic types: > > @Benchmark > public Object eq() { > boolean[] ma = fm.apply(size); > boolean[] mb = fmb.apply(size); > boolean[] mt = fmt.apply(size); > VectorMask m = VectorMask.fromArray(SPECIES, mt, 0); > > for (int ic = 0; ic < INVOC_COUNT; ic++) { > for (int i = 0; i < ma.length; i += SPECIES.length()) { > var av = SPECIES.loadMask(ma, i); > var bv = SPECIES.loadMask(mb, i); > > // accumulate results, so JIT can't eliminate relevant > computations > m = m.and(av.eq(bv)); > } > } > > return m; > } Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'jdk:master' into JDK-8267969 - 8267969: Add vectorized implementation for VectorMask.eq() - Changes: - all: https://git.openjdk.java.net/jdk/pull/4272/files - new: https://git.openjdk.java.net/jdk/pull/4272/files/0e8e0e84..f3a48026 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4272&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4272&range=00-01 Stats: 22530 lines in 577 files changed: 2836 ins; 18744 del; 950 mod Patch: https://git.openjdk.java.net/jdk/pull/4272.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/4272/head:pull/4272 PR: https://git.openjdk.java.net/jdk/pull/4272
Integrated: 8267969: Add vectorized implementation for VectorMask.eq()
On Mon, 31 May 2021 10:25:26 GMT, Xiaohong Gong wrote: > Currently `"VectorMask.eq()" `is not vectorized: > > public VectorMask eq(VectorMask m) { > // FIXME: Generate good code here. > return bOp(m, (i, a, b) -> a == b); > } > > This can be implemented by calling `"xor(m.not())"` directly. > > The performance improved about 1.4x ~ 1.9x for the following benchmark with > different basic types: > > @Benchmark > public Object eq() { > boolean[] ma = fm.apply(size); > boolean[] mb = fmb.apply(size); > boolean[] mt = fmt.apply(size); > VectorMask m = VectorMask.fromArray(SPECIES, mt, 0); > > for (int ic = 0; ic < INVOC_COUNT; ic++) { > for (int i = 0; i < ma.length; i += SPECIES.length()) { > var av = SPECIES.loadMask(ma, i); > var bv = SPECIES.loadMask(mb, i); > > // accumulate results, so JIT can't eliminate relevant > computations > m = m.and(av.eq(bv)); > } > } > > return m; > } This pull request has now been integrated. Changeset: 496fb90b Author:Xiaohong Gong Committer: Nils Eliasson URL: https://git.openjdk.java.net/jdk/commit/496fb90b3a038f5fe76acc8a98bcd0dac4071cf9 Stats: 254 lines in 32 files changed: 248 ins; 6 del; 0 mod 8267969: Add vectorized implementation for VectorMask.eq() Reviewed-by: psandoz, neliasso - PR: https://git.openjdk.java.net/jdk/pull/4272
Re: RFR: 8267969: Add vectorized implementation for VectorMask.eq() [v2]
On Wed, 2 Jun 2021 07:48:47 GMT, Nils Eliasson wrote: > Please wait until you have two reviewers before integrating. Sure! Thanks so much for looking at this PR! - PR: https://git.openjdk.java.net/jdk/pull/4272
Re: RFR: 8268151: Vector API toShuffle optimization
On Thu, 3 Jun 2021 00:29:00 GMT, Sandhya Viswanathan wrote: > The Vector API toShuffle method can be optimized using existing vector > conversion intrinsic. > > The following changes are made: > 1) vector.toShuffle java implementation is changed to call > VectorSupport.convert. > 2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp > is changed to allow shuffle as a destination type. > 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in > vectorIntrinsics.cpp now explicitly generates conversion node instead of > performing conversion during unbox. This is to remove unnecessary boxing > during back to back vector.toShuffle and shuffle.toVector calls. > > Best Regards, > Sandhya src/hotspot/share/opto/vectornode.cpp line 1246: > 1244: return new VectorLoadMaskNode(value, out_vt); > 1245: } else if (is_vector_shuffle) { > 1246: if (!is_shuffle_to_vector()) { Hi @sviswa7 , thanks for this change! I'm just curious whether `is_shuffle_to_vector()` is still needed for `VectorUnboxNode` with this change? It seems this flag can be removed, doesn't it? - PR: https://git.openjdk.java.net/jdk/pull/4326
Re: RFR: 8268151: Vector API toShuffle optimization
On Thu, 3 Jun 2021 00:29:00 GMT, Sandhya Viswanathan wrote: > The Vector API toShuffle method can be optimized using existing vector > conversion intrinsic. > > The following changes are made: > 1) vector.toShuffle java implementation is changed to call > VectorSupport.convert. > 2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp > is changed to allow shuffle as a destination type. > 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in > vectorIntrinsics.cpp now explicitly generates conversion node instead of > performing conversion during unbox. This is to remove unnecessary boxing > during back to back vector.toShuffle and shuffle.toVector calls. > > Best Regards, > Sandhya src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 335: > 333: @ForceInline > 334: private final > 335: VectorShuffle toShuffleTemplate(AbstractSpecies dsp) { Is it better to move this template method to the super class like other APIs? src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 350: > 348: Byte128Shuffle.class, byte.class, > VLENGTH, > 349: this, VSPECIES, > 350: Byte128Vector::toShuffleTemplate); ditto - PR: https://git.openjdk.java.net/jdk/pull/4326
Re: RFR: 8268151: Vector API toShuffle optimization [v2]
On Thu, 3 Jun 2021 21:43:19 GMT, Sandhya Viswanathan wrote: >> The Vector API toShuffle method can be optimized using existing vector >> conversion intrinsic. >> >> The following changes are made: >> 1) vector.toShuffle java implementation is changed to call >> VectorSupport.convert. >> 2) The conversion intrinsic (inline_vector_convert()) in >> vectorIntrinsics.cpp is changed to allow shuffle as a destination type. >> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in >> vectorIntrinsics.cpp now explicitly generates conversion node instead of >> performing conversion during unbox. This is to remove unnecessary boxing >> during back to back vector.toShuffle and shuffle.toVector calls. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one > additional commit since the last revision: > > Implement review comments Looks good to me. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/4326
Re: RFR: 8268151: Vector API toShuffle optimization [v2]
On Thu, 3 Jun 2021 18:40:09 GMT, Sandhya Viswanathan wrote: >> src/hotspot/share/opto/vectornode.cpp line 1246: >> >>> 1244: return new VectorLoadMaskNode(value, out_vt); >>> 1245: } else if (is_vector_shuffle) { >>> 1246: if (!is_shuffle_to_vector()) { >> >> Hi @sviswa7 , thanks for this change! I'm just curious whether >> `is_shuffle_to_vector()` is still needed for `VectorUnboxNode` with this >> change? It seems this flag can be removed, doesn't it? > > @XiaohongGong is_shuffle_to_vector is still needed as we shouldn't generate > VectorLoadShuffleNode for shuffle.toVector. OK, got it. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/4326
RFR: 8282432: Optimize masked "test" Vector API with predicate feature
The vector `"test"` api is implemented with vector `"compare"`. And the masked `"test" `is implemented with `"test(op).and(m)"` which means `"compare().and(m)"` finally. Since the masked vector `"compare"` has been optimized with predicated instruction for archituctures that support the feature, the masked `"test"` can also be optimized by implementing it with the predicated compare. This could save the additional ` "and"` for architectures that support the predicate feature. This patch optimized the masked `"test"` by implementing it with masked `"compare"`. With this patch, the following codes for the` "IS_NEGATIVE"` op for a DoubleVector with SVE: mov z19.d, #0 cmpgt p1.d, p7/z, z19.d, z17.d and p0.b, p7/z, p1.b, p0.b are optimized to: mov z19.d, #0 cmpgt p0.d, p0/z, z19.d, z17.d Also update the jtreg tests for masked` "test" ` to make sure they are hot enough to be compiled by c2. - Commit messages: - 8282432: Optimize masked "test" Vector API with predicate feature Changes: https://git.openjdk.java.net/jdk/pull/7654/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7654&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8282432 Stats: 1233 lines in 70 files changed: 590 ins; 7 del; 636 mod Patch: https://git.openjdk.java.net/jdk/pull/7654.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7654/head:pull/7654 PR: https://git.openjdk.java.net/jdk/pull/7654
Re: RFR: 8282432: Optimize masked "test" Vector API with predicate feature [v2]
> The vector `"test"` api is implemented with vector `"compare"`. And the > masked `"test" `is implemented with `"test(op).and(m)"` which means > `"compare().and(m)"` finally. Since the masked vector `"compare"` has been > optimized with predicated instruction for archituctures that support the > feature, the masked `"test"` can also be optimized by implementing it with > the predicated compare. This could save the additional ` "and"` for > architectures that support the predicate feature. > > This patch optimized the masked `"test"` by implementing it with masked > `"compare"`. With this patch, the following codes for the` "IS_NEGATIVE"` op > for a DoubleVector with SVE: > > mov z19.d, #0 > cmpgt p1.d, p7/z, z19.d, z17.d > and p0.b, p7/z, p1.b, p0.b > > are optimized to: > > mov z19.d, #0 > cmpgt p0.d, p0/z, z19.d, z17.d > > Also update the jtreg tests for masked` "test" ` to make sure they are hot > enough to be compiled by c2. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Simply the testTemplate - Changes: - all: https://git.openjdk.java.net/jdk/pull/7654/files - new: https://git.openjdk.java.net/jdk/pull/7654/files/41a9d56f..c90d2f49 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7654&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7654&range=00-01 Stats: 197 lines in 49 files changed: 13 ins; 33 del; 151 mod Patch: https://git.openjdk.java.net/jdk/pull/7654.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7654/head:pull/7654 PR: https://git.openjdk.java.net/jdk/pull/7654
Re: RFR: 8282432: Optimize masked "test" Vector API with predicate feature
On Wed, 2 Mar 2022 02:50:27 GMT, Xiaohong Gong wrote: > The vector `"test"` api is implemented with vector `"compare"`. And the > masked `"test" `is implemented with `"test(op).and(m)"` which means > `"compare().and(m)"` finally. Since the masked vector `"compare"` has been > optimized with predicated instruction for archituctures that support the > feature, the masked `"test"` can also be optimized by implementing it with > the predicated compare. This could save the additional ` "and"` for > architectures that support the predicate feature. > > This patch optimized the masked `"test"` by implementing it with masked > `"compare"`. With this patch, the following codes for the` "IS_NEGATIVE"` op > for a DoubleVector with SVE: > > mov z19.d, #0 > cmpgt p1.d, p7/z, z19.d, z17.d > and p0.b, p7/z, p1.b, p0.b > > are optimized to: > > mov z19.d, #0 > cmpgt p0.d, p0/z, z19.d, z17.d > > Also update the jtreg tests for masked` "test" ` to make sure they are hot > enough to be compiled by c2. Hi @PaulSandoz , thanks for your review! All the comments have been address. Could you please take a look at it again? Thanks! - PR: https://git.openjdk.java.net/jdk/pull/7654
Re: RFR: 8282432: Optimize masked "test" Vector API with predicate feature [v2]
On Thu, 3 Mar 2022 17:40:13 GMT, Paul Sandoz wrote: > I guess the following: `mask.cast(IntVector.species(shape())` is more > efficient than: `m.cast(vspecies().asIntegral()))` ? Yeah, that's one point. Another main reason is `m.cast(vspecies().asIntegral()))` cannot be handled well in the superclass due to the java generic issues. Thanks for the reiview @PaulSandoz ! - PR: https://git.openjdk.java.net/jdk/pull/7654
Integrated: 8282432: Optimize masked "test" Vector API with predicate feature
On Wed, 2 Mar 2022 02:50:27 GMT, Xiaohong Gong wrote: > The vector `"test"` api is implemented with vector `"compare"`. And the > masked `"test" `is implemented with `"test(op).and(m)"` which means > `"compare().and(m)"` finally. Since the masked vector `"compare"` has been > optimized with predicated instruction for archituctures that support the > feature, the masked `"test"` can also be optimized by implementing it with > the predicated compare. This could save the additional ` "and"` for > architectures that support the predicate feature. > > This patch optimized the masked `"test"` by implementing it with masked > `"compare"`. With this patch, the following codes for the` "IS_NEGATIVE"` op > for a DoubleVector with SVE: > > mov z19.d, #0 > cmpgt p1.d, p7/z, z19.d, z17.d > and p0.b, p7/z, p1.b, p0.b > > are optimized to: > > mov z19.d, #0 > cmpgt p0.d, p0/z, z19.d, z17.d > > Also update the jtreg tests for masked` "test" ` to make sure they are hot > enough to be compiled by c2. This pull request has now been integrated. Changeset: 12693a6c Author:Xiaohong Gong Committer: Ningsheng Jian URL: https://git.openjdk.java.net/jdk/commit/12693a6cf3e00c38a635d888419406f689657813 Stats: 1325 lines in 70 files changed: 574 ins; 11 del; 740 mod 8282432: Optimize masked "test" Vector API with predicate feature Reviewed-by: psandoz - PR: https://git.openjdk.java.net/jdk/pull/7654
Re: RFR: 8282162: [vector] Optimize vector negation API
On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong wrote: > The current vector `"NEG"` is implemented with substraction a vector by zero > in case the architecture does not support the negation instruction. And to > fit the predicate feature for architectures that support it, the masked > vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They both > can be optimized to a single negation instruction for ARM SVE. > And so does the non-masked "NEG" for NEON. Besides, implementing the masked > "NEG" with substraction for architectures that support neither negation > instruction nor predicate feature can also save several instructions than the > current pattern. > > To optimize the VectorAPI negation, this patch moves the implementation from > Java side to hotspot. The compiler will generate different nodes according to > the architecture: > - Generate the (predicated) negation node if architecture supports it, > otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. > - Generate `"zero.sub(v, m)"` for masked operation if the architecture does > not have predicate feature, otherwise generate the original pattern > `"v.xor(-1, m).add(1, m)"`. > > So with this patch, the following transformations are applied: > > For non-masked negation with NEON: > > moviv16.4s, #0x0 > sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s > > and with SVE: > > mov z16.s, #0 > sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s > > For masked negation with NEON: > > moviv17.4s, #0x1 > mvn v19.16b, v18.16b > mov v20.16b, v16.16b ==> neg v18.4s, v17.4s > bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b > add v19.4s, v20.4s, v17.4s > mov v18.16b, v16.16b > bsl v18.16b, v19.16b, v20.16b > > and with SVE: > > mov z16.s, #-1 > mov z17.s, #1==> neg z16.s, p0/m, z16.s > eor z18.s, p0/m, z18.s, z16.s > add z18.s, p0/m, z18.s, z17.s > > Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 > machines(note that the non-masked negation benchmarks do not have any > improvement on X86 since no instructions are changed): > > NEON: > BenchmarkGain > Byte128Vector.NEG1.029 > Byte128Vector.NEGMasked 1.757 > Short128Vector.NEG 1.041 > Short128Vector.NEGMasked 1.659 > Int128Vector.NEG 1.005 > Int128Vector.NEGMasked 1.513 > Long128Vector.NEG1.003 > Long128Vector.NEGMasked 1.878 > > SVE with 512-bits: > BenchmarkGain > ByteMaxVector.NEG1.10 > ByteMaxVector.NEGMasked 1.165 > ShortMaxVector.NEG 1.056 > ShortMaxVector.NEGMasked 1.195 > IntMaxVector.NEG 1.002 > IntMaxVector.NEGMasked 1.239 > LongMaxVector.NEG1.031 > LongMaxVector.NEGMasked 1.191 > > X86 (non AVX-512): > BenchmarkGain > ByteMaxVector.NEGMasked 1.254 > ShortMaxVector.NEGMasked 1.359 > IntMaxVector.NEGMasked 1.431 > LongMaxVector.NEGMasked 1.989 > > [1] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 > [2] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 Hi, could anyone please take a look at this PR? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API
On Tue, 15 Mar 2022 02:39:42 GMT, Joe Darcy wrote: > Note that in terms of Java semantics, negation of floating point values needs > to be implemented as subtraction from negative zero rather than positive zero: > > double negate(double arg) {return -0.0 - arg; } > > This is to handle signed zeros correctly. Hi @jddarcy ,thanks for looking at this PR and thanks for the notes on the floating point negation! Yeah, this really makes sense to me. Kindly note that this patch didn't touch the negation of the floating point values. For Vector API, the vector floating point negation has been intrinsified to `NegVF/D` node by compiler that we directly generate the negation instructions for them. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API
On Sat, 19 Mar 2022 02:34:55 GMT, Jie Fu wrote: >> The current vector `"NEG"` is implemented with substraction a vector by zero >> in case the architecture does not support the negation instruction. And to >> fit the predicate feature for architectures that support it, the masked >> vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They >> both can be optimized to a single negation instruction for ARM SVE. >> And so does the non-masked "NEG" for NEON. Besides, implementing the masked >> "NEG" with substraction for architectures that support neither negation >> instruction nor predicate feature can also save several instructions than >> the current pattern. >> >> To optimize the VectorAPI negation, this patch moves the implementation from >> Java side to hotspot. The compiler will generate different nodes according >> to the architecture: >> - Generate the (predicated) negation node if architecture supports it, >> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. >> - Generate `"zero.sub(v, m)"` for masked operation if the architecture >> does not have predicate feature, otherwise generate the original pattern >> `"v.xor(-1, m).add(1, m)"`. >> >> So with this patch, the following transformations are applied: >> >> For non-masked negation with NEON: >> >> moviv16.4s, #0x0 >> sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s >> >> and with SVE: >> >> mov z16.s, #0 >> sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s >> >> For masked negation with NEON: >> >> moviv17.4s, #0x1 >> mvn v19.16b, v18.16b >> mov v20.16b, v16.16b ==> neg v18.4s, v17.4s >> bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b >> add v19.4s, v20.4s, v17.4s >> mov v18.16b, v16.16b >> bsl v18.16b, v19.16b, v20.16b >> >> and with SVE: >> >> mov z16.s, #-1 >> mov z17.s, #1==> neg z16.s, p0/m, z16.s >> eor z18.s, p0/m, z18.s, z16.s >> add z18.s, p0/m, z18.s, z17.s >> >> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 >> machines(note that the non-masked negation benchmarks do not have any >> improvement on X86 since no instructions are changed): >> >> NEON: >> BenchmarkGain >> Byte128Vector.NEG1.029 >> Byte128Vector.NEGMasked 1.757 >> Short128Vector.NEG 1.041 >> Short128Vector.NEGMasked 1.659 >> Int128Vector.NEG 1.005 >> Int128Vector.NEGMasked 1.513 >> Long128Vector.NEG1.003 >> Long128Vector.NEGMasked 1.878 >> >> SVE with 512-bits: >> BenchmarkGain >> ByteMaxVector.NEG1.10 >> ByteMaxVector.NEGMasked 1.165 >> ShortMaxVector.NEG 1.056 >> ShortMaxVector.NEGMasked 1.195 >> IntMaxVector.NEG 1.002 >> IntMaxVector.NEGMasked 1.239 >> LongMaxVector.NEG1.031 >> LongMaxVector.NEGMasked 1.191 >> >> X86 (non AVX-512): >> BenchmarkGain >> ByteMaxVector.NEGMasked 1.254 >> ShortMaxVector.NEGMasked 1.359 >> IntMaxVector.NEGMasked 1.431 >> LongMaxVector.NEGMasked 1.989 >> >> [1] >> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 >> [2] >> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 > > src/hotspot/share/opto/vectornode.cpp line 141: > >> 139: case T_BYTE: >> 140: case T_SHORT: >> 141: case T_INT: return Op_NegVI; > > Why not add `Op_NegVB` for `BYTE` and `Op_NegVS` for `SHORT`? > Is there any performance drop for byte/short negation operation if both of > them are handled as a NegVI vector? The compiler can get the real type info from `Op_NegVI` that can also handle the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs which also need more match rules in the ad files. > Is there any performance drop for byte/short negation operation if both of > them are handled as a NegVI vector? >From the benchmark results I showed in the commit message, I didn't see not >any performance drop for byte/short. Thanks! > src/hotspot/share/opto/vectornode.cpp line 1635: > >> 1633: } >> 1634: >> 1635: Node* NegVINode::Ideal(PhaseGVN* phase, bool can_reshape) { > > Much duplication in `NegVINode::Ideal` and `NegVLNode::Ideal`. > Is it possible to refactor the implementation? Yeah, maybe we need a superclass for `NegVINode` and `NegVLNode`. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API
On Sat, 19 Mar 2022 03:11:12 GMT, Jie Fu wrote: >>> Note that in terms of Java semantics, negation of floating point values >>> needs to be implemented as subtraction from negative zero rather than >>> positive zero: >>> >>> double negate(double arg) {return -0.0 - arg; } >>> >>> This is to handle signed zeros correctly. >> >> Hi @jddarcy ,thanks for looking at this PR and thanks for the notes on the >> floating point negation! Yeah, this really makes sense to me. Kindly note >> that this patch didn't touch the negation of the floating point values. For >> Vector API, the vector floating point negation has been intrinsified to >> `NegVF/D` node by compiler that we directly generate the negation >> instructions for them. Thanks! > >> Note that in terms of Java semantics, negation of floating point values >> needs to be implemented as subtraction from negative zero rather than >> positive zero: >> >> double negate(double arg) {return -0.0 - arg; } >> >> This is to handle signed zeros correctly. > > This seems easy to be broken by an opt enhancement. > Just wondering do we have a jtreg test for this point? @jddarcy > Thanks. Hi @DamonFool , thanks for your review! All the comments have been addressed. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
> The current vector `"NEG"` is implemented with substraction a vector by zero > in case the architecture does not support the negation instruction. And to > fit the predicate feature for architectures that support it, the masked > vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They both > can be optimized to a single negation instruction for ARM SVE. > And so does the non-masked "NEG" for NEON. Besides, implementing the masked > "NEG" with substraction for architectures that support neither negation > instruction nor predicate feature can also save several instructions than the > current pattern. > > To optimize the VectorAPI negation, this patch moves the implementation from > Java side to hotspot. The compiler will generate different nodes according to > the architecture: > - Generate the (predicated) negation node if architecture supports it, > otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. > - Generate `"zero.sub(v, m)"` for masked operation if the architecture does > not have predicate feature, otherwise generate the original pattern > `"v.xor(-1, m).add(1, m)"`. > > So with this patch, the following transformations are applied: > > For non-masked negation with NEON: > > moviv16.4s, #0x0 > sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s > > and with SVE: > > mov z16.s, #0 > sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s > > For masked negation with NEON: > > moviv17.4s, #0x1 > mvn v19.16b, v18.16b > mov v20.16b, v16.16b ==> neg v18.4s, v17.4s > bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b > add v19.4s, v20.4s, v17.4s > mov v18.16b, v16.16b > bsl v18.16b, v19.16b, v20.16b > > and with SVE: > > mov z16.s, #-1 > mov z17.s, #1==> neg z16.s, p0/m, z16.s > eor z18.s, p0/m, z18.s, z16.s > add z18.s, p0/m, z18.s, z17.s > > Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 > machines(note that the non-masked negation benchmarks do not have any > improvement on X86 since no instructions are changed): > > NEON: > BenchmarkGain > Byte128Vector.NEG1.029 > Byte128Vector.NEGMasked 1.757 > Short128Vector.NEG 1.041 > Short128Vector.NEGMasked 1.659 > Int128Vector.NEG 1.005 > Int128Vector.NEGMasked 1.513 > Long128Vector.NEG1.003 > Long128Vector.NEGMasked 1.878 > > SVE with 512-bits: > BenchmarkGain > ByteMaxVector.NEG1.10 > ByteMaxVector.NEGMasked 1.165 > ShortMaxVector.NEG 1.056 > ShortMaxVector.NEGMasked 1.195 > IntMaxVector.NEG 1.002 > IntMaxVector.NEGMasked 1.239 > LongMaxVector.NEG1.031 > LongMaxVector.NEGMasked 1.191 > > X86 (non AVX-512): > BenchmarkGain > ByteMaxVector.NEGMasked 1.254 > ShortMaxVector.NEGMasked 1.359 > IntMaxVector.NEGMasked 1.431 > LongMaxVector.NEGMasked 1.989 > > [1] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 > [2] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Add a superclass for vector negation - Changes: - all: https://git.openjdk.java.net/jdk/pull/7782/files - new: https://git.openjdk.java.net/jdk/pull/7782/files/828866f8..97c8119a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=00-01 Stats: 64 lines in 4 files changed: 16 ins; 13 del; 35 mod Patch: https://git.openjdk.java.net/jdk/pull/7782.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782 PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Tue, 22 Mar 2022 09:58:23 GMT, Xiaohong Gong wrote: >> The current vector `"NEG"` is implemented with substraction a vector by zero >> in case the architecture does not support the negation instruction. And to >> fit the predicate feature for architectures that support it, the masked >> vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They >> both can be optimized to a single negation instruction for ARM SVE. >> And so does the non-masked "NEG" for NEON. Besides, implementing the masked >> "NEG" with substraction for architectures that support neither negation >> instruction nor predicate feature can also save several instructions than >> the current pattern. >> >> To optimize the VectorAPI negation, this patch moves the implementation from >> Java side to hotspot. The compiler will generate different nodes according >> to the architecture: >> - Generate the (predicated) negation node if architecture supports it, >> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. >> - Generate `"zero.sub(v, m)"` for masked operation if the architecture >> does not have predicate feature, otherwise generate the original pattern >> `"v.xor(-1, m).add(1, m)"`. >> >> So with this patch, the following transformations are applied: >> >> For non-masked negation with NEON: >> >> moviv16.4s, #0x0 >> sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s >> >> and with SVE: >> >> mov z16.s, #0 >> sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s >> >> For masked negation with NEON: >> >> moviv17.4s, #0x1 >> mvn v19.16b, v18.16b >> mov v20.16b, v16.16b ==> neg v18.4s, v17.4s >> bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b >> add v19.4s, v20.4s, v17.4s >> mov v18.16b, v16.16b >> bsl v18.16b, v19.16b, v20.16b >> >> and with SVE: >> >> mov z16.s, #-1 >> mov z17.s, #1==> neg z16.s, p0/m, z16.s >> eor z18.s, p0/m, z18.s, z16.s >> add z18.s, p0/m, z18.s, z17.s >> >> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 >> machines(note that the non-masked negation benchmarks do not have any >> improvement on X86 since no instructions are changed): >> >> NEON: >> BenchmarkGain >> Byte128Vector.NEG1.029 >> Byte128Vector.NEGMasked 1.757 >> Short128Vector.NEG 1.041 >> Short128Vector.NEGMasked 1.659 >> Int128Vector.NEG 1.005 >> Int128Vector.NEGMasked 1.513 >> Long128Vector.NEG1.003 >> Long128Vector.NEGMasked 1.878 >> >> SVE with 512-bits: >> BenchmarkGain >> ByteMaxVector.NEG1.10 >> ByteMaxVector.NEGMasked 1.165 >> ShortMaxVector.NEG 1.056 >> ShortMaxVector.NEGMasked 1.195 >> IntMaxVector.NEG 1.002 >> IntMaxVector.NEGMasked 1.239 >> LongMaxVector.NEG1.031 >> LongMaxVector.NEGMasked 1.191 >> >> X86 (non AVX-512): >> BenchmarkGain >> ByteMaxVector.NEGMasked 1.254 >> ShortMaxVector.NEGMasked 1.359 >> IntMaxVector.NEGMasked 1.431 >> LongMaxVector.NEGMasked 1.989 >> >> [1] >> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 >> [2] >> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 > > Xiaohong Gong has updated the pull request incrementally with one additional > commit since the last revision: > > Add a superclass for vector negation Hi @PaulSandoz @jatin-bhateja, could you please help to take a look at this PR? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Mon, 28 Mar 2022 07:40:48 GMT, Jie Fu wrote: >> The compiler can get the real type info from `Op_NegVI` that can also handle >> the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs >> which also need more match rules in the ad files. >> >>> Is there any performance drop for byte/short negation operation if both of >>> them are handled as a NegVI vector? >> >> From the benchmark results I showed in the commit message, I didn't see not >> any performance drop for byte/short. Thanks! > >> The compiler can get the real type info from `Op_NegVI` that can also handle >> the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs >> which also need more match rules in the ad files. >> >> > Is there any performance drop for byte/short negation operation if both of >> > them are handled as a NegVI vector? >> >> From the benchmark results I showed in the commit message, I didn't see not >> any performance drop for byte/short. Thanks! > > There seems no vectorized negation instructions for {byte, short, int, long} > on x86, so this should be fine on x86. > I tested the patch on x86 and the performance number looks good. Thanks for doing this! Yeah, I think the performance for masked negation operations might improve on non avx-512 systems. - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Mon, 28 Mar 2022 07:43:29 GMT, Jie Fu wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Add a superclass for vector negation > > src/hotspot/share/opto/vectornode.cpp line 1592: > >> 1590: >> 1591: // Generate other vector nodes to implement the masked/non-masked >> vector negation. >> 1592: Node* VectorNode::degenerate_vector_integral_negate(Node* n, int vlen, >> BasicType bt, PhaseGVN* phase, bool is_predicated) { > > Shall we move this declaration in `class NegVNode` since it is only used by > NegVNode::Ideal ? I think it can be. Thanks for pointing out this. I will change this later. - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize integral vector negation API [v3]
> The current vector `"NEG"` is implemented with substraction a vector by zero > in case the architecture does not support the negation instruction. And to > fit the predicate feature for architectures that support it, the masked > vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They both > can be optimized to a single negation instruction for ARM SVE. > And so does the non-masked "NEG" for NEON. Besides, implementing the masked > "NEG" with substraction for architectures that support neither negation > instruction nor predicate feature can also save several instructions than the > current pattern. > > To optimize the VectorAPI negation, this patch moves the implementation from > Java side to hotspot. The compiler will generate different nodes according to > the architecture: > - Generate the (predicated) negation node if architecture supports it, > otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. > - Generate `"zero.sub(v, m)"` for masked operation if the architecture does > not have predicate feature, otherwise generate the original pattern > `"v.xor(-1, m).add(1, m)"`. > > So with this patch, the following transformations are applied: > > For non-masked negation with NEON: > > moviv16.4s, #0x0 > sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s > > and with SVE: > > mov z16.s, #0 > sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s > > For masked negation with NEON: > > moviv17.4s, #0x1 > mvn v19.16b, v18.16b > mov v20.16b, v16.16b ==> neg v18.4s, v17.4s > bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b > add v19.4s, v20.4s, v17.4s > mov v18.16b, v16.16b > bsl v18.16b, v19.16b, v20.16b > > and with SVE: > > mov z16.s, #-1 > mov z17.s, #1==> neg z16.s, p0/m, z16.s > eor z18.s, p0/m, z18.s, z16.s > add z18.s, p0/m, z18.s, z17.s > > Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 > machines(note that the non-masked negation benchmarks do not have any > improvement on X86 since no instructions are changed): > > NEON: > BenchmarkGain > Byte128Vector.NEG1.029 > Byte128Vector.NEGMasked 1.757 > Short128Vector.NEG 1.041 > Short128Vector.NEGMasked 1.659 > Int128Vector.NEG 1.005 > Int128Vector.NEGMasked 1.513 > Long128Vector.NEG1.003 > Long128Vector.NEGMasked 1.878 > > SVE with 512-bits: > BenchmarkGain > ByteMaxVector.NEG1.10 > ByteMaxVector.NEGMasked 1.165 > ShortMaxVector.NEG 1.056 > ShortMaxVector.NEGMasked 1.195 > IntMaxVector.NEG 1.002 > IntMaxVector.NEGMasked 1.239 > LongMaxVector.NEG1.031 > LongMaxVector.NEGMasked 1.191 > > X86 (non AVX-512): > BenchmarkGain > ByteMaxVector.NEGMasked 1.254 > ShortMaxVector.NEGMasked 1.359 > IntMaxVector.NEGMasked 1.431 > LongMaxVector.NEGMasked 1.989 > > [1] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 > [2] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Make "degenerate_vector_integral_negate" to be "NegVI" private - Changes: - all: https://git.openjdk.java.net/jdk/pull/7782/files - new: https://git.openjdk.java.net/jdk/pull/7782/files/97c8119a..48f4d6be Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=01-02 Stats: 15 lines in 2 files changed: 6 ins; 1 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/7782.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782 PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize integral vector negation API [v3]
On Tue, 29 Mar 2022 05:05:43 GMT, Jie Fu wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Make "degenerate_vector_integral_negate" to be "NegVI" private > > Note: I didn't check the aarch64 code change. Thanks for the review @DamonFool ! - PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize integral vector negation API [v3]
On Tue, 29 Mar 2022 18:05:56 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Make "degenerate_vector_integral_negate" to be "NegVI" private > > Java changes are good. Thanks for the review @PaulSandoz @nsjian ! - PR: https://git.openjdk.java.net/jdk/pull/7782
Integrated: 8282162: [vector] Optimize integral vector negation API
On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong wrote: > The current vector `"NEG"` is implemented with substraction a vector by zero > in case the architecture does not support the negation instruction. And to > fit the predicate feature for architectures that support it, the masked > vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They both > can be optimized to a single negation instruction for ARM SVE. > And so does the non-masked "NEG" for NEON. Besides, implementing the masked > "NEG" with substraction for architectures that support neither negation > instruction nor predicate feature can also save several instructions than the > current pattern. > > To optimize the VectorAPI negation, this patch moves the implementation from > Java side to hotspot. The compiler will generate different nodes according to > the architecture: > - Generate the (predicated) negation node if architecture supports it, > otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. > - Generate `"zero.sub(v, m)"` for masked operation if the architecture does > not have predicate feature, otherwise generate the original pattern > `"v.xor(-1, m).add(1, m)"`. > > So with this patch, the following transformations are applied: > > For non-masked negation with NEON: > > moviv16.4s, #0x0 > sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s > > and with SVE: > > mov z16.s, #0 > sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s > > For masked negation with NEON: > > moviv17.4s, #0x1 > mvn v19.16b, v18.16b > mov v20.16b, v16.16b ==> neg v18.4s, v17.4s > bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b > add v19.4s, v20.4s, v17.4s > mov v18.16b, v16.16b > bsl v18.16b, v19.16b, v20.16b > > and with SVE: > > mov z16.s, #-1 > mov z17.s, #1==> neg z16.s, p0/m, z16.s > eor z18.s, p0/m, z18.s, z16.s > add z18.s, p0/m, z18.s, z17.s > > Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 > machines(note that the non-masked negation benchmarks do not have any > improvement on X86 since no instructions are changed): > > NEON: > BenchmarkGain > Byte128Vector.NEG1.029 > Byte128Vector.NEGMasked 1.757 > Short128Vector.NEG 1.041 > Short128Vector.NEGMasked 1.659 > Int128Vector.NEG 1.005 > Int128Vector.NEGMasked 1.513 > Long128Vector.NEG1.003 > Long128Vector.NEGMasked 1.878 > > SVE with 512-bits: > BenchmarkGain > ByteMaxVector.NEG1.10 > ByteMaxVector.NEGMasked 1.165 > ShortMaxVector.NEG 1.056 > ShortMaxVector.NEGMasked 1.195 > IntMaxVector.NEG 1.002 > IntMaxVector.NEGMasked 1.239 > LongMaxVector.NEG1.031 > LongMaxVector.NEGMasked 1.191 > > X86 (non AVX-512): > BenchmarkGain > ByteMaxVector.NEGMasked 1.254 > ShortMaxVector.NEGMasked 1.359 > IntMaxVector.NEGMasked 1.431 > LongMaxVector.NEGMasked 1.989 > > [1] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 > [2] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 This pull request has now been integrated. Changeset: d0668568 Author:Xiaohong Gong Committer: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/d06685680c17583d56dc3d788d9a2ecea8812bc8 Stats: 325 lines in 15 files changed: 275 ins; 25 del; 25 mod 8282162: [vector] Optimize integral vector negation API Reviewed-by: jiefu, psandoz, njian - PR: https://git.openjdk.java.net/jdk/pull/7782
RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): @ForceInline public static ByteVector fromArray(VectorSpecies species, byte[] a, int offset, VectorMask m) { ByteSpecies vsp = (ByteSpecies) species; if (offset >= 0 && offset <= (a.length - species.length())) { return vsp.dummyVector().fromArray0(a, offset, m); } // FIXME: optimize checkMaskFromIndexSize(offset, vsp, m, 1, a.length); return vsp.vOp(m, i -> a[offset + i]); } Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. Also adds the same vectorization support for masked: - fromByteArray/fromByteBuffer - fromBooleanArray - fromCharArray The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: Benchmark before After Units LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms Similar performance gain can also be observed on 512-bit SVE system. - Commit messages: - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature Changes: https://git.openjdk.java.net/jdk/pull/8035/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8283667 Stats: 821 lines in 43 files changed: 314 ins; 117 del; 390 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote: > Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Hi @PaulSandoz @jatin-bhateja @sviswa7, could you please help to check this PR? Any feedback is welcome! Thanks a lot! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
On Mon, 11 Apr 2022 09:04:36 GMT, Jatin Bhateja wrote: >> The optimization for masked store is recorded to: >> https://bugs.openjdk.java.net/browse/JDK-8284050 > >> The blend should be with the intended-to-store vector, so that masked lanes >> contain the need-to-store elements and unmasked lanes contain the loaded >> elements, which would be stored back, which results in unchanged values. > > It may not work if memory is beyond legal accessible address space of the > process, a corner case could be a page boundary. Thus re-composing the > intermediated vector which partially contains actual updates but effectively > perform full vector write to destination address may not work in all > scenarios. Thanks for the comment! So how about adding the check for the valid array range like the masked vector load? Codes like: public final void intoArray(byte[] a, int offset, VectorMask m) { if (m.allTrue()) { intoArray(a, offset); } else { ByteSpecies vsp = vspecies(); if (offset >= 0 && offset <= (a.length - vsp.length())) { // a full range check intoArray0(a, offset, m, /* usePred */ false); // can be vectorized by load+blend_store } else { checkMaskFromIndexSize(offset, vsp, m, 1, a.length); intoArray0(a, offset, m, /* usePred */ true); // only be vectorized by the predicated store } } } - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
On Sat, 9 Apr 2022 00:10:40 GMT, Sandhya Viswanathan wrote: >> Currently the vector load with mask when the given index happens out of the >> array boundary is implemented with pure java scalar code to avoid the IOOBE >> (IndexOutOfBoundaryException). This is necessary for architectures that do >> not support the predicate feature. Because the masked load is implemented >> with a full vector load and a vector blend applied on it. And a full vector >> load will definitely cause the IOOBE which is not valid. However, for >> architectures that support the predicate feature like SVE/AVX-512/RVV, it >> can be vectorized with the predicated load instruction as long as the >> indexes of the masked lanes are within the bounds of the array. For these >> architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE >> part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >>byte[] a, int offset, >>VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must >> check whether the current backend supports it and falls back to the java >> scalar version if not. This is different from the normal masked vector load >> that the compiler will generate a full vector load and a vector blend if the >> predicate load is not supported. So to let the compiler make the expected >> action, an additional flag (i.e. `usePred`) is added to the existing >> "loadMasked" intrinsic, with the value "true" for the IOOBE part while >> "false" for the normal load. And the compiler will fail to intrinsify if the >> flag is "true" and the predicate load is not supported by the backend, which >> means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` >> on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java > line 2861: > >> 2859: ByteSpecies vsp = (ByteSpecies) species; >> 2860: if (offset >= 0 && offset <= (a.length - >> species.vectorByteSize())) { >> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* >> usePred */ false).maybeSwap(bo); > > Instead of usePred a term like inRange or offetInRage or offsetInVectorRange > would be easier to follow. Thanks for the review. I will change it later. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Rename the "usePred" to "offsetInRange" - Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/8f9e8a3c..9b2d2f19 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=00-01 Stats: 393 lines in 41 files changed: 0 ins; 0 del; 393 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Wed, 20 Apr 2022 02:46:09 GMT, Xiaohong Gong wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java >> line 2861: >> >>> 2859: ByteSpecies vsp = (ByteSpecies) species; >>> 2860: if (offset >= 0 && offset <= (a.length - >>> species.vectorByteSize())) { >>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* >>> usePred */ false).maybeSwap(bo); >> >> Instead of usePred a term like inRange or offetInRage or offsetInVectorRange >> would be easier to follow. > > Thanks for the review. I will change it later. The name is updated to `offsetInRange`. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Fri, 29 Apr 2022 21:34:13 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename the "usePred" to "offsetInRange" > > IIUC when the hardware does not support predicated loads then any false > `offsetIntRange` value causes the load intrinsic to fail resulting in the > fallback, so it would not be materially any different to the current > behavior, just more uniformly implemented. > > Why can't the intrinsic support the passing a boolean directly? Is it > something to do with constants? If that is not possible I recommend creating > named constant values and pass those all the way through rather than > converting a boolean to an integer value. Then there is no need for a branch > checking `offsetInRange`. > > Might be better to hold off until the JEP is integrated and then update, > since this will conflict (`byte[]` and `ByteBuffer` load methods are removed > and `MemorySegment` load methods are added). You could prepare for that now > by branching off `vectorIntrinsics`. Thanks for your comments @PaulSandoz ! > IIUC when the hardware does not support predicated loads then any false > `offsetIntRange` value causes the load intrinsic to fail resulting in the > fallback, so it would not be materially any different to the current > behavior, just more uniformly implemented. Yes, it's true that this patch doesn't have any different to the hardware that does not support the predicated loads. It only benefits the predicated feature supported systems like ARM SVE and X86 AVX-512. > Why can't the intrinsic support the passing a boolean directly? Is it > something to do with constants? If that is not possible I recommend creating > named constant values and pass those all the way through rather than > converting a boolean to an integer value. Then there is no need for a branch > checking offsetInRange. Yeah, I agree that it's not good by adding a branch checking for `offsetInRange`. But actually I met the constant issue that passing the values all the way cannot guarantee the argument a constant in compiler at the compile time. Do you have any better idea to fixing this? > Might be better to hold off until the JEP is integrated and then update, > since this will conflict (byte[] and ByteBuffer load methods are removed and > MemorySegment load methods are added). You could prepare for that now by > branching off vectorIntrinsics. Agree. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 01:21:40 GMT, Paul Sandoz wrote: > > Yeah, I agree that it's not good by adding a branch checking for > > `offsetInRange`. But actually I met the constant issue that passing the > > values all the way cannot guarantee the argument a constant in compiler at > > the compile time. Do you have any better idea to fixing this? > > That's odd, `boolean` constants are passed that are then converted to `int` > constants. Did you try passing integer constants all the way through? I will try again. I remember the main cause is the calling of `fromArray0` from `fromArray`, it is not annotated with `ForceInline`. The arguments might not be compiled to a constant for cases that the offset is not in the array range like tail loop. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 01:42:48 GMT, Xiaohong Gong wrote: > > > Yeah, I agree that it's not good by adding a branch checking for > > > `offsetInRange`. But actually I met the constant issue that passing the > > > values all the way cannot guarantee the argument a constant in compiler > > > at the compile time. Do you have any better idea to fixing this? > > > > > > That's odd, `boolean` constants are passed that are then converted to `int` > > constants. Did you try passing integer constants all the way through? > > I will try again. I remember the main cause is the calling of `fromArray0` > from `fromArray`, it is not annotated with `ForceInline`. The arguments might > not be compiled to a constant for cases that the offset is not in the array > range like tail loop. I tried to pass the integer constant all the way, and unfortunate that the `offsetInRange` is not compiled to a constant. The following assertion in the `vectorIntrinsics.cpp` will fail: --- a/src/hotspot/share/opto/vectorIntrinsics.cpp +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp @@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) { } else { // Masked vector load with IOOBE always uses the predicated load. const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); + assert(offset_in_range->is_con(), "not a constant"); if (!offset_in_range->is_con()) { if (C->print_intrinsics()) { tty->print_cr(" ** missing constant: offsetInRange=%s", - PR: https://git.openjdk.java.net/jdk/pull/8035
RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures
Currently the vectorization of masked vector store is implemented by the masked store instruction only on architectures that support the predicate feature. The compiler will fall back to the java scalar code for non-predicate supported architectures like ARM NEON. However, for these systems, the masked store can be vectorized with the non-masked vector `"load + blend + store"`. For example, storing a vector` "v"` controlled by a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be implemented with: 1) mem_v = load(addr) ; non-masked load from the same memory 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask 3) store(addr, v) ; non-masked store into the memory Since the first full loading needs the array offset must be inside of the valid array bounds, we make the compiler do the vectorization only when the offset is in range of the array boundary. And the compiler will still fall back to the java scalar code if not all offsets are valid. Besides, the original offset check for masked lanes are only applied when the offset is not always inside of the array range. This also improves the performance for masked store when the offset is always valid. The whole process is similar to the masked load API. Here is the performance data for the masked vector store benchmarks on a X86 non avx-512 system, which improves about `20x ~ 50x`: Benchmark beforeafter Units StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms StoreMaskedBenchmark.longStoreArrayMask40.929 1032.928 ops/ms StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms Similar performance gain can also be observed on ARM NEON system. And here is the performance data on X86 avx-512 system, which improves about `1.88x - 2.81x`: Benchmark before after Units StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms StoreMaskedBenchmark.longStoreArrayMask2025.031 4604.504 ops/ms StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms Similar performance gain can also be observed on ARM SVE system. - Depends on: https://git.openjdk.java.net/jdk/pull/8035 Commit messages: - 8284050: [vectorapi] Optimize masked store for non-predicated architectures - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature Changes: https://git.openjdk.java.net/jdk/pull/8544/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8544&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284050 Stats: 1708 lines in 44 files changed: 710 ins; 188 del; 810 mod Patch: https://git.openjdk.java.net/jdk/pull/8544.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8544/head:pull/8544 PR: https://git.openjdk.java.net/jdk/pull/8544
Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
> Currently the vectorization of masked vector store is implemented by the > masked store instruction only on architectures that support the predicate > feature. The compiler will fall back to the java scalar code for > non-predicate supported architectures like ARM NEON. However, for these > systems, the masked store can be vectorized with the non-masked vector `"load > + blend + store"`. For example, storing a vector` "v"` controlled by a mask` > "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be > implemented with: > > > 1) mem_v = load(addr) ; non-masked load from the same memory > 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask > 3) store(addr, v) ; non-masked store into the memory > > > Since the first full loading needs the array offset must be inside of the > valid array bounds, we make the compiler do the vectorization only when the > offset is in range of the array boundary. And the compiler will still fall > back to the java scalar code if not all offsets are valid. Besides, the > original offset check for masked lanes are only applied when the offset is > not always inside of the array range. This also improves the performance for > masked store when the offset is always valid. The whole process is similar to > the masked load API. > > Here is the performance data for the masked vector store benchmarks on a X86 > non avx-512 system, which improves about `20x ~ 50x`: > > Benchmark beforeafter Units > StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms > StoreMaskedBenchmark.longStoreArrayMask40.929 1032.928 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms > > Similar performance gain can also be observed on ARM NEON system. > > And here is the performance data on X86 avx-512 system, which improves about > `1.88x - 2.81x`: > > Benchmark before after Units > StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms > StoreMaskedBenchmark.longStoreArrayMask2025.031 4604.504 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms > > Similar performance gain can also be observed on ARM SVE system. Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit: 8284050: [vectorapi] Optimize masked store for non-predicated architectures - Changes: https://git.openjdk.java.net/jdk/pull/8544/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8544&range=01 Stats: 915 lines in 43 files changed: 405 ins; 80 del; 430 mod Patch: https://git.openjdk.java.net/jdk/pull/8544.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8544/head:pull/8544 PR: https://git.openjdk.java.net/jdk/pull/8544
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 28 Apr 2022 00:13:49 GMT, Sandhya Viswanathan wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename the "usePred" to "offsetInRange" > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: > >> 1230: // out when current case uses the predicate feature. >> 1231: if (!supports_predicate) { >> 1232: bool use_predicate = false; > > If we rename this to needs_predicate it will be easier to understand. Thanks for the comment! This local variable will be removed after adding the similar intrinsify for store masked. Please help to see the PR https://github.com/openjdk/jdk/pull/8544. Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 31 Mar 2022 02:15:26 GMT, Quan Anh Mai wrote: >> I'm afraid not. "Load + Blend" makes the elements of unmasked lanes to be >> `0`. Then a full store may change the values in the unmasked memory to be 0, >> which is different with the mask store API definition. > > The blend should be with the intended-to-store vector, so that masked lanes > contain the need-to-store elements and unmasked lanes contain the loaded > elements, which would be stored back, which results in unchanged values. Hi @merykitty @jatin-bhateja , could you please help to take a review at the similar store masked PR https://github.com/openjdk/jdk/pull/8544 ? Any feedback is welcome! Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
On Thu, 5 May 2022 02:27:03 GMT, John R Rose wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains one commit: >> >> 8284050: [vectorapi] Optimize masked store for non-predicated architectures > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1363: > >> 1361: // Use the vector blend to implement the masked store. The >> biased elements are the original >> 1362: // values in the memory. >> 1363: Node* mem_val = gvn().transform(LoadVectorNode::make(0, >> control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt)); > > I'm sorry to say it, but I am pretty sure this is an invalid optimization. > See top-level comment for more details. Thanks for your comments! Yeah, this actually influences something due to the Java Memory Model rules which I missed to consider more. I will try the scatter ways instead. Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8544
Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong wrote: >> Currently the vectorization of masked vector store is implemented by the >> masked store instruction only on architectures that support the predicate >> feature. The compiler will fall back to the java scalar code for >> non-predicate supported architectures like ARM NEON. However, for these >> systems, the masked store can be vectorized with the non-masked vector >> `"load + blend + store"`. For example, storing a vector` "v"` controlled by >> a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` >> can be implemented with: >> >> >> 1) mem_v = load(addr) ; non-masked load from the same memory >> 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask >> 3) store(addr, v) ; non-masked store into the memory >> >> >> Since the first full loading needs the array offset must be inside of the >> valid array bounds, we make the compiler do the vectorization only when the >> offset is in range of the array boundary. And the compiler will still fall >> back to the java scalar code if not all offsets are valid. Besides, the >> original offset check for masked lanes are only applied when the offset is >> not always inside of the array range. This also improves the performance for >> masked store when the offset is always valid. The whole process is similar >> to the masked load API. >> >> Here is the performance data for the masked vector store benchmarks on a X86 >> non avx-512 system, which improves about `20x ~ 50x`: >> >> Benchmark beforeafter Units >> StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask40.929 1032.928 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms >> >> Similar performance gain can also be observed on ARM NEON system. >> >> And here is the performance data on X86 avx-512 system, which improves about >> `1.88x - 2.81x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask2025.031 4604.504 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms >> >> Similar performance gain can also be observed on ARM SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains one commit: > > 8284050: [vectorapi] Optimize masked store for non-predicated architectures > _Mailing list message from [Hans Boehm](mailto:hbo...@google.com) on > [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_ > > Naive question: What happens if one of the vector elements that should not > have been updated is concurrently being written by another thread? Aren't you > generating writes to vector elements that should not have been written? > > Hans > > On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong > wrote: Yeah, this is the similar concern with what @rose00 mentioned above. The current solution cannot work well for multi-thread progresses. I will consider other better solutions. Thanks for the comments! src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3483: > 3481: ByteSpecies vsp = vspecies(); > 3482: if (offset >= 0 && offset <= (a.length - vsp.length())) { > 3483: intoBooleanArray0(a, offset, m, /* offsetInRange */ > true); The offset check could save the `checkMaskFromIndexSize` for cases that offset are in the valid array bounds, which also improves the performance. @rose00 , do you think this part of change is ok at least? Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8544
Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong wrote: >> Currently the vectorization of masked vector store is implemented by the >> masked store instruction only on architectures that support the predicate >> feature. The compiler will fall back to the java scalar code for >> non-predicate supported architectures like ARM NEON. However, for these >> systems, the masked store can be vectorized with the non-masked vector >> `"load + blend + store"`. For example, storing a vector` "v"` controlled by >> a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` >> can be implemented with: >> >> >> 1) mem_v = load(addr) ; non-masked load from the same memory >> 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask >> 3) store(addr, v) ; non-masked store into the memory >> >> >> Since the first full loading needs the array offset must be inside of the >> valid array bounds, we make the compiler do the vectorization only when the >> offset is in range of the array boundary. And the compiler will still fall >> back to the java scalar code if not all offsets are valid. Besides, the >> original offset check for masked lanes are only applied when the offset is >> not always inside of the array range. This also improves the performance for >> masked store when the offset is always valid. The whole process is similar >> to the masked load API. >> >> Here is the performance data for the masked vector store benchmarks on a X86 >> non avx-512 system, which improves about `20x ~ 50x`: >> >> Benchmark beforeafter Units >> StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask40.929 1032.928 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms >> >> Similar performance gain can also be observed on ARM NEON system. >> >> And here is the performance data on X86 avx-512 system, which improves about >> `1.88x - 2.81x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask2025.031 4604.504 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms >> >> Similar performance gain can also be observed on ARM SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains one commit: > > 8284050: [vectorapi] Optimize masked store for non-predicated architectures > _Mailing list message from [John Rose](mailto:john.r.r...@oracle.com) on > [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_ > > > On May 4, 2022, at 8:29 PM, Xiaohong Gong wrote: > > The offset check could save the `checkMaskFromIndexSize` for cases that > > offset are in the valid array bounds, which also improves the performance. > > @rose00 , do you think this part of change is ok at least? > > That part is ok, yes. I wish we could get the same effect with loop > optimizations but I don?t know an easy way. The explicit check in the source > code gives the JIT a crutch but I hope we can figure out a way in the future > to integrate mask logic into range check elimination logic, making the > crutches unnecessary. For now it?s fine. Thanks! So I will separate this part out and fix it in another PR first. For the store masked vectorization with scatter or other ideas, I'm not quite sure whether they can always benefit cross architectures and need more investigation. I prefer to close this PR now. Thanks for all your comments! - PR: https://git.openjdk.java.net/jdk/pull/8544
Withdrawn: 8284050: [vectorapi] Optimize masked store for non-predicated architectures
On Thu, 5 May 2022 02:00:04 GMT, Xiaohong Gong wrote: > Currently the vectorization of masked vector store is implemented by the > masked store instruction only on architectures that support the predicate > feature. The compiler will fall back to the java scalar code for > non-predicate supported architectures like ARM NEON. However, for these > systems, the masked store can be vectorized with the non-masked vector `"load > + blend + store"`. For example, storing a vector` "v"` controlled by a mask` > "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be > implemented with: > > > 1) mem_v = load(addr) ; non-masked load from the same memory > 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask > 3) store(addr, v) ; non-masked store into the memory > > > Since the first full loading needs the array offset must be inside of the > valid array bounds, we make the compiler do the vectorization only when the > offset is in range of the array boundary. And the compiler will still fall > back to the java scalar code if not all offsets are valid. Besides, the > original offset check for masked lanes are only applied when the offset is > not always inside of the array range. This also improves the performance for > masked store when the offset is always valid. The whole process is similar to > the masked load API. > > Here is the performance data for the masked vector store benchmarks on a X86 > non avx-512 system, which improves about `20x ~ 50x`: > > Benchmark beforeafter Units > StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms > StoreMaskedBenchmark.longStoreArrayMask40.929 1032.928 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms > > Similar performance gain can also be observed on ARM NEON system. > > And here is the performance data on X86 avx-512 system, which improves about > `1.88x - 2.81x`: > > Benchmark before after Units > StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms > StoreMaskedBenchmark.longStoreArrayMask2025.031 4604.504 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms > > Similar performance gain can also be observed on ARM SVE system. This pull request has been closed without being integrated. - PR: https://git.openjdk.java.net/jdk/pull/8544
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 02:14:08 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: >> >>> 1230: // out when current case uses the predicate feature. >>> 1231: if (!supports_predicate) { >>> 1232: bool use_predicate = false; >> >> If we rename this to needs_predicate it will be easier to understand. > > Thanks for the comment! This local variable will be removed after adding the > similar intrinsify for store masked. Please help to see the PR > https://github.com/openjdk/jdk/pull/8544. Thanks so much! Renamed to "needs_predicate". Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Rename "use_predicate" to "needs_predicate" - Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/9b2d2f19..9c69206e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 5 May 2022 19:27:47 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1238: > >> 1236: } else { >> 1237: // Masked vector load with IOOBE always uses the predicated load. >> 1238: const TypeInt* offset_in_range = >> gvn().type(argument(8))->isa_int(); > > Should it be `argument(7)`? (and adjustments later to access the container). I'm afraid it's `argument(8)` for the load operation since the `argument(7)` is the mask input. It seems the argument number is not right begin from the mask input which is expected to be `6`. But the it's not. Actually I don't quite understand why. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Fri, 6 May 2022 04:22:30 GMT, Sandhya Viswanathan wrote: >> I'm afraid it's `argument(8)` for the load operation since the `argument(7)` >> is the mask input. It seems the argument number is not right begin from the >> mask input which is expected to be `6`. But the it's not. Actually I don't >> quite understand why. > > offset is long so uses two argument slots (5 and 6). > mask is argument (7). > offsetInRange is argument(8). Make sense! Thanks for the explanation! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Fri, 6 May 2022 14:59:26 GMT, Paul Sandoz wrote: >> Make sense! Thanks for the explanation! > > Doh! of course. This is not the first and will not be the last time i get > caught out by the 2-slot requirement. > It may be useful to do this: > > Node* mask_arg = is_store ? argument(8) : argument(7); Yes, the mask argument is got like: Node* mask = unbox_vector(is_store ? argument(8) : argument(7), mbox_type, elem_bt, num_elem); - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz wrote: > I modified the code of this PR to avoid the conversion of `boolean` to `int`, > so a constant integer value is passed all the way through, and the masked > load is made intrinsic from the method at which the constants are passed as > arguments i.e. the public `fromArray` mask accepting method. Great and thanks! Could you please show me the changes or an example? I can push the changes to this PR. Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8035
RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store
Checking whether the indexes of masked lanes are inside of the valid memory boundary is necessary for masked vector memory access. However, this could be saved if the given offset is inside of the vector range that could make sure no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs have saved this kind of check for common cases. And this patch did the similar optimization for the masked vector store. The performance for the new added store masked benchmarks improves about `1.83x ~ 2.62x` on a x86 system: Benchmark BeforeAfter Gain Units StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms Similar performane gain can also be observed on ARM hardware. - Commit messages: - 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store Changes: https://git.openjdk.java.net/jdk/pull/8620/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286279 Stats: 213 lines in 8 files changed: 188 ins; 0 del; 25 mod Patch: https://git.openjdk.java.net/jdk/pull/8620.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8620/head:pull/8620 PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > I modified the code of this PR to avoid the conversion of `boolean` to `int`, > so a constant integer value is passed all the way through, and the masked > load is made intrinsic from the method at which the constants are passed as > arguments i.e. the public `fromArray` mask accepting method. Hi @PaulSandoz , thanks for the patch for the constant int parameter. I think the main change is: -ByteVector fromArray0Template(Class maskClass, C base, long offset, int index, M m, boolean offsetInRange, +ByteVector fromArray0Template(Class maskClass, C base, long offset, int index, M m, int offsetInRange, VectorSupport.LoadVectorMaskedOperation defaultImpl) { m.check(species()); ByteSpecies vsp = vspecies(); -if (offsetInRange) { -return VectorSupport.loadMasked( -vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(), -base, offset, m, /* offsetInRange */ 1, -base, index, vsp, defaultImpl); -} else { -return VectorSupport.loadMasked( -vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(), -base, offset, m, /* offsetInRange */ 0, -base, index, vsp, defaultImpl); -} +return VectorSupport.loadMasked( +vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(), +base, offset, m, offsetInRange == 1 ? 1 : 0, +base, index, vsp, defaultImpl); } which uses `offsetInRange == 1 ? 1 : 0`. Unfortunately this could not always make sure the `offsetInRange` a constant a the compiler time. Again, this change could also make the assertion fail randomly: --- a/src/hotspot/share/opto/vectorIntrinsics.cpp +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp @@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) { } else { // Masked vector load with IOOBE always uses the predicated load. const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); + assert(offset_in_range->is_con(), "must be a constant"); if (!offset_in_range->is_con()) { if (C->print_intrinsics()) { tty->print_cr(" ** missing constant: offsetInRange=%s", Sometimes, the compiler can parse it a constant. I think this depends on the compiler OSR and speculative optimization. Did you try an example with IOOBE on a non predicated hardware? Here is the main code of my unittest to reproduce the issue: static final VectorSpecies I_SPECIES = IntVector.SPECIES_128; static final int LENGTH = 1026; public static int[] ia; public static int[] ib; private static void init() { for (int i = 0; i < LENGTH; i++) { ia[i] = i; ib[i] = 0; } for (int i = 0; i < 2; i++) { m[i] = i % 2 == 0; } } private static void func() { VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += vl) { IntVector av = IntVector.fromArray(I_SPECIES, ia, i, mask); av.lanewise(VectorOperators.ABS).intoArray(ic, i, mask); } } public static void main(String[] args) { init(); for (int i = 0; i < 1; i++) { func(); } } - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store
On Tue, 10 May 2022 01:23:55 GMT, Xiaohong Gong wrote: > Checking whether the indexes of masked lanes are inside of the valid memory > boundary is necessary for masked vector memory access. However, this could be > saved if the given offset is inside of the vector range that could make sure > no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs have > saved this kind of check for common cases. And this patch did the similar > optimization for the masked vector store. > > The performance for the new added store masked benchmarks improves about > `1.83x ~ 2.62x` on a x86 system: > > Benchmark BeforeAfter Gain Units > StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms > StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms > > Similar performane gain can also be observed on ARM hardware. Hi @PaulSandoz @rose00, could you please help to take a look at this simple PR? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store
On Wed, 11 May 2022 15:10:55 GMT, Paul Sandoz wrote: >> Checking whether the indexes of masked lanes are inside of the valid memory >> boundary is necessary for masked vector memory access. However, this could >> be saved if the given offset is inside of the vector range that could make >> sure no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs >> have saved this kind of check for common cases. And this patch did the >> similar optimization for the masked vector store. >> >> The performance for the new added store masked benchmarks improves about >> `1.83x ~ 2.62x` on a x86 system: >> >> Benchmark BeforeAfter Gain Units >> StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms >> >> Similar performane gain can also be observed on ARM hardware. > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-Vector.java.template > line 4086: > >> 4084: } else { >> 4085: $Type$Species vsp = vspecies(); >> 4086: if (offset < 0 || offset > (a.length - vsp.length())) { > > Can we use `VectorIntrinsics.checkFromIndexSize`? e.g. > > if (!VectorIntrinsics.checkFromIndexSize(offset, vsp.length(), a.length)) { > ... Thanks for the review @PaulSandoz ! For the `VectorIntrinsics.checkFromIndexSize`, I'm afraid it's not suitable to be used here because the `outOfBounds` exception will be thrown if the offset is not inside of the valid array boundary. And for the masked operations, this is not needed since we only need to check the masked lanes. Please correct me if I didn't understand correctly. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Wed, 11 May 2022 19:45:55 GMT, Paul Sandoz wrote: > I tried your test code with the patch and logged compilation > (`-XX:-TieredCompilation -XX:+PrintCompilation -XX:+PrintInlining > -XX:+PrintIntrinsics -Xbatch`) > > For `func` the first call to `VectorSupport::loadMasked` is intrinsic and > inlined: > > ``` > @ 45 jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes) > (intrinsic) > ``` > > But the second call (for the last loop iteration) fails to inline: > > ``` > @ 45 jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes) > failed to inline (intrinsic) > ``` > > Since i am running on an mac book this looks right and aligns with the > `-XX:+PrintIntrinsics` output: > > ``` > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does > not support it > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does > not support it > ** not supported: op=loadMasked vlen=8 etype=int using_byte_array=0 > ``` > > ? > > I have not looked at the code gen nor measured performance comparing the case > when never out of bounds and only out of bounds for the last loop iteration. Yeah, it looks right from the log. Did you try to find whether there is the log `** missing constant: offsetInRange=Parm` with `XX:+PrintIntrinsics` ? Or insert an assertion in `vectorIntrinsics.cpp` like: --- a/src/hotspot/share/opto/vectorIntrinsics.cpp +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp @@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) { } else { // Masked vector load with IOOBE always uses the predicated load. const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); + assert(offset_in_range->is_con(), "must be a constant"); if (!offset_in_range->is_con()) { if (C->print_intrinsics()) { tty->print_cr(" ** missing constant: offsetInRange=%s", And run the tests with debug mode. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store
On Thu, 12 May 2022 03:36:31 GMT, Paul Sandoz wrote: >> Thanks for the review @PaulSandoz ! For the >> `VectorIntrinsics.checkFromIndexSize`, I'm afraid it's not suitable to be >> used here because the `outOfBounds` exception will be thrown if the offset >> is not inside of the valid array boundary. And for the masked operations, >> this is not needed since we only need to check the masked lanes. Please >> correct me if I didn't understand correctly. Thanks! > > Silly me! i commented too quickly, `checkFromIndexSize` cannot be used. My > intent was that we could use a method rather than repeating the expression in > numerous places (including masked loads IIRC). Good idea! I will try. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store
On Thu, 12 May 2022 09:49:17 GMT, Quan Anh Mai wrote: > Maybe we could use `a.length - vsp.length() > 0 && offset u< a.length - > vsp.length()` which would hoist the first check outside of the loop. Thanks. Thanks for the review @merykitty ! We need the check `offset >= 0` which I think is different from `a.length - vsp.length()`. - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
> Checking whether the indexes of masked lanes are inside of the valid memory > boundary is necessary for masked vector memory access. However, this could be > saved if the given offset is inside of the vector range that could make sure > no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs have > saved this kind of check for common cases. And this patch did the similar > optimization for the masked vector store. > > The performance for the new added store masked benchmarks improves about > `1.83x ~ 2.62x` on a x86 system: > > Benchmark BeforeAfter Gain Units > StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms > StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms > > Similar performane gain can also be observed on ARM hardware. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Wrap the offset check into a static method - Changes: - all: https://git.openjdk.java.net/jdk/pull/8620/files - new: https://git.openjdk.java.net/jdk/pull/8620/files/eb67f681..2229dd24 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=00-01 Stats: 56 lines in 8 files changed: 5 ins; 0 del; 51 mod Patch: https://git.openjdk.java.net/jdk/pull/8620.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8620/head:pull/8620 PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
On Thu, 12 May 2022 03:36:31 GMT, Paul Sandoz wrote: >> Thanks for the review @PaulSandoz ! For the >> `VectorIntrinsics.checkFromIndexSize`, I'm afraid it's not suitable to be >> used here because the `outOfBounds` exception will be thrown if the offset >> is not inside of the valid array boundary. And for the masked operations, >> this is not needed since we only need to check the masked lanes. Please >> correct me if I didn't understand correctly. Thanks! > > Silly me! i commented too quickly, `checkFromIndexSize` cannot be used. My > intent was that we could use a method rather than repeating the expression in > numerous places (including masked loads IIRC). Hi @PaulSandoz , I'v updated the offset check for masked load/store. Could you please help to check whether it is ok? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: > Yes, the tests were run in debug mode. The reporting of the missing constant > occurs for the compiled method that is called from the method where the > constants are declared e.g.: > > ``` > 719 240bjdk.incubator.vector.Int256Vector::fromArray0 (15 > bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does > not support it > ** missing constant: offsetInRange=Parm > @ 11 > jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline > by annotation > ``` > > So it appears to be working as expected. A similar pattern occurs at a > lower-level for the passing of the mask class. `Int256Vector::fromArray0` > passes a constant class to `IntVector::fromArray0Template` (the compilation > of which bails out before checking that the `offsetInRange` is constant). Make sense to me! I think I understand what you mean. I will have more tests with the integer constant change. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
On Fri, 13 May 2022 01:35:40 GMT, Xiaohong Gong wrote: >> Checking whether the indexes of masked lanes are inside of the valid memory >> boundary is necessary for masked vector memory access. However, this could >> be saved if the given offset is inside of the vector range that could make >> sure no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs >> have saved this kind of check for common cases. And this patch did the >> similar optimization for the masked vector store. >> >> The performance for the new added store masked benchmarks improves about >> `1.83x ~ 2.62x` on a x86 system: >> >> Benchmark BeforeAfter Gain Units >> StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms >> >> Similar performane gain can also be observed on ARM hardware. > > Xiaohong Gong has updated the pull request incrementally with one additional > commit since the last revision: > > Wrap the offset check into a static method Thanks for the explanation! Yeah, the main problem is Java doesn't have the direct unsigned comparison. We need the function call. From the two ways you provided, I think the second `Integer.lessThanUnsigned` looks better. But I'm not sure whether this could improve the performance a lot, although the first check `a.length - vsp.length() > 0` can be hosited out side of the loop. And this might make the codes more complex for me. Maybe we can do a pre research to find a better implementation to the unsigned comparison first. Do you think so? - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > Yes, the tests were run in debug mode. The reporting of the missing constant > occurs for the compiled method that is called from the method where the > constants are declared e.g.: > > 719 240bjdk.incubator.vector.Int256Vector::fromArray0 (15 > bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does > not support it > ** missing constant: offsetInRange=Parm > @ 11 > jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline > by annotation > > > So it appears to be working as expected. A similar pattern occurs at a > lower-level for the passing of the mask class. `Int256Vector::fromArray0` > passes a constant class to `IntVector::fromArray0Template` (the compilation > of which bails out before checking that the `offsetInRange` is constant). You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, and no failure and performance regression are found. I will update the patch soon. Thanks for the help! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v4]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Use integer constant for offsetInRange all the way through - Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/9c69206e..07edfbd5 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02-03 Stats: 438 lines in 39 files changed: 33 ins; 118 del; 287 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6]
On Thu, 19 May 2022 08:53:31 GMT, Ningsheng Jian wrote: >>> LUT should be generated only if UsePopCountInsturction is false >> >> Should there be `!UsePopCountInsturction` check then? >> >>> restrict the scope of flag to only scalar popcount operation >> >> Interesting. But AArch64 code does cover vector cases which just adds >> confusion. > >> Interesting. But AArch64 code does cover vector cases which just adds >> confusion. > > `UsePopCountInsturction` is always true in AArch64. @XiaohongGong removed the > `predicate` in aarch64 rules, and I think we can even remove the option check > in match_rule_supported(). Ok , I will remove the check for it. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8425
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
On Fri, 13 May 2022 01:35:40 GMT, Xiaohong Gong wrote: >> Checking whether the indexes of masked lanes are inside of the valid memory >> boundary is necessary for masked vector memory access. However, this could >> be saved if the given offset is inside of the vector range that could make >> sure no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs >> have saved this kind of check for common cases. And this patch did the >> similar optimization for the masked vector store. >> >> The performance for the new added store masked benchmarks improves about >> `1.83x ~ 2.62x` on a x86 system: >> >> Benchmark BeforeAfter Gain Units >> StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms >> >> Similar performane gain can also be observed on ARM hardware. > > Xiaohong Gong has updated the pull request incrementally with one additional > commit since the last revision: > > Wrap the offset check into a static method @PaulSandoz, could you please help to check whether the current version is ok for you? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
On Mon, 30 May 2022 01:17:00 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Wrap the offset check into a static method > > @PaulSandoz, could you please help to check whether the current version is ok > for you? Thanks so much! > @XiaohongGong looks good, now the Vector API JEP has been integrated you will > get a merge conflict, but it should be easier to resolve. Great! I will resolve the conflict and update the patch, thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v3]
> Checking whether the indexes of masked lanes are inside of the valid memory > boundary is necessary for masked vector memory access. However, this could be > saved if the given offset is inside of the vector range that could make sure > no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs have > saved this kind of check for common cases. And this patch did the similar > optimization for the masked vector store. > > The performance for the new added store masked benchmarks improves about > `1.83x ~ 2.62x` on a x86 system: > > Benchmark BeforeAfter Gain Units > StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms > StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms > > Similar performane gain can also be observed on ARM hardware. Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - Merge branch 'jdk:master' to JDK-8286279 - Wrap the offset check into a static method - 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store - Changes: https://git.openjdk.java.net/jdk/pull/8620/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8620&range=02 Stats: 216 lines in 9 files changed: 179 ins; 0 del; 37 mod Patch: https://git.openjdk.java.net/jdk/pull/8620.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8620/head:pull/8620 PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
On Tue, 31 May 2022 16:48:27 GMT, Paul Sandoz wrote: >> @PaulSandoz, could you please help to check whether the current version is >> ok for you? Thanks so much! > > @XiaohongGong looks good, now the Vector API JEP has been integrated you will > get a merge conflict, but it should be easier to resolve. Hi @PaulSandoz , I updated the patch and resolved the conflicts. Could you please take a look at it again? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong wrote: >> Yes, the tests were run in debug mode. The reporting of the missing constant >> occurs for the compiled method that is called from the method where the >> constants are declared e.g.: >> >> 719 240bjdk.incubator.vector.Int256Vector::fromArray0 (15 >> bytes) >> ** Rejected vector op (LoadVectorMasked,int,8) because architecture does >> not support it >> ** missing constant: offsetInRange=Parm >> @ 11 >> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline >> by annotation >> >> >> So it appears to be working as expected. A similar pattern occurs at a >> lower-level for the passing of the mask class. `Int256Vector::fromArray0` >> passes a constant class to `IntVector::fromArray0Template` (the compilation >> of which bails out before checking that the `offsetInRange` is constant). > > You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, > and no failure and performance regression are found. I will update the patch > soon. Thanks for the help! > @XiaohongGong Could you please rebase the branch and resolve conflicts? Sure, I'm working on this now. The patch will be updated soon. Thanks. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'jdk:master' into JDK-8283667 - Use integer constant for offsetInRange all the way through - Rename "use_predicate" to "needs_predicate" - Rename the "usePred" to "offsetInRange" - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature - Changes: https://git.openjdk.java.net/jdk/pull/8035/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=04 Stats: 447 lines in 43 files changed: 168 ins; 21 del; 258 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 2 Jun 2022 01:49:10 GMT, Xiaohong Gong wrote: > > @XiaohongGong Could you please rebase the branch and resolve conflicts? > > Sure, I'm working on this now. The patch will be updated soon. Thanks. Resolved the conflicts. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong wrote: >>> @XiaohongGong Could you please rebase the branch and resolve conflicts? >> >> Sure, I'm working on this now. The patch will be updated soon. Thanks. > >> > @XiaohongGong Could you please rebase the branch and resolve conflicts? >> >> Sure, I'm working on this now. The patch will be updated soon. Thanks. > > Resolved the conflicts. Thanks! > @XiaohongGong You need one more review approval. Sure! Thanks a lot for the review! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > Yes, the tests were run in debug mode. The reporting of the missing constant > occurs for the compiled method that is called from the method where the > constants are declared e.g.: > > 719 240bjdk.incubator.vector.Int256Vector::fromArray0 (15 > bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does > not support it > ** missing constant: offsetInRange=Parm > @ 11 > jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline > by annotation > > > So it appears to be working as expected. A similar pattern occurs at a > lower-level for the passing of the mask class. `Int256Vector::fromArray0` > passes a constant class to `IntVector::fromArray0Template` (the compilation > of which bails out before checking that the `offsetInRange` is constant). Hi @PaulSandoz , could you please take a look at this PR again? Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store [v2]
On Tue, 31 May 2022 16:48:27 GMT, Paul Sandoz wrote: >> @PaulSandoz, could you please help to check whether the current version is >> ok for you? Thanks so much! > > @XiaohongGong looks good, now the Vector API JEP has been integrated you will > get a merge conflict, but it should be easier to resolve. Thanks for the review @PaulSandoz @merykitty ! - PR: https://git.openjdk.java.net/jdk/pull/8620
Integrated: 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store
On Tue, 10 May 2022 01:23:55 GMT, Xiaohong Gong wrote: > Checking whether the indexes of masked lanes are inside of the valid memory > boundary is necessary for masked vector memory access. However, this could be > saved if the given offset is inside of the vector range that could make sure > no IOOBE (IndexOutOfBoundaryException) happens. The masked load APIs have > saved this kind of check for common cases. And this patch did the similar > optimization for the masked vector store. > > The performance for the new added store masked benchmarks improves about > `1.83x ~ 2.62x` on a x86 system: > > Benchmark BeforeAfter Gain Units > StoreMaskedBenchmark.byteStoreArrayMask 12757.936 23291.118 1.826 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1520.932 3921.616 2.578 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2713.031 7122.535 2.625 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4113.772 8220.206 1.998 ops/ms > StoreMaskedBenchmark.longStoreArrayMask1993.986 4874.148 2.444 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8543.593 17821.086 2.086 ops/ms > > Similar performane gain can also be observed on ARM hardware. This pull request has now been integrated. Changeset: ef7cc210 Author:Xiaohong Gong URL: https://git.openjdk.java.net/jdk/commit/ef7cc2105c66de443d3a9af706220272018a0d8d Stats: 216 lines in 9 files changed: 179 ins; 0 del; 37 mod 8286279: [vectorapi] Only check index of masked lanes if offset is out of array boundary for masked store Reviewed-by: psandoz - PR: https://git.openjdk.java.net/jdk/pull/8620
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]
On Mon, 6 Jun 2022 10:40:45 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'jdk:master' into JDK-8283667 >> - Use integer constant for offsetInRange all the way through >> - Rename "use_predicate" to "needs_predicate" >> - Rename the "usePred" to "offsetInRange" >> - 8283667: [vectorapi] Vectorization for masked load with IOOBE with >> predicate feature > > test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java > line 97: > >> 95: public void byteLoadArrayMaskIOOBE() { >> 96: for (int i = 0; i < inSize; i += bspecies.length()) { >> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i); > > For other case "if (offset >= 0 && offset <= (a.length - species.length())) > )" we are anyways intrinsifying, should we limit this micro to work only for > newly optimized case. Yeah, thanks and it's really a good suggestion to limit this benchmark only for the IOOBE cases. I locally modified the tests to make sure only the IOOBE case happens and the results show good as well. But do you think it's better to keep as it is since we can also see the performance of the common cases to make sure no regressions happen? As the current benchmarks can also show the performance gain by this PR. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE - Merge branch 'jdk:master' into JDK-8283667 - Merge branch 'jdk:master' into JDK-8283667 - Use integer constant for offsetInRange all the way through - Rename "use_predicate" to "needs_predicate" - Rename the "usePred" to "offsetInRange" - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature - Changes: https://git.openjdk.java.net/jdk/pull/8035/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=05 Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]
On Mon, 6 Jun 2022 15:41:06 GMT, Paul Sandoz wrote: > Looks good. As a follow on PR I think it would be useful to add constants > `OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`, then it becomes much clearer in > source and you can drop the `/* offsetInRange */` comment on the argument. Hi @PaulSandoz , thanks for the advice! I'v rebased the codes and added these two constants in the latest patch. Thanks again for the review! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]
On Tue, 7 Jun 2022 06:41:36 GMT, Jatin Bhateja wrote: >> Yeah, thanks and it's really a good suggestion to limit this benchmark only >> for the IOOBE cases. I locally modified the tests to make sure only the >> IOOBE case happens and the results show good as well. But do you think it's >> better to keep as it is since we can also see the performance of the common >> cases to make sure no regressions happen? As the current benchmarks can also >> show the performance gain by this PR. > > It was just to remove the noise from a targeted micro benchmark. But we can > keep it as it is. OK, thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > Yes, the tests were run in debug mode. The reporting of the missing constant > occurs for the compiled method that is called from the method where the > constants are declared e.g.: > > 719 240bjdk.incubator.vector.Int256Vector::fromArray0 (15 > bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does > not support it > ** missing constant: offsetInRange=Parm > @ 11 > jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline > by annotation > > > So it appears to be working as expected. A similar pattern occurs at a > lower-level for the passing of the mask class. `Int256Vector::fromArray0` > passes a constant class to `IntVector::fromArray0Template` (the compilation > of which bails out before checking that the `offsetInRange` is constant). Thanks for the review @PaulSandoz @sviswa7 @jatin-bhateja @merykitty ! - PR: https://git.openjdk.java.net/jdk/pull/8035
Integrated: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote: > Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. This pull request has now been integrated. Changeset: 39fa52b5 Author:Xiaohong Gong URL: https://git.openjdk.java.net/jdk/commit/39fa52b5f7504eca7399b863b0fb934bdce37f7e Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature Reviewed-by: sviswanathan, psandoz - PR: https://git.openjdk.java.net/jdk/pull/8035