Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-07 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> 
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

Thanks for the review @PaulSandoz  @sviswa7 @jatin-bhateja @merykitty !

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]

2022-06-07 Thread Jatin Bhateja
On Tue, 7 Jun 2022 04:29:40 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains seven commits:
> 
>  - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

LGTM.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-07 Thread Jatin Bhateja
On Tue, 7 Jun 2022 02:22:53 GMT, Xiaohong Gong  wrote:

>> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java
>>  line 97:
>> 
>>> 95: public void byteLoadArrayMaskIOOBE() {
>>> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
>>> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, 
>>> i);
>> 
>> For other case "if (offset >= 0 && offset <= (a.length - species.length())) 
>> )" we are anyways intrinsifying, should we limit this micro to work only for 
>> newly optimized case.
>
> Yeah, thanks and it's really a good suggestion to limit this benchmark only 
> for the IOOBE cases. I locally modified the tests to make sure only the IOOBE 
> case happens and the results show good as well. But do you think it's better 
> to keep as it is since we can also see the performance of the common cases to 
> make sure no regressions happen? As the current benchmarks can also show the 
> performance gain by this PR.

It was just to remove the noise from a targeted micro benchmark. But we can 
keep it as it is.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-07 Thread Xiaohong Gong
On Tue, 7 Jun 2022 06:41:36 GMT, Jatin Bhateja  wrote:

>> Yeah, thanks and it's really a good suggestion to limit this benchmark only 
>> for the IOOBE cases. I locally modified the tests to make sure only the 
>> IOOBE case happens and the results show good as well. But do you think it's 
>> better to keep as it is since we can also see the performance of the common 
>> cases to make sure no regressions happen? As the current benchmarks can also 
>> show the performance gain by this PR.
>
> It was just to remove the noise from a targeted micro benchmark. But we can 
> keep it as it is.

OK, thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Mon, 6 Jun 2022 15:41:06 GMT, Paul Sandoz  wrote:

> Looks good. As a follow on PR I think it would be useful to add constants 
> `OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`, then it becomes much clearer in 
> source and you can drop the `/* offsetInRange */` comment on the argument.

Hi @PaulSandoz , thanks for the advice! I'v rebased the codes and added these 
two constants in the latest patch. Thanks again for the review!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]

2022-06-06 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains seven commits:

 - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE
 - Merge branch 'jdk:master' into JDK-8283667
 - Merge branch 'jdk:master' into JDK-8283667
 - Use integer constant for offsetInRange all the way through
 - Rename "use_predicate" to "needs_predicate"
 - Rename the "usePred" to "offsetInRange"
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

-

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8035=05
  Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Mon, 6 Jun 2022 10:40:45 GMT, Jatin Bhateja  wrote:

>> Xiaohong Gong has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains five commits:
>> 
>>  - Merge branch 'jdk:master' into JDK-8283667
>>  - Use integer constant for offsetInRange all the way through
>>  - Rename "use_predicate" to "needs_predicate"
>>  - Rename the "usePred" to "offsetInRange"
>>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
>> predicate feature
>
> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java
>  line 97:
> 
>> 95: public void byteLoadArrayMaskIOOBE() {
>> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
>> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i);
> 
> For other case "if (offset >= 0 && offset <= (a.length - species.length())) 
> )" we are anyways intrinsifying, should we limit this micro to work only for 
> newly optimized case.

Yeah, thanks and it's really a good suggestion to limit this benchmark only for 
the IOOBE cases. I locally modified the tests to make sure only the IOOBE case 
happens and the results show good as well. But do you think it's better to keep 
as it is since we can also see the performance of the common cases to make sure 
no regressions happen? As the current benchmarks can also show the performance 
gain by this PR.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Paul Sandoz
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

Looks good. As a follow on PR I think it would be useful to add constants 
`OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`, then it becomes much clearer in 
source and you can drop the `/* offsetInRange */` comment on the argument.

-

Marked as reviewed by psandoz (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Jatin Bhateja
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java 
line 97:

> 95: public void byteLoadArrayMaskIOOBE() {
> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i);

For other case "if (offset >= 0 && offset <= (a.length - species.length())) )" 
we are anyways intrinsifying, should we limit this micro to work only for newly 
optimized case.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-05 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> 
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

Hi @PaulSandoz , could you please take a look at this PR again? Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-05 Thread Xiaohong Gong
On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong  wrote:

>>> @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
>
>> > @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
> 
> Resolved the conflicts. Thanks!

> @XiaohongGong You need one more review approval.

Sure! Thanks a lot for the review!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-02 Thread Sandhya Viswanathan
On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong  wrote:

>>> @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
>
>> > @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
> 
> Resolved the conflicts. Thanks!

@XiaohongGong You need one more review approval.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-02 Thread Sandhya Viswanathan
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-01 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains five commits:

 - Merge branch 'jdk:master' into JDK-8283667
 - Use integer constant for offsetInRange all the way through
 - Rename "use_predicate" to "needs_predicate"
 - Rename the "usePred" to "offsetInRange"
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

-

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8035=04
  Stats: 447 lines in 43 files changed: 168 ins; 21 del; 258 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-01 Thread Xiaohong Gong
On Thu, 2 Jun 2022 01:49:10 GMT, Xiaohong Gong  wrote:

> > @XiaohongGong Could you please rebase the branch and resolve conflicts?
> 
> Sure, I'm working on this now. The patch will be updated soon. Thanks.

Resolved the conflicts. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-01 Thread Xiaohong Gong
On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong  wrote:

>> Yes, the tests were run in debug mode. The reporting of the missing constant 
>> occurs for the compiled method that is called from the method where the 
>> constants are declared e.g.:
>> 
>> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
>> bytes)
>>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
>> not support it
>>   ** missing constant: offsetInRange=Parm
>> @ 11   
>> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
>> by annotation
>> 
>> 
>> So it appears to be working as expected. A similar pattern occurs at a 
>> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
>> passes a constant class to `IntVector::fromArray0Template` (the compilation 
>> of which bails out before checking that the `offsetInRange` is constant).
>
> You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, 
> and no failure and performance regression are found. I will update the patch 
> soon. Thanks for the help!

> @XiaohongGong Could you please rebase the branch and resolve conflicts?

Sure, I'm working on this now. The patch will be updated soon. Thanks.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-01 Thread Sandhya Viswanathan
On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong  wrote:

>> Yes, the tests were run in debug mode. The reporting of the missing constant 
>> occurs for the compiled method that is called from the method where the 
>> constants are declared e.g.:
>> 
>> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
>> bytes)
>>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
>> not support it
>>   ** missing constant: offsetInRange=Parm
>> @ 11   
>> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
>> by annotation
>> 
>> 
>> So it appears to be working as expected. A similar pattern occurs at a 
>> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
>> passes a constant class to `IntVector::fromArray0Template` (the compilation 
>> of which bails out before checking that the `offsetInRange` is constant).
>
> You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, 
> and no failure and performance regression are found. I will update the patch 
> soon. Thanks for the help!

@XiaohongGong Could you please rebase the branch and resolve conflicts?

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v4]

2022-05-13 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Use integer constant for offsetInRange all the way through

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/9c69206e..07edfbd5

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=8035=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=8035=02-03

  Stats: 438 lines in 39 files changed: 33 ins; 118 del; 287 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-13 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> 
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, and 
no failure and performance regression are found. I will update the patch soon. 
Thanks for the help!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-12 Thread Xiaohong Gong
On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz  wrote:

> Yes, the tests were run in debug mode. The reporting of the missing constant 
> occurs for the compiled method that is called from the method where the 
> constants are declared e.g.:
> 
> ```
> 719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
> bytes)
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** missing constant: offsetInRange=Parm
> @ 11   
> jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline 
> by annotation
> ```
> 
> So it appears to be working as expected. A similar pattern occurs at a 
> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
> passes a constant class to `IntVector::fromArray0Template` (the compilation 
> of which bails out before checking that the `offsetInRange` is constant).

Make sense to me! I think I understand what you mean. I will have more tests 
with the integer constant change. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-12 Thread Paul Sandoz
On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename "use_predicate" to "needs_predicate"

Yes, the tests were run in debug mode. The reporting of the missing constant 
occurs for the compiled method that is called from the method where the 
constants are declared e.g.:

719  240bjdk.incubator.vector.Int256Vector::fromArray0 (15 
bytes)
  ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not 
support it
  ** missing constant: offsetInRange=Parm
@ 11   
jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline by 
annotation


So it appears to be working as expected. A similar pattern occurs at a 
lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
passes a constant class to `IntVector::fromArray0Template` (the compilation of 
which bails out before checking that the `offsetInRange` is constant).

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-11 Thread Xiaohong Gong
On Wed, 11 May 2022 19:45:55 GMT, Paul Sandoz  wrote:

> I tried your test code with the patch and logged compilation 
> (`-XX:-TieredCompilation -XX:+PrintCompilation -XX:+PrintInlining 
> -XX:+PrintIntrinsics -Xbatch`)
> 
> For `func` the first call to `VectorSupport::loadMasked` is intrinsic and 
> inlined:
> 
> ```
> @ 45   jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes)   
> (intrinsic)
> ```
> 
> But the second call (for the last loop iteration) fails to inline:
> 
> ```
> @ 45   jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes)   
> failed to inline (intrinsic)
> ```
> 
> Since i am running on an mac book this looks right and aligns with the 
> `-XX:+PrintIntrinsics` output:
> 
> ```
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does 
> not support it
>   ** not supported: op=loadMasked vlen=8 etype=int using_byte_array=0
> ```
> 
> ?
> 
> I have not looked at the code gen nor measured performance comparing the case 
> when never out of bounds and only out of bounds for the last loop iteration.

Yeah, it looks right from the log. Did you try to find whether there is the log 
`** missing constant: offsetInRange=Parm` with `XX:+PrintIntrinsics` ? Or 
insert an assertion in `vectorIntrinsics.cpp` like:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool 
LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
 } else {
   // Masked vector load with IOOBE always uses the predicated load.
   const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+  assert(offset_in_range->is_con(), "must be a constant");
   if (!offset_in_range->is_con()) {
 if (C->print_intrinsics()) {
   tty->print_cr("  ** missing constant: offsetInRange=%s",

And run the tests with debug mode.  Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-11 Thread Paul Sandoz
On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename "use_predicate" to "needs_predicate"

I tried your test code with the patch and logged compilation 
(`-XX:-TieredCompilation -XX:+PrintCompilation -XX:+PrintInlining 
-XX:+PrintIntrinsics -Xbatch`)

For `func` the first call to `VectorSupport::loadMasked` is intrinsic and 
inlined:

@ 45   jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes)   
(intrinsic)

But the second call (for the last loop iteration) fails to inline: 

@ 45   jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes)   failed 
to inline (intrinsic)


Since i am running on an mac book this looks right and aligns with the 
`-XX:+PrintIntrinsics` output: 

  ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not 
support it
  ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not 
support it
  ** not supported: op=loadMasked vlen=8 etype=int using_byte_array=0

?

I have not looked at the code gen nor measured performance comparing the case 
when never out of bounds and only out of bounds for the last loop iteration.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-11 Thread Paul Sandoz
On Wed, 11 May 2022 03:23:13 GMT, Xiaohong Gong  wrote:

>> I modified the code of this PR to avoid the conversion of `boolean` to 
>> `int`, so a constant integer value is passed all the way through, and the 
>> masked load is made intrinsic from the method at which the constants are 
>> passed as arguments i.e. the public `fromArray` mask accepting method.
>
> Hi @PaulSandoz , thanks for the patch for the constant int parameter. I think 
> the main change is:
> 
> -ByteVector fromArray0Template(Class maskClass, C base, long offset, 
> int index, M m, boolean offsetInRange,
> +ByteVector fromArray0Template(Class maskClass, C base, long offset, 
> int index, M m, int offsetInRange,
>  VectorSupport.LoadVectorMaskedOperation ByteVector, ByteSpecies, M> defaultImpl) {
>  m.check(species());
>  ByteSpecies vsp = vspecies();
> -if (offsetInRange) {
> -return VectorSupport.loadMasked(
> -vsp.vectorType(), maskClass, vsp.elementType(), 
> vsp.laneCount(),
> -base, offset, m, /* offsetInRange */ 1,
> -base, index, vsp, defaultImpl);
> -} else {
> -return VectorSupport.loadMasked(
> -vsp.vectorType(), maskClass, vsp.elementType(), 
> vsp.laneCount(),
> -base, offset, m, /* offsetInRange */ 0,
> -base, index, vsp, defaultImpl);
> -}
> +return VectorSupport.loadMasked(
> +vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
> +base, offset, m, offsetInRange == 1 ? 1 : 0,
> +base, index, vsp, defaultImpl);
>  }
> 
> which uses `offsetInRange == 1 ? 1 : 0`. Unfortunately this could not always 
> make sure the `offsetInRange` a constant a the compiler time. Again, this 
> change could also make the assertion fail randomly:
> 
> --- a/src/hotspot/share/opto/vectorIntrinsics.cpp
> +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
> @@ -1236,6 +1236,7 @@ bool 
> LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
>  } else {
>// Masked vector load with IOOBE always uses the predicated load.
>const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
> +  assert(offset_in_range->is_con(), "must be a constant");
>if (!offset_in_range->is_con()) {
>  if (C->print_intrinsics()) {
>tty->print_cr("  ** missing constant: offsetInRange=%s",
> 
> Sometimes, the compiler can parse it a constant. I think this depends on the 
> compiler OSR and speculative optimization. Did you try an example with IOOBE 
> on a non predicated hardware?
> 
> Here is the main code of my unittest to reproduce the issue:
> 
> static final VectorSpecies I_SPECIES = IntVector.SPECIES_128;
> static final int LENGTH = 1026;
> public static int[] ia;
> public static int[] ib;
> 
> private static void init() {
> for (int i = 0; i < LENGTH; i++) {
> ia[i] = i;
> ib[i] = 0;
> }
> 
> for (int i = 0; i < 2; i++) {
> m[i] = i % 2 == 0;
> }
> }
> 
>  private static void func() {
> VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0);
> for (int i = 0; i < LENGTH; i += vl) {
> IntVector av = IntVector.fromArray(I_SPECIES, ia, i, mask);
> av.lanewise(VectorOperators.ABS).intoArray(ic, i, mask);
> }
>  }
> 
>  public static void main(String[] args) {
> init();
> for (int i = 0; i < 1; i++) {
> func();
> }
>   }

@XiaohongGong Doh! The ternary was an experiment, and I forgot to re-run the 
code gen script before sending your the patch. See `IntVector`, which does not 
have that. I presume when the offset is not in range and the other code path is 
taken then it might be problematic unless all code paths are inlined. I will 
experiment further with tests.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-10 Thread Xiaohong Gong
On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> I modified the code of this PR to avoid the conversion of `boolean` to `int`, 
> so a constant integer value is passed all the way through, and the masked 
> load is made intrinsic from the method at which the constants are passed as 
> arguments i.e. the public `fromArray` mask accepting method.

Hi @PaulSandoz , thanks for the patch for the constant int parameter. I think 
the main change is:

-ByteVector fromArray0Template(Class maskClass, C base, long offset, int 
index, M m, boolean offsetInRange,
+ByteVector fromArray0Template(Class maskClass, C base, long offset, int 
index, M m, int offsetInRange,
 VectorSupport.LoadVectorMaskedOperation defaultImpl) {
 m.check(species());
 ByteSpecies vsp = vspecies();
-if (offsetInRange) {
-return VectorSupport.loadMasked(
-vsp.vectorType(), maskClass, vsp.elementType(), 
vsp.laneCount(),
-base, offset, m, /* offsetInRange */ 1,
-base, index, vsp, defaultImpl);
-} else {
-return VectorSupport.loadMasked(
-vsp.vectorType(), maskClass, vsp.elementType(), 
vsp.laneCount(),
-base, offset, m, /* offsetInRange */ 0,
-base, index, vsp, defaultImpl);
-}
+return VectorSupport.loadMasked(
+vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
+base, offset, m, offsetInRange == 1 ? 1 : 0,
+base, index, vsp, defaultImpl);
 }

which uses `offsetInRange == 1 ? 1 : 0`. Unfortunately this could not always 
make sure the `offsetInRange` a constant a the compiler time. Again, this 
change could also make the assertion fail randomly:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool 
LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
 } else {
   // Masked vector load with IOOBE always uses the predicated load.
   const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+  assert(offset_in_range->is_con(), "must be a constant");
   if (!offset_in_range->is_con()) {
 if (C->print_intrinsics()) {
   tty->print_cr("  ** missing constant: offsetInRange=%s",

Sometimes, the compiler can parse it a constant. I think this depends on the 
compiler OSR and speculative optimization. Did you try an example with IOOBE on 
a non predicated hardware?

Here is the main code of my unittest to reproduce the issue:

static final VectorSpecies I_SPECIES = IntVector.SPECIES_128;
static final int LENGTH = 1026;
public static int[] ia;
public static int[] ib;

private static void init() {
for (int i = 0; i < LENGTH; i++) {
ia[i] = i;
ib[i] = 0;
}

for (int i = 0; i < 2; i++) {
m[i] = i % 2 == 0;
}
}

 private static void func() {
VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0);
for (int i = 0; i < LENGTH; i += vl) {
IntVector av = IntVector.fromArray(I_SPECIES, ia, i, mask);
av.lanewise(VectorOperators.ABS).intoArray(ic, i, mask);
}
 }

 public static void main(String[] args) {
init();
for (int i = 0; i < 1; i++) {
func();
}
  }

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-09 Thread Xiaohong Gong
On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz  wrote:

> I modified the code of this PR to avoid the conversion of `boolean` to `int`, 
> so a constant integer value is passed all the way through, and the masked 
> load is made intrinsic from the method at which the constants are passed as 
> arguments i.e. the public `fromArray` mask accepting method.

Great and thanks! Could you please show me the changes or an example? I can 
push the changes to this PR. Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-09 Thread Paul Sandoz
On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename "use_predicate" to "needs_predicate"

I did modified the code of this PR to avoid the conversion of `boolean` to 
`int` and the masked load is made intrinsic from the method at which the 
constants are passed as arguments i.e. the public `fromArray` mask accepting 
method.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-06 Thread Xiaohong Gong
On Fri, 6 May 2022 14:59:26 GMT, Paul Sandoz  wrote:

>> Make sense! Thanks for the explanation!
>
> Doh! of course. This is not the first and will not be the last time i get 
> caught out by the 2-slot requirement.
> It may be useful to do this:
> 
> Node* mask_arg = is_store ? argument(8) : argument(7);

Yes, the mask argument is got like:

 Node* mask = unbox_vector(is_store ? argument(8) : argument(7), mbox_type, 
elem_bt, num_elem);

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-06 Thread Paul Sandoz
On Fri, 6 May 2022 04:49:39 GMT, Xiaohong Gong  wrote:

>> offset is long so uses two argument slots (5 and 6). 
>> mask is argument (7).
>> offsetInRange is argument(8).
>
> Make sense! Thanks for the explanation!

Doh! of course. This is not the first and will not be the last time i get 
caught out by the 2-slot requirement.
It may be useful to do this:

Node* mask_arg = is_store ? argument(8) : argument(7);

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Xiaohong Gong
On Fri, 6 May 2022 04:22:30 GMT, Sandhya Viswanathan  
wrote:

>> I'm afraid it's `argument(8)` for the load operation since the `argument(7)` 
>> is the mask input. It seems the argument number is not right begin from the 
>> mask input which is expected to be `6`. But the it's not. Actually I don't 
>> quite understand why.
>
> offset is long so uses two argument slots (5 and 6). 
> mask is argument (7).
> offsetInRange is argument(8).

Make sense! Thanks for the explanation!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Sandhya Viswanathan
On Fri, 6 May 2022 03:47:47 GMT, Xiaohong Gong  wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1238:
>> 
>>> 1236: } else {
>>> 1237:   // Masked vector load with IOOBE always uses the predicated 
>>> load.
>>> 1238:   const TypeInt* offset_in_range = 
>>> gvn().type(argument(8))->isa_int();
>> 
>> Should it be `argument(7)`? (and adjustments later to access the container).
>
> I'm afraid it's `argument(8)` for the load operation since the `argument(7)` 
> is the mask input. It seems the argument number is not right begin from the 
> mask input which is expected to be `6`. But the it's not. Actually I don't 
> quite understand why.

offset is long so uses two argument slots (5 and 6). 
mask is argument (7).
offsetInRange is argument(8).

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Xiaohong Gong
On Thu, 5 May 2022 19:27:47 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename "use_predicate" to "needs_predicate"
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1238:
> 
>> 1236: } else {
>> 1237:   // Masked vector load with IOOBE always uses the predicated load.
>> 1238:   const TypeInt* offset_in_range = 
>> gvn().type(argument(8))->isa_int();
> 
> Should it be `argument(7)`? (and adjustments later to access the container).

I'm afraid it's `argument(8)` for the load operation since the `argument(7)` is 
the mask input. It seems the argument number is not right begin from the mask 
input which is expected to be `6`. But the it's not. Actually I don't quite 
understand why.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Paul Sandoz
On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename "use_predicate" to "needs_predicate"

src/hotspot/share/opto/vectorIntrinsics.cpp line 1238:

> 1236: } else {
> 1237:   // Masked vector load with IOOBE always uses the predicated load.
> 1238:   const TypeInt* offset_in_range = 
> gvn().type(argument(8))->isa_int();

Should it be `argument(7)`? (and adjustments later to access the container).

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Rename "use_predicate" to "needs_predicate"

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/9b2d2f19..9c69206e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=8035=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=8035=01-02

  Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-05 Thread Xiaohong Gong
On Thu, 5 May 2022 02:14:08 GMT, Xiaohong Gong  wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:
>> 
>>> 1230:   // out when current case uses the predicate feature.
>>> 1231:   if (!supports_predicate) {
>>> 1232: bool use_predicate = false;
>> 
>> If we rename this to needs_predicate it will be easier to understand.
>
> Thanks for the comment! This local variable will be removed after adding the 
> similar intrinsify for store masked. Please help to see the PR 
> https://github.com/openjdk/jdk/pull/8544. Thanks so much!

Renamed to "needs_predicate". Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 31 Mar 2022 02:15:26 GMT, Quan Anh Mai  wrote:

>> I'm afraid not. "Load + Blend" makes the elements of unmasked lanes to be 
>> `0`. Then a full store may change the values in the unmasked memory to be 0, 
>> which is different with the mask store API definition.
>
> The blend should be with the intended-to-store vector, so that masked lanes 
> contain the need-to-store elements and unmasked lanes contain the loaded 
> elements, which would be stored back, which results in unchanged values.

Hi @merykitty @jatin-bhateja , could you please help to take a review at the 
similar store masked PR https://github.com/openjdk/jdk/pull/8544 ? Any feedback 
is welcome! Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 28 Apr 2022 00:13:49 GMT, Sandhya Viswanathan 
 wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename the "usePred" to "offsetInRange"
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:
> 
>> 1230:   // out when current case uses the predicate feature.
>> 1231:   if (!supports_predicate) {
>> 1232: bool use_predicate = false;
> 
> If we rename this to needs_predicate it will be easier to understand.

Thanks for the comment! This local variable will be removed after adding the 
similar intrinsify for store masked. Please help to see the PR 
https://github.com/openjdk/jdk/pull/8544. Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 5 May 2022 01:42:48 GMT, Xiaohong Gong  wrote:

> > > Yeah, I agree that it's not good by adding a branch checking for 
> > > `offsetInRange`. But actually I met the constant issue that passing the 
> > > values all the way cannot guarantee the argument a constant in compiler 
> > > at the compile time. Do you have any better idea to fixing this?
> > 
> > 
> > That's odd, `boolean` constants are passed that are then converted to `int` 
> > constants. Did you try passing integer constants all the way through?
> 
> I will try again. I remember the main cause is the calling of `fromArray0` 
> from `fromArray`, it is not annotated with `ForceInline`. The arguments might 
> not be compiled to a constant for cases that the offset is not in the array 
> range like tail loop.

I tried to pass the integer constant all the way, and unfortunate that the 
`offsetInRange` is not compiled to a constant. The following assertion in the 
`vectorIntrinsics.cpp` will fail:

  --- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool 
LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
 } else {
   // Masked vector load with IOOBE always uses the predicated load.
   const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+  assert(offset_in_range->is_con(), "not a constant");
   if (!offset_in_range->is_con()) {
 if (C->print_intrinsics()) {
   tty->print_cr("  ** missing constant: offsetInRange=%s",

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Thu, 5 May 2022 01:21:40 GMT, Paul Sandoz  wrote:

> > Yeah, I agree that it's not good by adding a branch checking for 
> > `offsetInRange`. But actually I met the constant issue that passing the 
> > values all the way cannot guarantee the argument a constant in compiler at 
> > the compile time. Do you have any better idea to fixing this?
> 
> That's odd, `boolean` constants are passed that are then converted to `int` 
> constants. Did you try passing integer constants all the way through?

I will try again. I remember the main cause is the calling of `fromArray0` from 
`fromArray`, it is not annotated with `ForceInline`. The arguments might not be 
compiled to a constant for cases that the offset is not in the array range like 
tail loop.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Paul Sandoz
On Thu, 5 May 2022 01:13:23 GMT, Xiaohong Gong  wrote:

> Yeah, I agree that it's not good by adding a branch checking for 
> `offsetInRange`. But actually I met the constant issue that passing the 
> values all the way cannot guarantee the argument a constant in compiler at 
> the compile time. Do you have any better idea to fixing this?

That's odd, `boolean` constants are passed that are then converted to `int` 
constants.
Did you try passing integer constants all the way through?

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-05-04 Thread Xiaohong Gong
On Fri, 29 Apr 2022 21:34:13 GMT, Paul Sandoz  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Rename the "usePred" to "offsetInRange"
>
> IIUC when the hardware does not support predicated loads then any false 
> `offsetIntRange` value causes the load intrinsic to fail resulting in the 
> fallback, so it would not be materially any different to the current 
> behavior, just more uniformly implemented.
> 
> Why can't the intrinsic support the passing a boolean directly? Is it 
> something to do with constants? If that is not possible I recommend creating 
> named constant values and pass those all the way through rather than 
> converting a boolean to an integer value. Then there is no need for a branch 
> checking `offsetInRange`.
> 
> Might be better to hold off until the JEP is integrated and then update, 
> since this will conflict (`byte[]` and `ByteBuffer` load methods are removed 
> and `MemorySegment` load methods are added). You could prepare for that now 
> by branching off `vectorIntrinsics`.

Thanks for your comments @PaulSandoz !

> IIUC when the hardware does not support predicated loads then any false 
> `offsetIntRange` value causes the load intrinsic to fail resulting in the 
> fallback, so it would not be materially any different to the current 
> behavior, just more uniformly implemented.

Yes, it's true that this patch doesn't have any different to the hardware that 
does not support the predicated loads. It only benefits the predicated feature 
supported systems like ARM SVE and X86 AVX-512.

> Why can't the intrinsic support the passing a boolean directly? Is it 
> something to do with constants? If that is not possible I recommend creating 
> named constant values and pass those all the way through rather than 
> converting a boolean to an integer value. Then there is no need for a branch 
> checking offsetInRange.

Yeah, I agree that it's not good by adding a branch checking for 
`offsetInRange`. But actually I met the constant issue that passing the values 
all the way cannot guarantee the argument a constant in compiler at the compile 
time. Do you have any better idea to fixing this?

> Might be better to hold off until the JEP is integrated and then update, 
> since this will conflict (byte[] and ByteBuffer load methods are removed and 
> MemorySegment load methods are added). You could prepare for that now by 
> branching off vectorIntrinsics.

Agree. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-29 Thread Paul Sandoz
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename the "usePred" to "offsetInRange"

IIUC when the hardware does not support predicated loads then any false 
`offsetIntRange` value causes the load intrinsic to fail resulting in the 
fallback, so it would not be materially any different to the current behavior, 
just more uniformly implemented.

Why can't the intrinsic support the passing a boolean directly? Is it something 
to do with constants? If that is not possible I recommend creating named 
constant values and pass those all the way through rather than converting a 
boolean to an integer value. Then there is no need for a branch checking 
`offsetInRange`.

Might be better to hold off until the JEP is integrated and then update, since 
this will conflict (`byte[]` and `ByteBuffer` load methods are removed and 
`MemorySegment` load methods are added). You could prepare for that now by 
branching off `vectorIntrinsics`.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-28 Thread Sandhya Viswanathan
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename the "usePred" to "offsetInRange"

@PaulSandoz Could you please take a look at the Java changes when you find 
time. This PR from @XiaohongGong is a very good step towards long standing 
Vector API  wish list for better tail loop handling.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-28 Thread Jatin Bhateja
On Wed, 20 Apr 2022 02:44:39 GMT, Xiaohong Gong  wrote:

>>> The blend should be with the intended-to-store vector, so that masked lanes 
>>> contain the need-to-store elements and unmasked lanes contain the loaded 
>>> elements, which would be stored back, which results in unchanged values.
>> 
>> It may not work if memory is beyond legal accessible address space of the 
>> process, a corner case could be a page boundary.  Thus re-composing the 
>> intermediated vector which partially contains actual updates but effectively 
>> perform full vector write to destination address may not work in all 
>> scenarios.
>
> Thanks for the comment! So how about adding the check for the valid array 
> range like the masked vector load?
> Codes like:
> 
> public final
> void intoArray(byte[] a, int offset,
>VectorMask m) {
> if (m.allTrue()) {
> intoArray(a, offset);
> } else {
> ByteSpecies vsp = vspecies();
> if (offset >= 0 && offset <= (a.length - vsp.length())) { // 
> a full range check
> intoArray0(a, offset, m, /* usePred */ false);
>// can be vectorized by load+blend_store
> } else {
> checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
> intoArray0(a, offset, m, /* usePred */ true); 
>// only be vectorized by the predicated store
> }
> }
> }

Thanks, this looks ok since out of range condition will not be intrinsified if 
targets does not support predicated vector store.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-27 Thread Sandhya Viswanathan
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename the "usePred" to "offsetInRange"

Rest of the patch looks good to me.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:

> 1230:   // out when current case uses the predicate feature.
> 1231:   if (!supports_predicate) {
> 1232: bool use_predicate = false;

If we rename this to needs_predicate it will be easier to understand.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-22 Thread Xiaohong Gong
On Wed, 20 Apr 2022 02:46:09 GMT, Xiaohong Gong  wrote:

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
>> line 2861:
>> 
>>> 2859: ByteSpecies vsp = (ByteSpecies) species;
>>> 2860: if (offset >= 0 && offset <= (a.length - 
>>> species.vectorByteSize())) {
>>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* 
>>> usePred */ false).maybeSwap(bo);
>> 
>> Instead of usePred a term like inRange or offetInRage or offsetInVectorRange 
>> would be easier to follow.
>
> Thanks for the review. I will change it later.

The name is updated to `offsetInRange`. Thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-22 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Rename the "usePred" to "offsetInRange"

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/8f9e8a3c..9b2d2f19

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=8035=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=8035=00-01

  Stats: 393 lines in 41 files changed: 0 ins; 0 del; 393 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-19 Thread Xiaohong Gong
On Sat, 9 Apr 2022 00:10:40 GMT, Sandhya Viswanathan  
wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
> line 2861:
> 
>> 2859: ByteSpecies vsp = (ByteSpecies) species;
>> 2860: if (offset >= 0 && offset <= (a.length - 
>> species.vectorByteSize())) {
>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* 
>> usePred */ false).maybeSwap(bo);
> 
> Instead of usePred a term like inRange or offetInRage or offsetInVectorRange 
> would be easier to follow.

Thanks for the review. I will change it later.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-19 Thread Xiaohong Gong
On Mon, 11 Apr 2022 09:04:36 GMT, Jatin Bhateja  wrote:

>> The optimization for masked store is recorded to: 
>> https://bugs.openjdk.java.net/browse/JDK-8284050
>
>> The blend should be with the intended-to-store vector, so that masked lanes 
>> contain the need-to-store elements and unmasked lanes contain the loaded 
>> elements, which would be stored back, which results in unchanged values.
> 
> It may not work if memory is beyond legal accessible address space of the 
> process, a corner case could be a page boundary.  Thus re-composing the 
> intermediated vector which partially contains actual updates but effectively 
> perform full vector write to destination address may not work in all 
> scenarios.

Thanks for the comment! So how about adding the check for the valid array range 
like the masked vector load?
Codes like:

public final
void intoArray(byte[] a, int offset,
   VectorMask m) {
if (m.allTrue()) {
intoArray(a, offset);
} else {
ByteSpecies vsp = vspecies();
if (offset >= 0 && offset <= (a.length - vsp.length())) { // a 
full range check
intoArray0(a, offset, m, /* usePred */ false);  
 // can be vectorized by load+blend_store
} else {
checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
intoArray0(a, offset, m, /* usePred */ true);   
 // only be vectorized by the predicated store
}
}
}

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-11 Thread Jatin Bhateja
On Thu, 31 Mar 2022 03:53:15 GMT, Xiaohong Gong  wrote:

>> Yeah, maybe I misunderstood what you mean. So maybe the masked store 
>> `(store(src, m))` could be implemented with:
>> 
>> 1) v1 = load
>> 2) v2 = blend(load, src, m)
>> 3) store(v2)
>> 
>> Let's record this a JBS and fix it with a followed-up patch. Thanks!
>
> The optimization for masked store is recorded to: 
> https://bugs.openjdk.java.net/browse/JDK-8284050

> The blend should be with the intended-to-store vector, so that masked lanes 
> contain the need-to-store elements and unmasked lanes contain the loaded 
> elements, which would be stored back, which results in unchanged values.

It may not work if memory is beyond legal accessible address space of the 
process, a corner case could be a page boundary.  Thus re-composing the 
intermediated vector which partially contains actual updates but effectively 
perform full vector write to destination address may not work in all scenarios.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-08 Thread Sandhya Viswanathan
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 2861:

> 2859: ByteSpecies vsp = (ByteSpecies) species;
> 2860: if (offset >= 0 && offset <= (a.length - 
> species.vectorByteSize())) {
> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* 
> usePred */ false).maybeSwap(bo);

Instead of usePred a term like inRange or offetInRage or offsetInVectorRange 
would be easier to follow.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-05 Thread Xiaohong Gong
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Hi @PaulSandoz @jatin-bhateja @sviswa7, could you please help to check this PR? 
Any feedback is welcome! Thanks a lot!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-03-30 Thread Quan Anh Mai
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1234:

> 1232: bool use_predicate = false;
> 1233: if (is_store) {
> 1234:   // Masked vector store always uses the predicated store.

Can masked store be implemented as Load + Blend + Store?

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-03-30 Thread Quan Anh Mai
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

AVX has `vmaskmovpd` and `vmaskmovps` for masked loads and stores, which do not 
required predicate vectors. I think the implementation should make it possible 
to take advantage of these instructions. Thanks.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-03-30 Thread Xiaohong Gong
Currently the vector load with mask when the given index happens out of the 
array boundary is implemented with pure java scalar code to avoid the IOOBE 
(IndexOutOfBoundaryException). This is necessary for architectures that do not 
support the predicate feature. Because the masked load is implemented with a 
full vector load and a vector blend applied on it. And a full vector load will 
definitely cause the IOOBE which is not valid. However, for architectures that 
support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with 
the predicated load instruction as long as the indexes of the masked lanes are 
within the bounds of the array. For these architectures, loading with unmasked 
lanes does not raise exception.

This patch adds the vectorization support for the masked load with IOOBE part. 
Please see the original java implementation (FIXME: optimize):


  @ForceInline
  public static
  ByteVector fromArray(VectorSpecies species,
   byte[] a, int offset,
   VectorMask m) {
  ByteSpecies vsp = (ByteSpecies) species;
  if (offset >= 0 && offset <= (a.length - species.length())) {
  return vsp.dummyVector().fromArray0(a, offset, m);
  }

  // FIXME: optimize
  checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
  return vsp.vOp(m, i -> a[offset + i]);
  }

Since it can only be vectorized with the predicate load, the hotspot must check 
whether the current backend supports it and falls back to the java scalar 
version if not. This is different from the normal masked vector load that the 
compiler will generate a full vector load and a vector blend if the predicate 
load is not supported. So to let the compiler make the expected action, an 
additional flag (i.e. `usePred`) is added to the existing "loadMasked" 
intrinsic, with the value "true" for the IOOBE part while "false" for the 
normal load. And the compiler will fail to intrinsify if the flag is "true" and 
the predicate load is not supported by the backend, which means that normal 
java path will be executed.

Also adds the same vectorization support for masked:
 - fromByteArray/fromByteBuffer
 - fromBooleanArray
 - fromCharArray

The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on 
the x86 AVX-512 system:

Benchmark  before   After  Units
LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms

Similar performance gain can also be observed on 512-bit SVE system.

-

Commit messages:
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8035=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8283667
  Stats: 821 lines in 43 files changed: 314 ins; 117 del; 390 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035