On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote:
> Currently, a masked vector load whose index range runs past the array
> boundary is implemented in pure Java scalar code to avoid an IOOBE
> (IndexOutOfBoundsException). This is necessary for architectures that do
> not support the predicate feature, because there the masked load is
> implemented as a full vector load followed by a vector blend, and a full
> vector load would definitely go out of bounds, which is not valid. However,
> for architectures that support the predicate feature, such as
> SVE/AVX-512/RVV, the load can be vectorized with the predicated load
> instruction as long as the indexes of the lanes selected by the mask are
> within the bounds of the array. On these architectures, the lanes not
> selected by the mask are not loaded and do not raise an exception.
>
> This patch adds vectorization support for the IOOBE (out-of-bounds) path of
> the masked load. Here is the original Java implementation (note the
> `FIXME: optimize` comment):
>
>
>     @ForceInline
>     public static
>     ByteVector fromArray(VectorSpecies<Byte> species,
>                          byte[] a, int offset,
>                          VectorMask<Byte> m) {
>         ByteSpecies vsp = (ByteSpecies) species;
>         if (offset >= 0 && offset <= (a.length - species.length())) {
>             return vsp.dummyVector().fromArray0(a, offset, m);
>         }
>
>         // FIXME: optimize
>         checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>         return vsp.vOp(m, i -> a[offset + i]);
>     }
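
For readers following the fallback path, the scalar semantics of the last two statements above can be modelled in plain Java roughly as follows. This is only a sketch, not the JDK code: `MaskedLoadSketch`, `loadMasked`, and the `boolean[]` mask are hypothetical stand-ins for `ByteVector`, `fromArray`, and `VectorMask<Byte>`, and the bounds check here is done per lane rather than up front as the `checkMaskFromIndexSize` call does.

```java
import java.util.Arrays;

public class MaskedLoadSketch {
    // Hypothetical scalar model of a masked load: lanes whose mask bit is set
    // read a[offset + i]; lanes whose bit is unset yield 0 and never touch the array.
    public static byte[] loadMasked(byte[] a, int offset, boolean[] mask) {
        byte[] lanes = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unset lane: no memory access, so no bounds check needed
            }
            int idx = offset + i;
            if (idx < 0 || idx >= a.length) {
                // Only a *set* lane that falls outside the array is an error.
                throw new IndexOutOfBoundsException(
                        "set lane " + i + " maps to index " + idx);
            }
            lanes[i] = a[idx];
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6};
        // offset 4 with 4 lanes crosses the end of the array, but only the
        // first two lanes are set, so the load is legal.
        boolean[] tail = {true, true, false, false};
        System.out.println(Arrays.toString(loadMasked(a, 4, tail))); // prints [5, 6, 0, 0]
    }
}
```

With the third lane set as well, the same call would throw, which is the case a predicated hardware load must never be asked to handle.
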
>
> Since this path can only be vectorized with a predicated load, HotSpot must
> check whether the current backend supports it and fall back to the Java
> scalar version if not. This differs from the normal masked vector load, for
> which the compiler generates a full vector load and a vector blend when the
> predicated load is not supported. To let the compiler take the expected
> action, an additional flag (i.e. `usePred`) is added to the existing
> "loadMasked" intrinsic, with the value "true" for the IOOBE path and
> "false" for the normal load. The compiler fails to intrinsify if the
> flag is "true" and the backend does not support the predicated load, which
> means that the normal Java path is executed.
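
The decision described above can be summarised as a small truth table. The following is a hedged sketch in plain Java, not HotSpot code: the class, the enum, and the `backendHasPredicatedLoad` parameter are hypothetical names used only to illustrate how `usePred` changes the outcome when the backend lacks a predicated load.

```java
public class LoadMaskedDispatchSketch {
    // Hypothetical labels for the three outcomes described in the PR text.
    public enum Lowering { PREDICATED_LOAD, FULL_LOAD_AND_BLEND, JAVA_FALLBACK }

    // usePred == true  : IOOBE path, must not read lanes outside the array.
    // usePred == false : normal path, a full load is known to be in bounds.
    public static Lowering choose(boolean usePred, boolean backendHasPredicatedLoad) {
        if (backendHasPredicatedLoad) {
            return Lowering.PREDICATED_LOAD;
        }
        // No predicated load: a full load plus blend is only safe on the
        // normal path; on the IOOBE path intrinsification fails instead.
        return usePred ? Lowering.JAVA_FALLBACK : Lowering.FULL_LOAD_AND_BLEND;
    }

    public static void main(String[] args) {
        for (boolean usePred : new boolean[]{false, true}) {
            for (boolean hasPred : new boolean[]{false, true}) {
                System.out.println("usePred=" + usePred
                        + ", predicated load supported=" + hasPred
                        + " -> " + choose(usePred, hasPred));
            }
        }
    }
}
```

The only combination that leaves the intrinsic unused is `usePred` set on a backend without predicate support, which matches the fallback to the Java scalar loop.
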
>
> This patch also adds the same vectorization support for the masked versions of:
> - fromByteArray/fromByteBuffer
> - fromBooleanArray
> - fromCharArray
>
> The newly added benchmarks improve by about `1.88x ~ 30.26x` on an x86
> AVX-512 system:
>
> Benchmark                                            Before     After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE     737.542  1387.069  ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE   118.366   330.776  ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE    233.832  6125.026  ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE      233.816  7075.923  ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE     119.771   330.587  ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE    431.961   939.301  ops/ms
>
> A similar performance gain can also be observed on a 512-bit SVE system.
Hi @PaulSandoz @jatin-bhateja @sviswa7, could you please take a look at this PR?
Any feedback is welcome! Thanks a lot!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8035