On Fri, 14 Nov 2025 01:17:50 GMT, Eric Fang <[email protected]> wrote:
> `VectorMaskCastNode` is used to cast a vector mask from one type to another
> type. The cast may be generated by calling the vector API `cast` or generated
> by the compiler. For example, some vector mask operations like `trueCount`
> require the input mask to have an integer element type, so for floating-point
> masks the compiler automatically casts the mask to the corresponding
> integer-type mask before doing the mask operation. This kind of cast is very
> common.
>
> If the vector element size is not changed, the `VectorMaskCastNode` doesn't
> generate code; otherwise code will be generated to extend or narrow the mask.
> This IR node is not free whether or not it generates code, because it may
> block some optimizations. For example:
> 1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`. The middle
> `VectorMaskCast` prevents the following optimization: `(VectorStoreMask
> (VectorLoadMask x)) => (x)`
> 2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`, which blocks
> the optimization `(VectorMaskToLong (VectorLongToMask x)) => (x)`.
>
> In these IR patterns, the value of the input `x` is not changed, so we can
> safely do the optimization. But if the input value is changed, we can't
> eliminate the cast.
>
> The general idea of this PR is to introduce an `uncast_mask` helper function,
> which can be used to uncast a chain of `VectorMaskCastNode`, like the
> existing `Node::uncast(bool)` function. The function returns the first node
> that is not a `VectorMaskCastNode`.
>
> The intended use case is when the IR pattern to be optimized may contain one
> or more consecutive `VectorMaskCastNode` and this does not affect the
> correctness of the optimization. Then this function can be called to
> eliminate the `VectorMaskCastNode` chain.
>
> Current optimizations related to `VectorMaskCastNode` include:
> 1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
> 2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) =>
> (VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.
>
> This PR does the following optimizations:
> 1. Extends the optimization pattern `(VectorMaskCast (VectorMaskCast x)) =>
> (x)` as `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`.
> This is correct as long as the types of the head and tail
> `VectorMaskCastNode` are consistent.
> 2. Supports a new optimization pattern `(VectorStoreMask (VectorMaskCast ...
> (VectorLoadMask x))) => (x)`. Since the value before and after the pattern is
> a boolean vector, it remains unchanged as long as th...
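The `uncast_mask` idea from the quoted description can be sketched with a toy model. This is only an illustration, not the code in the PR: `ToyNode`, `Op`, `toy_uncast_mask` and `toy_identity_store_mask` are hypothetical names, and the sketch deliberately ignores the type and vector-length conditions that the real C2 transformation has to check.

```cpp
#include <cassert>

// Toy IR: NOT HotSpot's Node hierarchy, just enough to model a cast chain.
enum class Op { Input, VectorLoadMask, VectorMaskCast, VectorStoreMask };

struct ToyNode {
    Op op;
    const ToyNode* in;  // single input; nullptr for Op::Input
};

// Skip a chain of consecutive VectorMaskCast nodes and return the first
// node that is not a cast (hypothetical stand-in for `uncast_mask`).
const ToyNode* toy_uncast_mask(const ToyNode* n) {
    assert(n != nullptr);
    while (n->op == Op::VectorMaskCast) {
        n = n->in;
    }
    return n;
}

// Models (VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => (x).
// Returns x when the pattern matches, otherwise the node unchanged.
// Assumption: any type/length preconditions are taken to hold.
const ToyNode* toy_identity_store_mask(const ToyNode* n) {
    assert(n != nullptr);
    if (n->op != Op::VectorStoreMask) {
        return n;
    }
    const ToyNode* in = toy_uncast_mask(n->in);
    if (in->op == Op::VectorLoadMask) {
        return in->in;
    }
    return n;
}
```

Building `x -> VectorLoadMask -> VectorMaskCast -> VectorMaskCast -> VectorStoreMask` and calling `toy_identity_store_mask` on the last node yields `x` in this model. A node that is not a `VectorStoreMask` is returned unchanged.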
I have updated the JMH benchmarks. The new test results are as follows.
On an NVIDIA Grace machine with 128-bit SVE2:
Benchmark                        Unit    Before  Error   After  Error  Uplift
microMaskLoadCastStoreByte64     ops/us   64.29   0.02  146.67   0.09    2.28
microMaskLoadCastStoreDouble128  ops/us   10.05   0.00   38.10   0.01    3.79
microMaskLoadCastStoreFloat128   ops/us   19.94   0.00   75.05   0.07    3.76
microMaskLoadCastStoreInt128     ops/us   19.94   0.00   75.13   0.01    3.77
microMaskLoadCastStoreLong128    ops/us   10.04   0.00   38.09   0.01    3.79
microMaskLoadCastStoreShort64    ops/us   31.52   0.02   75.12   0.02    2.38
On an NVIDIA Grace machine with 128-bit NEON:
Benchmark                        Unit    Before  Error   After  Error  Uplift
microMaskLoadCastStoreByte64     ops/us   73.33   0.01  147.01   0.06    2.00
microMaskLoadCastStoreDouble128  ops/us    8.54   0.03   38.19   0.01    4.47
microMaskLoadCastStoreFloat128   ops/us   23.75   0.01   75.27   0.10    3.17
microMaskLoadCastStoreInt128     ops/us   23.73   0.01   75.25   0.07    3.17
microMaskLoadCastStoreLong128    ops/us    8.56   0.03   38.19   0.01    4.46
microMaskLoadCastStoreShort64    ops/us   24.32   0.00   75.35   0.07    3.10
On an AMD EPYC 9124 16-Core Processor with AVX3:
Benchmark                        Unit    Before  Error   After  Error  Uplift
microMaskLoadCastStoreByte64     ops/us   82.39   0.11  115.15   0.03    1.40
microMaskLoadCastStoreDouble128  ops/us    0.32   0.00    0.32   0.00    0.99
microMaskLoadCastStoreFloat128   ops/us   42.10   0.10   57.58   0.02    1.37
microMaskLoadCastStoreInt128     ops/us   42.10   0.08   57.57   0.02    1.37
microMaskLoadCastStoreLong128    ops/us    0.32   0.00    0.32   0.00    0.99
microMaskLoadCastStoreShort64    ops/us   42.16   0.05   57.54   0.04    1.36
On an AMD EPYC 9124 16-Core Processor with AVX2:
Benchmark                        Unit    Before  Error   After  Error  Uplift
microMaskLoadCastStoreByte64     ops/us   73.59   0.27  115.14   0.04    1.56
microMaskLoadCastStoreDouble128  ops/us    0.30   0.00    0.30   0.00    1.01
microMaskLoadCastStoreFloat128   ops/us   30.68   0.09   57.57   0.02    1.88
microMaskLoadCastStoreInt128     ops/us   30.75   0.09   57.58   0.01    1.87
microMaskLoadCastStoreLong128    ops/us    0.30   0.00    0.30   0.00    1.00
microMaskLoadCastStoreShort64    ops/us   24.95   0.01   57.59   0.01    2.31
On an AMD EPYC 9124 16-Core Processor with AVX1:
Benchmark                        Unit    Before  Error   After  Error  Uplift
microMaskLoadCastStoreByte64     ops/us   73.68   0.02  115.17   0.03    1.56
microMaskLoadCastStoreDouble128  ops/us    0.30   0.00    0.30   0.00    1.01
microMaskLoadCastStoreFloat128   ops/us   30.80   0.12   57.59   0.01    1.87
microMaskLoadCastStoreInt128     ops/us   30.70   0.11   57.58   0.01    1.88
microMaskLoadCastStoreLong128    ops/us    0.30   0.00    0.30   0.00    0.99
microMaskLoadCastStoreShort64    ops/us   24.95   0.01   57.56   0.02    2.31
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28313#issuecomment-3555660413