On Tue, 17 Mar 2026 10:08:01 GMT, Galder Zamarreño <[email protected]> wrote:
>> Duplicate `ptrue`(`MaskAll`) instructions are generated with different
>> predicate registers on SVE when multiple `VectorMask.not()` operations
>> exist. This increases the predicate register pressure and reduces
>> performance, especially after the loop is unrolled.
>>
>> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
>> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that alone. And
>> the cloned `MaskAll` nodes are not shared with each other.
>>
>> Since SVE has rules for the `andNot` pattern:
>>
>> match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>>
>> the `MaskAll` node should be cloned only when it is part of the `andNot`
>> pattern instead.
>>
>> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
>> matcher's commutative vector op list, so their operands are never swapped.
>> As a result, the `andNot` rule does not match when the `XorVMask` operands
>> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>>
>> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
>> adding the three binary mask bitwise IRs to the commutative op list.
>>
>> The following are the performance results of the newly added JMH benchmark,
>> tested on V1 and Grace (V2) machines respectively:
>>
>> V1 (SVE machine with 256-bit vector length):
>>
>> Benchmark                                    Mode  Threads Samples Unit   size  Before    After     Gain
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt 1       30      ops/ms 256   54465.231 74374.960 1.365
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt 1       30      ops/ms 512   29156.881 39601.358 1.358
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt 1       30      ops/ms 1024  15169.894 20272.379 1.336
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt 1       30      ops/ms 256   15408.510 19808.722 1.285
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt 1       30      ops/ms 512   7906.952  10297.837 1.302
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt 1       30      ops/ms 1024  3767.122  5097.853  1.353
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt 1       30      ops/ms 256   7762.614  10534.290 1.357
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt 1       30      ops/ms 512   3976.759  5123.445  1.288
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt 1       30      ops/...
>
> src/hotspot/cpu/aarch64/aarch64.ad line 2687:
>
>> 2685: VectorNode::is_all_ones_vector(m)) {
>> 2686: // Check whether n is only used by an AndVMask node.
>> 2687: if (n->outcnt() == 1) {
>
> This is not something for this PR, but could this optimization also apply if
> the mask was used by more than one node? Is this something that could be done
> as a follow up? Or would it not work at all? If it doesn't make doing so it
> might be worth adding a comment for future readers?
I'm not quite sure which `mask` you are referring to here.
If the mask is the result of `XorVMask`: when it is used by more than one node,
I don't think its input `MaskAll` should be cloned. Otherwise a duplicated
machnode for `MaskAll` would be generated while the backend match rule still
fails to match, because the child `XorVMask` is not single-used. That is
exactly the issue we want to fix here.
If the mask is `MaskAll` (the all-ones vector): when it is used by more than
one node beyond the `AndNot` pattern, the backend match rule still will not
fire, because the matcher marks `MaskAll` as shared, and `pd_clone_node()`
cannot handle such cases at the moment. I investigated fixing this in the
matcher, but the solution would add noticeable complexity, which does not seem
worthwhile given the limited benefit, considering that `MaskAll` is a
loop-invariant operation. Therefore I'd prefer to keep the current behavior,
but adding a comment here makes sense to me.
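For readers less familiar with the pattern being discussed, here is a scalar
analogue in plain Java (a hypothetical sketch, not the actual SVE match rule)
of the two identities this patch relies on: `VectorMask.not()` is matched as
an XOR with the all-ones mask, AND-ing with that XOR result folds to a single
and-not, and XOR commutativity is what justifies adding `XorVMask` to the
commutative op list:

```java
public class AndNotIdentity {
    public static void main(String[] args) {
        // Scalar stand-ins for the mask nodes: pn and pm play the role of
        // predicate bit patterns, allOnes plays the role of MaskAll(true).
        long pn = 0b1010_1100L;
        long pm = 0b0110_0101L;
        long allOnes = -1L;

        // not(pm) is matched as (XorVMask pm (MaskAll m1)):
        long notPm = pm ^ allOnes;

        // (AndVMask pn (XorVMask pm (MaskAll m1))) denotes the same value
        // as a single and-not, which the backend can emit as one instruction:
        long viaPattern = pn & notPm;
        long viaAndNot  = pn & ~pm;
        System.out.println(viaPattern == viaAndNot);  // prints "true"

        // XOR is commutative, so the swapped operand order
        // (XorVMask (MaskAll m1) pm) is the same value -- which is why the
        // rule should also match after the matcher swaps the operands:
        System.out.println((allOnes ^ pm) == (pm ^ allOnes));  // prints "true"
    }
}
```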
> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskLogicOperationsBenchmark.java
> line 67:
>
>> 65: @Benchmark
>> 66: public void byteMaskAndNot() {
>> 67: VectorMask<Byte> vm1 = VectorMask.fromArray(B_SPECIES, ma, 0);
>
> If what is being benchmarked is the loop, would it make sense to move this to
> `@Setup`? Same thing for the other benchmarks below.
Thanks so much for looking at this PR!
> If what is being benchmarked is the loop, would it make sense to move this to
> @Setup?
Yeah, moving the mask generation to `@Setup` might be fine, but it wouldn't
really change much compared to the current version. My concerns are:
1. In `@Setup`, we would have to generate a different vector mask for each
element type (`byte`, `short`, `int`, `long`), and then use only the
compatible one in each benchmark. That wouldn't significantly simplify the
code.
2. It brings no performance benefit. If `vm1` were created in `@Setup init()`,
the vector mask would still have to be unboxed in the benchmark method before
it is used. Vector unboxing is heavier than calling the `fromArray()` API
directly, since it requires two memory loads (loading the vector mask object,
then loading its payload), while the latter needs only one. Even though all of
this happens outside the loop, it doesn't help the performance measurement,
right?
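To make the two-loads-versus-one point concrete, here is a scalar analogue in
plain Java (a hypothetical sketch; `StoredMask`, `firstLaneStored`, and
`firstLaneFromArray` are made-up names, not Vector API or JMH code). Reading a
mask kept in a holder object set up beforehand takes two dependent loads
(holder field, then payload), whereas building it straight from the source
array takes only the array load:

```java
public class MaskLoadSketch {
    // Hypothetical stand-in for a VectorMask field initialized in @Setup:
    // consuming it means loading the holder's field, then the payload.
    static final class StoredMask {
        final boolean[] payload;
        StoredMask(boolean[] p) { payload = p; }
    }

    static boolean firstLaneStored(StoredMask m) {
        return m.payload[0];   // two loads: m.payload, then payload[0]
    }

    static boolean firstLaneFromArray(boolean[] ma) {
        return ma[0];          // one load, straight from the source array
    }

    public static void main(String[] args) {
        boolean[] ma = {true, false, true};
        StoredMask stored = new StoredMask(ma);
        // Both paths observe the same mask bits, as expected:
        System.out.println(firstLaneStored(stored) == firstLaneFromArray(ma));
    }
}
```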
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2950538809
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2950493595