On Thu, 19 Mar 2026 06:33:02 GMT, Xiaohong Gong <[email protected]> wrote:
>> Duplicate `ptrue`(`MaskAll`) instructions are generated with different
>> predicate registers on SVE when multiple `VectorMask.not()` operations
>> exist. This increases the predicate register pressure and reduces
>> performance, especially after the loop is unrolled.
>>
>> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
>> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern
>> alone, and the cloned `MaskAll` nodes are not shared with each other.
>>
>> Since SVE has rules for the `andNot` pattern:
>>
>> match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>>
>> the `MaskAll` node should be cloned only when it is part of the `andNot`
>> pattern.
>>
>> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
>> matcher's commutative vector op list, so their operands are never swapped.
>> As a result, the `andNot` rule does not match when the `XorVMask` operands
>> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>>
>> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
>> adding the three binary mask bitwise IRs to the commutative op list.
>>
>> The following are the performance results of the newly added JMH
>> benchmark, tested on V1 and Grace (V2) machines respectively:
>>
>> V1 (SVE machine with 256-bit vector length):
>>
>> Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  256   54465.231  74374.960  1.365
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  512   29156.881  39601.358  1.358
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  256   15408.510  19808.722  1.285
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  512   7906.952   10297.837  1.302
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024  3767.122   5097.853   1.353
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  256   7762.614   10534.290  1.357
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  512   3976.759   5123.445   1.288
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/...
>
> Xiaohong Gong has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Add comment
src/hotspot/cpu/aarch64/aarch64.ad line 2692:
> 2690: Node* u = n->unique_out();
> 2691: if (u->Opcode() == Op_AndVMask) {
> 2692: return true;
Suggestion:
// Check whether n is only used by an AndVMask node.
return n->outcnt() == 1 && n->unique_out()->Opcode() == Op_AndVMask;
Then you hardly even need a comment.
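
To illustrate how the one-line check behaves, here is a minimal standalone sketch. It does not use the real HotSpot classes: the `Node` struct, the `Opcodes` enum, and the helper name `only_used_by_and_vmask` are hypothetical stand-ins that model only the three accessors the suggestion relies on (`outcnt()`, `unique_out()`, `Opcode()`).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for HotSpot's opcode enum (illustration only).
enum Opcodes { Op_MaskAll, Op_XorVMask, Op_AndVMask, Op_OrVMask };

// Hypothetical toy node; the real Node class is far richer.
struct Node {
  Opcodes op;
  std::vector<Node*> outs;  // users of this node

  explicit Node(Opcodes o) : op(o) {}

  int Opcode() const { return op; }
  unsigned outcnt() const { return static_cast<unsigned>(outs.size()); }

  // Only valid when the node has exactly one user.
  Node* unique_out() const {
    assert(outs.size() == 1);
    return outs[0];
  }
};

// True only if n has exactly one user and that user is an AndVMask.
// The outcnt() test runs first, so unique_out() is never called on a
// node with zero or several users.
static bool only_used_by_and_vmask(const Node* n) {
  return n->outcnt() == 1 && n->unique_out()->Opcode() == Op_AndVMask;
}
```

In this sketch the short-circuit `&&` is what makes the single expression safe: a node with no users or with several users is rejected before `unique_out()` is evaluated.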
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r3020451973