On Thu, 19 Mar 2026 06:33:02 GMT, Xiaohong Gong <[email protected]> wrote:
>> Duplicate `ptrue`(`MaskAll`) instructions are generated with different
>> predicate registers on SVE when multiple `VectorMask.not()` operations
>> exist. This increases the predicate register pressure and reduces
>> performance, especially after loop is unrolled.
>>
>> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
>> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that alone. And
>> the cloned `MaskAll` nodes are not shared with each other.
>>
>> Since SVE has rules for the `andNot` pattern:
>>
>>   match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>>
>> `MaskAll` node should be cloned only when it is part of the `andNot` pattern
>> instead.
>>
>> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
>> matcher's commutative vector op list, so their operands are never swapped.
>> As a result, the `andNot` rule does not match when the `XorVMask` operands
>> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>>
>> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
>> adding the three binary mask bitwise IRs to the commutative op list.
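
The operand-order issue quoted above can be illustrated with a toy matcher.
This is only a sketch in plain Java, not HotSpot's matcher or ADLC output: the
`Node` class and the `matchesStrict`/`matchesCommutative` helpers are
hypothetical names made up for illustration. It assumes a rule that recognizes
`(AndVMask pn (XorVMask pm (MaskAll m1)))` only in that exact operand order,
and shows how also trying swapped operands covers the other orderings.

```java
class AndNotMatchSketch {
    // Toy IR node: an op name plus up to two inputs (null for leaves).
    static final class Node {
        final String op;
        final Node in1;
        final Node in2;

        Node(String op, Node in1, Node in2) {
            this.op = op;
            this.in1 = in1;
            this.in2 = in2;
        }

        static Node leaf(String op) {
            return new Node(op, null, null);
        }
    }

    static boolean isMaskAll(Node n) {
        return n != null && n.op.equals("MaskAll");
    }

    // Strict rule: matches only (AndVMask pn (XorVMask pm MaskAll)),
    // with every operand in exactly this position.
    static boolean matchesStrict(Node n) {
        return n.op.equals("AndVMask")
            && n.in2 != null
            && n.in2.op.equals("XorVMask")
            && isMaskAll(n.in2.in2);
    }

    // A "not" is an XorVMask with MaskAll on either side.
    static boolean isNot(Node n) {
        return n != null
            && n.op.equals("XorVMask")
            && (isMaskAll(n.in1) || isMaskAll(n.in2));
    }

    // Commutative rule: the "not" may be either input of AndVMask,
    // and MaskAll may be either input of XorVMask.
    static boolean matchesCommutative(Node n) {
        return n.op.equals("AndVMask") && (isNot(n.in1) || isNot(n.in2));
    }

    public static void main(String[] args) {
        Node pn = Node.leaf("pn");
        Node pm = Node.leaf("pm");
        Node all = Node.leaf("MaskAll");

        Node canonical = new Node("AndVMask", pn, new Node("XorVMask", pm, all));
        Node swappedXor = new Node("AndVMask", pn, new Node("XorVMask", all, pm));

        System.out.println("canonical,  strict:      " + matchesStrict(canonical));
        System.out.println("swappedXor, strict:      " + matchesStrict(swappedXor));
        System.out.println("swappedXor, commutative: " + matchesCommutative(swappedXor));
    }
}
```

In this toy model the strict rule misses the swapped `XorVMask` form, while the
commutative variant accepts it, which is the effect the quoted description
attributes to adding the mask ops to the commutative op list.
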
>>
>> Following is the performance result of the newly added JMH tested on V1 and
>> Grace(V2) machines respectively:
>>
>> V1 (SVE machine with 256-bit vector length):
>>
>> Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  256   54465.231  74374.960  1.365
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  512   29156.881  39601.358  1.358
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  256   15408.510  19808.722  1.285
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  512   7906.952   10297.837  1.302
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024  3767.122   5097.853   1.353
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  256   7762.614   10534.290  1.357
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  512   3976.759   5123.445   1.288
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/...
>
> Xiaohong Gong has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Add comment

I've been toying in this area on x86, except the problem on x86 is the
opposite: it does not clone the `MaskAll` node enough:

1. there is a missing rule for long: only nodes with an int constant are cloned
2. `XorVMask` is not treated as commutative, so even the existing rule does
   not always match

I have a [WIP](https://github.com/openjdk/jdk/compare/master...jerrinot:jdk:jh_knot_kxor)
branch. Is it worth sending a PR?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30013#issuecomment-4182582388
