Duplicate `ptrue` (`MaskAll`) instructions are generated with different
predicate registers on SVE when multiple `VectorMask.not()` operations exist.
This increases predicate register pressure and reduces performance,
especially after the loop is unrolled.
Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
`(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern alone,
and the cloned `MaskAll` nodes are not shared with each other.
Since SVE does have rules for the `andNot` pattern:
    match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
the `MaskAll` node should instead be cloned only when it is part of the
`andNot` pattern.
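For intuition, the `not` and `andNot` patterns can be modelled on plain bit masks. The sketch below is a scalar model only: a `long` stands in for a predicate register, the 8-lane width is arbitrary, and the class and method names are hypothetical. It is not HotSpot or Vector API code.

```java
class MaskModel {
    // All-true mask for a hypothetical 8-lane predicate (stand-in for MaskAll m1).
    static final long ALL_TRUE = 0xFFL;

    // not(pm), modelled as (XorVMask pm (MaskAll m1)).
    static long not(long pm) {
        return pm ^ ALL_TRUE;
    }

    // andNot(pn, pm), modelled as (AndVMask pn (XorVMask pm (MaskAll m1))).
    static long andNot(long pn, long pm) {
        return pn & (pm ^ ALL_TRUE);
    }

    public static void main(String[] args) {
        long pn = 0b1111_0000L;
        long pm = 0b1010_1010L;
        // XOR is commutative, so swapping the XorVMask operands gives the same value.
        System.out.println(not(pm) == (ALL_TRUE ^ pm));
        System.out.println(Long.toBinaryString(andNot(pn, pm)));
    }
}
```

In this model the all-true constant is only consumed by the XOR, which is why a single fused `andNot` rule has no need for a separately materialized `MaskAll` per use.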
A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the matcher's
commutative vector op list, so their operands are never swapped. As a result,
the `andNot` rule does not match when the `XorVMask` operands appear in the
opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
adding the three binary mask bitwise IR nodes to the commutative op list.
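The effect of the commutative op list can be illustrated with a toy pattern matcher. This is a simplified sketch under stated assumptions, not the HotSpot matcher: the `Node` class, the `matchesAndNot` helper, and the string opcodes are all hypothetical, and the operand swap is modelled as a simple second attempt with the inputs reversed.

```java
import java.util.Set;

class ToyMatcher {
    // Minimal IR node: an opcode plus up to two inputs (null for leaves).
    static final class Node {
        final String op;
        final Node in1;
        final Node in2;

        Node(String op, Node in1, Node in2) {
            this.op = op;
            this.in1 = in1;
            this.in2 = in2;
        }

        static Node leaf(String op) {
            return new Node(op, null, null);
        }
    }

    static boolean isMaskAll(Node n) {
        return n != null && n.op.equals("MaskAll");
    }

    // Matches (XorVMask pm MaskAll); tries the swapped order only if XorVMask is commutative.
    static boolean isNot(Node n, Set<String> commutative) {
        if (n == null || !n.op.equals("XorVMask")) return false;
        if (isMaskAll(n.in2)) return true;
        return commutative.contains("XorVMask") && isMaskAll(n.in1);
    }

    // Matches (AndVMask pn (XorVMask pm MaskAll)), with optional operand swaps.
    static boolean matchesAndNot(Node n, Set<String> commutative) {
        if (n == null || !n.op.equals("AndVMask")) return false;
        if (isNot(n.in2, commutative)) return true;
        return commutative.contains("AndVMask") && isNot(n.in1, commutative);
    }

    public static void main(String[] args) {
        // (AndVMask pn (XorVMask (MaskAll) pm)): XorVMask operands in the opposite order.
        Node swapped = new Node("AndVMask", Node.leaf("pn"),
                new Node("XorVMask", Node.leaf("MaskAll"), Node.leaf("pm")));

        Set<String> before = Set.of();
        Set<String> after = Set.of("AndVMask", "OrVMask", "XorVMask");

        System.out.println(matchesAndNot(swapped, before)); // false: no swap is attempted
        System.out.println(matchesAndNot(swapped, after));  // true: swapped order is tried
    }
}
```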
The following are the performance results of the newly added JMH benchmark,
measured on V1 and Grace (V2) machines respectively:
V1 (SVE machine with 256-bit vector length):
Benchmark                                                      Mode   Threads  Samples  Unit    size  Before      After       Gain
MaskLogicOperationsBenchmark.byteMaskAndNot                    thrpt  1        30       ops/ms  256   54465.231   74374.960   1.365
MaskLogicOperationsBenchmark.byteMaskAndNot                    thrpt  1        30       ops/ms  512   29156.881   39601.358   1.358
MaskLogicOperationsBenchmark.byteMaskAndNot                    thrpt  1        30       ops/ms  1024  15169.894   20272.379   1.336
MaskLogicOperationsBenchmark.intMaskAndNot                     thrpt  1        30       ops/ms  256   15408.510   19808.722   1.285
MaskLogicOperationsBenchmark.intMaskAndNot                     thrpt  1        30       ops/ms  512   7906.952    10297.837   1.302
MaskLogicOperationsBenchmark.intMaskAndNot                     thrpt  1        30       ops/ms  1024  3767.122    5097.853    1.353
MaskLogicOperationsBenchmark.longMaskAndNot                    thrpt  1        30       ops/ms  256   7762.614    10534.290   1.357
MaskLogicOperationsBenchmark.longMaskAndNot                    thrpt  1        30       ops/ms  512   3976.759    5123.445    1.288
MaskLogicOperationsBenchmark.longMaskAndNot                    thrpt  1        30       ops/ms  1024  1937.389    2573.394    1.328
MaskLogicOperationsBenchmark.shortMaskAndNot                   thrpt  1        30       ops/ms  256   30165.102   39632.060   1.313
MaskLogicOperationsBenchmark.shortMaskAndNot                   thrpt  1        30       ops/ms  512   15653.812   20026.600   1.279
MaskLogicOperationsBenchmark.shortMaskAndNot                   thrpt  1        30       ops/ms  1024  7838.684    10795.177   1.377
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots  thrpt  1        30       ops/ms  256   20185.546   21548.108   1.067
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots  thrpt  1        30       ops/ms  512   9549.994    11097.954   1.162
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots  thrpt  1        30       ops/ms  1024  4797.370    5624.987    1.172
Grace (V2, SVE machine with 128-bit vector length):
Benchmark                                                      Mode   Threads  Samples  Unit    size  Before      After       Gain
MaskLogicOperationsBenchmark.byteMaskAndNot                    thrpt  1        30       ops/ms  256   88221.700   114208.097  1.294
MaskLogicOperationsBenchmark.byteMaskAndNot                    thrpt  1        30       ops/ms  512   46472.948   64268.305   1.382
MaskLogicOperationsBenchmark.byteMaskAndNot                    thrpt  1        30       ops/ms  1024  24367.417   33957.434   1.393
MaskLogicOperationsBenchmark.intMaskAndNot                     thrpt  1        30       ops/ms  256   15774.203   27054.729   1.715
MaskLogicOperationsBenchmark.intMaskAndNot                     thrpt  1        30       ops/ms  512   7938.354    11484.306   1.446
MaskLogicOperationsBenchmark.intMaskAndNot                     thrpt  1        30       ops/ms  1024  3973.106    5658.552    1.424
MaskLogicOperationsBenchmark.longMaskAndNot                    thrpt  1        30       ops/ms  256   7976.768    11533.359   1.445
MaskLogicOperationsBenchmark.longMaskAndNot                    thrpt  1        30       ops/ms  512   4013.574    5662.615    1.410
MaskLogicOperationsBenchmark.longMaskAndNot                    thrpt  1        30       ops/ms  1024  2003.350    2810.982    1.403
MaskLogicOperationsBenchmark.shortMaskAndNot                   thrpt  1        30       ops/ms  256   30464.920   47910.299   1.572
MaskLogicOperationsBenchmark.shortMaskAndNot                   thrpt  1        30       ops/ms  512   15826.314   23330.242   1.474
MaskLogicOperationsBenchmark.shortMaskAndNot                   thrpt  1        30       ops/ms  1024  7936.939    11420.379   1.438
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots  thrpt  1        30       ops/ms  256   17008.969   21002.746   1.234
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots  thrpt  1        30       ops/ms  512   8159.229    10648.533   1.305
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots  thrpt  1        30       ops/ms  1024  4004.777    5355.436    1.337
-------------
Commit messages:
- 8378737: AArch64: Fix SVE match rule issues for VectorMask.andNot()
Changes: https://git.openjdk.org/jdk/pull/30013/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=30013&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8378737
Stats: 280 lines in 5 files changed: 260 ins; 10 del; 10 mod
Patch: https://git.openjdk.org/jdk/pull/30013.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/30013/head:pull/30013
PR: https://git.openjdk.org/jdk/pull/30013