> Duplicate `ptrue`(`MaskAll`) instructions are generated with different
> predicate registers on SVE when multiple `VectorMask.not()` operations exist.
> This increases the predicate register pressure and reduces performance,
> especially after the loop is unrolled.
>
> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern alone,
> and the cloned `MaskAll` nodes are not shared with each other.
>
> Since SVE has rules for the `andNot` pattern:
>
>   match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>
> the `MaskAll` node should be cloned only when it is part of the `andNot`
> pattern instead.
>
> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
> matcher's commutative vector op list, so their operands are never swapped. As
> a result, the `andNot` rule does not match when the `XorVMask` operands
> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>
> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
> adding the three binary mask bitwise IRs to the commutative op list.
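
To illustrate the second issue described above, here is a minimal sketch in plain Java of why operand order matters for the `andNot` rule. It is a toy model only, not HotSpot or ADLC code: the `Node` record, the `matchesAndNot` helper, and the way the commutative set is passed in are all hypothetical names invented for this example. It shows that a rule written for `(AndVMask pn (XorVMask pm (MaskAll m1)))` misses the swapped form unless the ops are treated as commutative.

```java
import java.util.Set;

/** Toy model of matching the andNot pattern; hypothetical, not HotSpot code. */
final class AndNotMatchSketch {
    /** Minimal IR node: an opcode name plus up to two inputs (null for leaves). */
    record Node(String op, Node in1, Node in2) {
        static Node leaf(String op) { return new Node(op, null, null); }
        static Node binary(String op, Node a, Node b) { return new Node(op, a, b); }
    }

    /** The three mask ops named in the patch description. */
    static final Set<String> MASK_LOGIC_OPS = Set.of("AndVMask", "OrVMask", "XorVMask");

    static boolean isMaskAll(Node n) {
        return n != null && n.op().equals("MaskAll");
    }

    /** Matches (XorVMask pm MaskAll); also the swapped form if XorVMask is commutative. */
    static boolean isNot(Node n, Set<String> commutative) {
        if (n == null || !n.op().equals("XorVMask")) return false;
        if (isMaskAll(n.in2())) return true;
        return commutative.contains("XorVMask") && isMaskAll(n.in1());
    }

    /** Matches (AndVMask pn <not>); also the swapped form if AndVMask is commutative. */
    static boolean matchesAndNot(Node n, Set<String> commutative) {
        if (n == null || !n.op().equals("AndVMask")) return false;
        if (isNot(n.in2(), commutative)) return true;
        return commutative.contains("AndVMask") && isNot(n.in1(), commutative);
    }

    public static void main(String[] args) {
        Node pn = Node.leaf("pn");
        Node pm = Node.leaf("pm");
        Node all = Node.leaf("MaskAll");
        // Operand order as written in the rule.
        Node canonical = Node.binary("AndVMask", pn, Node.binary("XorVMask", pm, all));
        // XorVMask operands in the opposite order.
        Node swapped = Node.binary("AndVMask", pn, Node.binary("XorVMask", all, pm));

        System.out.println("canonical, no swapping:   " + matchesAndNot(canonical, Set.of()));
        System.out.println("swapped, no swapping:     " + matchesAndNot(swapped, Set.of()));
        System.out.println("swapped, commutative ops: " + matchesAndNot(swapped, MASK_LOGIC_OPS));
    }
}
```

With an empty commutative set the swapped tree is rejected, while passing the three mask ops lets both operand orders match. The real matcher may implement operand swapping quite differently; the sketch only mirrors the behavior the description attributes to the commutative op list.
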
>
> The following are the performance results of the newly added JMH benchmark,
> tested on V1 and Grace (V2) machines respectively:
>
> V1 (SVE machine with 256-bit vector length):
>
> Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  256   54465.231  74374.960  1.365
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  512   29156.881  39601.358  1.358
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  256   15408.510  19808.722  1.285
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  512   7906.952   10297.837  1.302
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024  3767.122   5097.853   1.353
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  256   7762.614   10534.290  1.357
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  512   3976.759   5123.445   1.288
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  1024  1937.389   2573.394   1.328
> MaskLogicOperationsB...

Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:

  Add comment

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/30013/files
  - new: https://git.openjdk.org/jdk/pull/30013/files/496cd09b..e3f30aea

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=30013&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=30013&range=00-01

  Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/30013.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30013/head:pull/30013

PR: https://git.openjdk.org/jdk/pull/30013
