> Duplicate `ptrue` (`MaskAll`) instructions are generated with different
> predicate registers on SVE when multiple `VectorMask.not()` operations exist.
> This increases predicate register pressure and reduces performance,
> especially after the loop is unrolled.
>
> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern alone,
> and the cloned `MaskAll` nodes are not shared with each other.
>
> Since SVE has rules for the `andNot` pattern:
>
>     match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>
> the `MaskAll` node should instead be cloned only when it is part of the
> `andNot` pattern.
>
> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
> matcher's commutative vector op list, so their operands are never swapped. As
> a result, the `andNot` rule does not match when the `XorVMask` operands
> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>
> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
> adding the three binary mask bitwise IRs to the commutative op list.
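
To make the second issue concrete, here is a minimal sketch of why operand order matters to a pattern match rule. It is a toy model, not C2's matcher: the `Node` record, `isNot`, and `matchesAndNot` are hypothetical names introduced only for illustration, and the `commutative` flag merely stands in for the effect of listing an op as commutative.

```java
import java.util.List;

class AndNotMatchSketch {
    // Toy IR node (hypothetical): an opcode name plus ordered inputs.
    record Node(String op, List<Node> in) {
        static Node leaf(String name) { return new Node(name, List.of()); }
        static Node of(String op, Node a, Node b) { return new Node(op, List.of(a, b)); }
    }

    static boolean isMaskAll(Node n) { return n.op().equals("MaskAll"); }

    // Recognizes the mask-not shape (XorVMask pm (MaskAll m1)).
    // Without commutativity only that exact operand order is accepted.
    static boolean isNot(Node n, boolean commutative) {
        if (!n.op().equals("XorVMask")) return false;
        Node a = n.in().get(0), b = n.in().get(1);
        return isMaskAll(b) || (commutative && isMaskAll(a));
    }

    // Recognizes (AndVMask pn (XorVMask pm (MaskAll m1))).
    // With commutativity, swapped operands of AndVMask and XorVMask also match.
    static boolean matchesAndNot(Node n, boolean commutative) {
        if (!n.op().equals("AndVMask")) return false;
        Node a = n.in().get(0), b = n.in().get(1);
        return isNot(b, commutative) || (commutative && isNot(a, commutative));
    }

    public static void main(String[] args) {
        Node pn = Node.leaf("pn"), pm = Node.leaf("pm"), all = Node.leaf("MaskAll");
        Node canonical = Node.of("AndVMask", pn, Node.of("XorVMask", pm, all));
        Node swapped = Node.of("AndVMask", pn, Node.of("XorVMask", all, pm));
        System.out.println(matchesAndNot(canonical, false));
        System.out.println(matchesAndNot(swapped, false));
        System.out.println(matchesAndNot(swapped, true));
    }
}
```

In this model the swapped form `(XorVMask (MaskAll m1) pm)` is rejected by the strict matcher and accepted only once operand swapping is allowed, which mirrors the behavior described above.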
>
> The following are the performance results of the newly added JMH benchmark,
> tested on V1 and Grace (V2) machines respectively:
>
> V1 (SVE machine with 256-bit vector length):
>
>   Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
>   MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  256   54465.231  74374.960  1.365
>   MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  512   29156.881  39601.358  1.358
>   MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
>   MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  256   15408.510  19808.722  1.285
>   MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  512   7906.952   10297.837  1.302
>   MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024  3767.122   5097.853   1.353
>   MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  256   7762.614   10534.290  1.357
>   MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  512   3976.759   5123.445   1.288
>   MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  1024  1937.389   2573.394   1.328
>   MaskLogicOperationsB...

Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:

 - Do not clone maskAll if two inputs of AndVMask are both mask-not pattern
 - Merge branch 'jdk:master' into JDK-8378737
 - Add comment
 - 8378737: AArch64: Fix SVE match rule issues for VectorMask.andNot()

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/30013/files
  - new: https://git.openjdk.org/jdk/pull/30013/files/e3f30aea..d4b60475

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=30013&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=30013&range=01-02

  Stats: 122215 lines in 3110 files changed: 65983 ins; 25130 del; 31102 mod
  Patch: https://git.openjdk.org/jdk/pull/30013.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30013/head:pull/30013

PR: https://git.openjdk.org/jdk/pull/30013
