On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
> The current vector `"NEG"` is implemented with substraction a vector by zero > in case the architecture does not support the negation instruction. And to > fit the predicate feature for architectures that support it, the masked > vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They both > can be optimized to a single negation instruction for ARM SVE. > And so does the non-masked "NEG" for NEON. Besides, implementing the masked > "NEG" with substraction for architectures that support neither negation > instruction nor predicate feature can also save several instructions than the > current pattern. > > To optimize the VectorAPI negation, this patch moves the implementation from > Java side to hotspot. The compiler will generate different nodes according to > the architecture: > - Generate the (predicated) negation node if architecture supports it, > otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. > - Generate `"zero.sub(v, m)"` for masked operation if the architecture does > not have predicate feature, otherwise generate the original pattern > `"v.xor(-1, m).add(1, m)"`. > > So with this patch, the following transformations are applied: > > For non-masked negation with NEON: > > movi v16.4s, #0x0 > sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s > > and with SVE: > > mov z16.s, #0 > sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s > > For masked negation with NEON: > > movi v17.4s, #0x1 > mvn v19.16b, v18.16b > mov v20.16b, v16.16b ==> neg v18.4s, v17.4s > bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b > add v19.4s, v20.4s, v17.4s > mov v18.16b, v16.16b > bsl v18.16b, v19.16b, v20.16b > > and with SVE: > > mov z16.s, #-1 > mov z17.s, #1 ==> neg z16.s, p0/m, z16.s > eor z18.s, p0/m, z18.s, z16.s > add z18.s, p0/m, z18.s, z17.s > > Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 > machines(note that the non-masked negation benchmarks do not have any > improvement on X86 since no instructions are changed): > > NEON: > Benchmark Gain > Byte128Vector.NEG 1.029 > Byte128Vector.NEGMasked 1.757 > Short128Vector.NEG 1.041 > Short128Vector.NEGMasked 1.659 > Int128Vector.NEG 1.005 > Int128Vector.NEGMasked 1.513 > Long128Vector.NEG 1.003 > Long128Vector.NEGMasked 1.878 > > SVE with 512-bits: > Benchmark Gain > ByteMaxVector.NEG 1.10 > ByteMaxVector.NEGMasked 1.165 > ShortMaxVector.NEG 1.056 > ShortMaxVector.NEGMasked 1.195 > IntMaxVector.NEG 1.002 > IntMaxVector.NEGMasked 1.239 > LongMaxVector.NEG 1.031 > LongMaxVector.NEGMasked 1.191 > > X86 (non AVX-512): > Benchmark Gain > ByteMaxVector.NEGMasked 1.254 > ShortMaxVector.NEGMasked 1.359 > IntMaxVector.NEGMasked 1.431 > LongMaxVector.NEGMasked 1.989 > > [1] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 > [2] > https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 src/hotspot/share/opto/vectorIntrinsics.cpp line 209: > 207: #ifndef PRODUCT > 208: if (C->print_intrinsics()) { > 209: tty->print_cr(" ** Rejected vector op (%s,%s,%d) because > architecture does not support variable vector negate", "variable vector negate" seems a bit strange to me. How about removing "variable"? src/hotspot/share/opto/vectorIntrinsics.cpp line 291: > 289: if ((mask_use_type & VecMaskUsePred) != 0) { > 290: if (!Matcher::has_predicated_vectors()) { > 291: return false; If we return here, we would miss the intrinsic failing msg "Rejected vector mask predicate using ...", right? src/hotspot/share/opto/vectornode.cpp line 141: > 139: case T_BYTE: > 140: case T_SHORT: > 141: case T_INT: return Op_NegVI; Why not add `Op_NegVB` for `BYTE` and `Op_NegVS` for `SHORT`? Is there any performance drop for byte/short negation operation if both of them are handled as a NegVI vector? src/hotspot/share/opto/vectornode.cpp line 1635: > 1633: } > 1634: > 1635: Node* NegVINode::Ideal(PhaseGVN* phase, bool can_reshape) { Much duplication in `NegVINode::Ideal` and `NegVLNode::Ideal`. Is it possible to refactor the implementation? ------------- PR: https://git.openjdk.java.net/jdk/pull/7782