On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
>> `Vector::slice` is a method at the top-level class of the Vector API that >> concatenates the 2 inputs into an intermediate composite and extracts a >> window equal to the size of the inputs into the result. It is used in vector >> conversion methods where the part number is not 0 to slice the parts to the >> correct positions. Slicing is also used in text processing such as utf8 and >> utf16 validation. x86 starting from SSSE3 has `palignr` which does vector >> slicing very efficiently. As a result, I think it is beneficial to add a C2 >> node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as >> `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires >> preparation of the index vector and the blending mask. Even with the >> preparations being hoisted out of the loops, microbenchmarks show >> improvement using the slice instrinsics. Some have tremendous increases in >> throughput due to the limitation that a mask of length 2 cannot currently be >> intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional > commit since the last revision: > > style src/hotspot/cpu/x86/x86.ad line 7953: > 7951: __ punpckldq($dst$$XMMRegister, $src$$XMMRegister); > 7952: } > 7953: __ psrldq($dst$$XMMRegister, $origin$$constant * > type2aelembytes(bt)); Move it to a new macro assembly routine. src/hotspot/cpu/x86/x86.ad line 7962: > 7960: !VM_Version::supports_ssse3()); > 7961: match(Set dst (VectorSlice (Binary dst src) origin)); > 7962: effect(TEMP xtmp); Please also associate TEMP_DEF / TEMP with dst to avoid early source overwrite in case dst/src are allocated same register. src/hotspot/cpu/x86/x86.ad line 7970: > 7968: __ movdqu($xtmp$$XMMRegister, $src$$XMMRegister); > 7969: __ pslldq($xtmp$$XMMRegister, 16 - shift_count); > 7970: __ por($dst$$XMMRegister, $xtmp$$XMMRegister); Move to macro assembly routine. src/hotspot/cpu/x86/x86.ad line 8007: > 8005: } > 8006: __ vpsrldq($dst$$XMMRegister, $dst$$XMMRegister, shift_count, > Assembler::AVX_128bit); > 8007: } Move to macro assembly routine. src/hotspot/cpu/x86/x86.ad line 8063: > 8061: (type2aelembytes(Matcher::vector_element_basic_type(n)) * > n->in(2)->get_int()) % 4U != 0 && > 8062: (type2aelembytes(Matcher::vector_element_basic_type(n)) * > n->in(2)->get_int() < 16 || > 8063: type2aelembytes(Matcher::vector_element_basic_type(n)) * > n->in(2)->get_int() > 48)); Move these bulky predications to source_hpp section, like done at https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8786 src/hotspot/cpu/x86/x86.ad line 8082: > 8080: type2aelembytes(Matcher::vector_element_basic_type(n)) * > n->in(2)->get_int() > 16 && > 8081: type2aelembytes(Matcher::vector_element_basic_type(n)) * > n->in(2)->get_int() < 48); > 8082: match(Set dst (VectorSlice (Binary src1 src2) origin)); Same as above. src/hotspot/cpu/x86/x86.ad line 8099: > 8097: (Matcher::vector_length_in_bytes(n) == 64 || > 8098: (Matcher::vector_length_in_bytes(n) == 32 && > 8099: VM_Version::supports_avx512vl()))); Same as above. src/hotspot/share/opto/vectorIntrinsics.cpp line 1914: > 1912: if (vector_klass->const_oop() == NULL || elem_klass->const_oop() == > NULL || > 1913: !vlen->is_con() || !origin_type->is_con()) { > 1914: if (C->print_intrinsics()) { Hi @merykitty , your inline expander is not handling non-constant origin case, this will introduce performance regressions w.r.t to existing implementation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428922 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176424735 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429190 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429410 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428080 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428309 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428542 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176407424