Re: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6]

Jatin Bhateja Tue, 25 Apr 2023 05:23:04 -0700

On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:


>> `Vector::slice` is a method at the top-level class of the Vector API that 
>> concatenates the 2 inputs into an intermediate composite and extracts a 
>> window equal to the size of the inputs into the result. It is used in vector 
>> conversion methods where the part number is not 0 to slice the parts to the 
>> correct positions. Slicing is also used in text processing such as utf8 and 
>> utf16 validation. x86 starting from SSSE3 has `palignr` which does vector 
>> slicing very efficiently. As a result, I think it is beneficial to add a C2 
>> node for this operation as well as intrinsify `Vector::slice` method.
>> 
>> A slice is currently implemented as 
>> `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires 
>> preparation of the index vector and the blending mask. Even with the 
>> preparations being hoisted out of the loops, microbenchmarks show 
>> improvement using the slice instrinsics. Some have tremendous increases in 
>> throughput due to the limitation that a mask of length 2 cannot currently be 
>> intrinsified, leading to falling back to the Java implementations.
>> 
>> Please take a look and have some reviews. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   style

src/hotspot/cpu/x86/x86.ad line 7953:

> 7951:       __ punpckldq($dst$$XMMRegister, $src$$XMMRegister);
> 7952:     }
> 7953:     __ psrldq($dst$$XMMRegister, $origin$$constant * 
> type2aelembytes(bt));

Move it to a new macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 7962:

> 7960:             !VM_Version::supports_ssse3());
> 7961:   match(Set dst (VectorSlice (Binary dst src) origin));
> 7962:   effect(TEMP xtmp);

Please also associate TEMP_DEF / TEMP with dst to avoid early source overwrite 
in case dst/src are allocated same register.

src/hotspot/cpu/x86/x86.ad line 7970:

> 7968:     __ movdqu($xtmp$$XMMRegister, $src$$XMMRegister);
> 7969:     __ pslldq($xtmp$$XMMRegister, 16 - shift_count);
> 7970:     __ por($dst$$XMMRegister, $xtmp$$XMMRegister);

Move to macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 8007:

> 8005:       }
> 8006:       __ vpsrldq($dst$$XMMRegister, $dst$$XMMRegister, shift_count, 
> Assembler::AVX_128bit);
> 8007:     }

Move to macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 8063:

> 8061:             (type2aelembytes(Matcher::vector_element_basic_type(n)) * 
> n->in(2)->get_int()) % 4U != 0 &&
> 8062:             (type2aelembytes(Matcher::vector_element_basic_type(n)) * 
> n->in(2)->get_int() < 16 ||
> 8063:              type2aelembytes(Matcher::vector_element_basic_type(n)) * 
> n->in(2)->get_int() > 48));

Move these bulky predications to source_hpp section, like done at 
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8786

src/hotspot/cpu/x86/x86.ad line 8082:

> 8080:             type2aelembytes(Matcher::vector_element_basic_type(n)) * 
> n->in(2)->get_int() > 16 &&
> 8081:             type2aelembytes(Matcher::vector_element_basic_type(n)) * 
> n->in(2)->get_int() < 48);
> 8082:   match(Set dst (VectorSlice (Binary src1 src2) origin));

Same as above.

src/hotspot/cpu/x86/x86.ad line 8099:

> 8097:             (Matcher::vector_length_in_bytes(n) == 64 ||
> 8098:              (Matcher::vector_length_in_bytes(n) == 32 &&
> 8099:               VM_Version::supports_avx512vl())));

Same as above.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1914:

> 1912:   if (vector_klass->const_oop() == NULL || elem_klass->const_oop() == 
> NULL ||
> 1913:       !vlen->is_con() || !origin_type->is_con()) {
> 1914:     if (C->print_intrinsics()) {

Hi @merykitty , your inline expander is not handling non-constant origin case, 
this will introduce performance regressions w.r.t to existing implementation.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428922
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176424735
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429190
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429410
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428080
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428309
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428542
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176407424

Re: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6]

Reply via email to