On Mon, 1 Jan 2024 14:36:06 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes the sub-word gather operation for x86 targets with AVX2 
>> and AVX512 features.
>> 
>> Following is a summary of the changes:
>> 
>> 1) Intrinsify sub-word gather using a hybrid algorithm: a partially unrolled 
>> scalar loop first accumulates values from the gather indices into a quadword 
>> (64-bit) slice, and a vector permutation then places the slice into the 
>> appropriate vector lanes. This prevents code bloat and generates a compact 
>> JIT sequence. Coupled with the savings from eliminating the expensive 
>> temporary array allocation in the existing Java implementation, this 
>> translates into significant gains of 1.5-10x with the included 
>> micro-benchmark on Intel Atom family CPUs with the JVM option UseAVX=2.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
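>> 
>> A conceptual sketch of the per-slice step for the byte case (plain C, for 
>> illustration only; the actual patch emits an equivalent JIT sequence, and the 
>> function and array names here are hypothetical):
>> 
>> ```c++
>> #include <stdint.h>
>> 
>> // Accumulate one 64-bit slice from eight gathered byte elements; a vector
>> // permutation then steers this quadword into its destination lanes.
>> static inline uint64_t gather8b_slice(const uint8_t* base, const int* idx) {
>>   uint64_t quad = 0;
>>   for (int i = 0; i < 8; i++) {                  // partially unrolled scalar gather
>>     quad |= (uint64_t)base[idx[i]] << (i * 8);   // place element i into byte i of the slice
>>   }
>>   return quad;
>> }
>> ```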
>> 
>> 
>> 2) For AVX512 targets, the algorithm uses integral gather instructions to 
>> load values from normalized indices that are a multiple of the integer size, 
>> followed by shuffling and packing the exact sub-word values out of the 
>> integral lanes.
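>> 
>> A rough C model of that flow for the short (16-bit) case, reflecting one 
>> plausible reading of the description above (purely illustrative; the actual 
>> patch emits integral gather instructions and vector shuffles instead):
>> 
>> ```c++
>> #include <stdint.h>
>> #include <string.h>
>> 
>> static inline void gather_shorts_via_int_gather(const uint8_t* base, const int* idx,
>>                                                 int16_t* out, int n) {
>>   for (int i = 0; i < n; i++) {
>>     int byte_off = idx[i] * 2;                        // byte offset of the short element
>>     int norm_off = byte_off & ~3;                     // normalized to a multiple of the int size
>>     uint32_t lane;
>>     memcpy(&lane, base + norm_off, 4);                // "integral gather" of a 32-bit lane
>>     out[i] = (int16_t)(lane >> ((byte_off & 3) * 8)); // shuffle/pack the exact sub-word out
>>   }
>> }
>> ```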
>> 
>> 3) The patch was also compared against a modified Java fallback implementation 
>> that replaces the temporary array allocation with a zero-initialized vector 
>> and a scalar loop which inserts the gathered values into the vector. However, 
>> a vector insert into a higher vector lane is a three-step process: it first 
>> extracts the upper 128-bit vector lane, updates it with the gathered sub-word 
>> value, and then inserts the lane back into its original position. This makes 
>> inserts into higher-order lanes costly compared to the proposed solution. In 
>> addition, the JIT code generated for the modified fallback implementation was 
>> very bulky, which may hurt inlining decisions in caller contexts.
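>> 
>> For illustration, the three-step sequence described above would look roughly 
>> like this inside a C2_MacroAssembler method on AVX2 (register names and the 
>> lane constant are hypothetical, not the exact fallback code):
>> 
>> ```c++
>> vextracti128(xtmp, dst, 1);                              // 1. extract the upper 128-bit lane
>> pinsrw(xtmp, Address(base, rtmp, Address::times_2), 0);  // 2. update it with the gathered sub-word
>> vinserti128(dst, dst, xtmp, 1);                          // 3. insert the updated lane back in place
>> ```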
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 12 commits:
> 
>  - Accelerating masked sub-word gathers for AVX2 targets, this gives 
> additional 1.5-4x speedups over existing implementation.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>  - Removing JDK-8321648 related changes.
>  - Refined AVX3 implementation with integral gather.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>  - Fix incorrect comment
>  - Review comments resolutions.
>  - Review comments resolutions.
>  - Review comments resolutions.
>  - Restricting masked sub-word gather to AVX512 target to align with integral 
> gather support.
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/518ec971...de47076e

Just had a quick look at this. Is there any support for gather with different 
indices for each element in the vector?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627:

> 1625:     vpsrlvd(dst, dst, xtmp, vlen_enc);
> 1626:     // Pack double word vector into byte vector.
> 1627:     vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc);

I would prefer less code duplication here. I think there are just a few values 
which you could set in variables and then use for both versions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634:

> 1632:                                                     Register offset, XMMRegister offset_vec, XMMRegister idx_vec,
> 1633:                                                     XMMRegister xtmp1, XMMRegister xtmp2, XMMRegister xtmp3, KRegister mask,
> 1634:                                                     KRegister gmask, int vlen_enc, int vlen) {

Would you mind giving a quick summary of what the input registers are and what 
exactly this method does?
Why do we need to call `vgather_subword_avx3` so many times 
(`lane_count_subwords`)?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757:

> 1755:     for (int i = 0; i < 4; i++) {
> 1756:       movl(rtmp, Address(idx_base, i * 4));
> 1757:       pinsrw(dst, Address(base, rtmp, Address::times_2), i);

Do I understand this right that you are basically doing this?
`dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)`
But this does not look like a gather, rather like 4 adjacent loads that pack 
the data together into a single 8*4 byte vector.

If so, maybe you should either leave a comment, or even rename the method.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1776:

> 1774:     for (int i = 0; i < 4; i++) {
> 1775:       movl(rtmp, Address(idx_base, i * 4));
> 1776:       addl(rtmp, offset);

Can the `offset` not be added to `idx_base` before the loop?
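
For what it's worth, one hypothetical shape of hoisting that add out of the loop 
would be to fold the offset into the data base pointer instead, assuming the loop 
body continues with a `pinsrw(dst, Address(base, rtmp, Address::times_2), i)` like 
the unmasked variant and that a spare temp register (`rtmp2`) is available (both 
are assumptions on my side):

```c++
lea(rtmp2, Address(base, offset, Address::times_2));       // fold offset into the data base once
for (int i = 0; i < 4; i++) {
  movl(rtmp, Address(idx_base, i * 4));
  pinsrw(dst, Address(rtmp2, rtmp, Address::times_2), i);  // (idx + offset) * 2 == offset * 2 + idx * 2
}
```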

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1900:

> 1898:       vgather8b(elem_ty, xtmp3, base, idx_base, rtmp, vlen_enc);
> 1899:     } else {
> 1900:       LP64_ONLY(vgather8b_masked(elem_ty, xtmp3, base, idx_base, mask, midx, rtmp, vlen_enc));

What happens if not `LP64_ONLY`?
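
For reference, the macro is defined roughly like this in HotSpot's 
`src/hotspot/share/utilities/macros.hpp`, so on a 32-bit build the wrapped 
statement simply expands to nothing:

```c++
#ifdef _LP64
#define LP64_ONLY(code) code
#define NOT_LP64(code)
#else
#define LP64_ONLY(code)
#define NOT_LP64(code) code
#endif
```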

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1821723578
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452399791
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452425355
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452440206
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452441071
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452443784
