Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

Jatin Bhateja Tue, 21 Nov 2023 11:58:53 -0800

On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> Hi All,
>> 
>> This patch optimizes the sub-word gather operation for x86 targets with AVX2 
>> and AVX512 features.
>> 
>> Following is the summary of changes:
>> 
>> 1) Intrinsify sub-word gather with a high-performance backend implementation 
>> based on a hybrid algorithm. It first partially unrolls a scalar loop to 
>> accumulate values from gather indices into a quadword (64-bit) slice, then 
>> uses a vector permutation to place the slice into the appropriate vector 
>> lanes. This prevents code bloat and generates a compact JIT sequence. 
>> Coupled with the savings from avoiding the expensive array allocation in the 
>> existing Java implementation, this translates into significant performance 
>> gains of 1.3-5x with the included microbenchmark.
>> 
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) The patch was also compared against a modified Java fallback 
>> implementation that replaces the temporary array allocation with a 
>> zero-initialized vector and a scalar loop which inserts gathered values into 
>> the vector. However, a vector insert into the higher vector lanes is a 
>> three-step process: it first extracts the upper 128-bit vector lane, updates 
>> it with the gathered subword value, and then inserts the lane back into its 
>> original position. This makes inserts into higher-order lanes costly 
>> compared to the proposed solution. In addition, the generated JIT code for 
>> the modified fallback implementation was very bulky, which may impact 
>> inlining decisions in caller contexts.
>> 
>> 3) Some minor adjustments to the existing gather instruction patterns for 
>> double/quad words.
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix incorrect comment

> I have not thought about this thoroughly, so it may be incorrect. Can we 
> delegate the gather to `int` and extract the result from it?
> 
> ```
> // Align the index so that the address is aligned for int accesses
> vpand(xtmp1, idx, bt == T_SHORT ? -2 : -4);
> vpgatherdd(xtmp2, mask, Address(base, xtmp1, bt == T_SHORT ? times_2 : times_1, offset));
> // Need to align the requested elements
> vpand(xtmp1, idx, bt == T_SHORT ? 1 : 3);
> vpslld(xtmp1, xtmp1, bt == T_SHORT ? 4 : 3);
> vpsrlvd(xtmp1, xtmp2, xtmp1);
> vpmovdw(dst, xtmp1);
> ```




A double-word gather always accesses 4 contiguous bytes starting at the 
normalized index, so it cannot prevent access violations if the other 3 
bytes of the double word are inaccessible.
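
To illustrate the concern, here is a minimal scalar sketch in Java of the 
byte case. It is not the HotSpot code: the class and method names 
(`DwordGatherModel`, `extractByte`, `bytesPastEnd`) are hypothetical, and a 
Java array bounds check stands in for the hardware access violation. It 
assumes little-endian byte order, as on x86.

```java
class DwordGatherModel {
    // Hypothetical helper: how many bytes a 4-byte load at the index rounded
    // down to a multiple of 4 would touch beyond an array of 'length' bytes.
    static int bytesPastEnd(int length, int idx) {
        int alignedStart = idx & -4;      // same masking as the byte case above
        int end = alignedStart + 4;       // a dword load always covers 4 bytes
        return Math.max(0, end - length);
    }

    // Scalar model of "gather an int, then shift and truncate" for one lane.
    // Throws ArrayIndexOutOfBoundsException when the dword crosses the end of
    // the array, which stands in for the access violation in this sketch.
    static byte extractByte(byte[] src, int idx) {
        int alignedStart = idx & -4;
        int dword = 0;
        for (int i = 0; i < 4; i++) {
            // Assemble the dword in little-endian order.
            dword |= (src[alignedStart + i] & 0xFF) << (8 * i);
        }
        int shift = (idx & 3) << 3;       // byte position inside the dword, in bits
        return (byte) (dword >>> shift);  // truncate to the requested byte
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40, 50};
        System.out.println(extractByte(src, 2));         // prints 30
        System.out.println(bytesPastEnd(src.length, 4)); // prints 3
    }
}
```

In this model the extraction works for indices 0 to 3, but index 4 of a 
5-byte array needs a load that covers 3 bytes past the end, and 
`extractByte(src, 4)` throws.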

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16354#issuecomment-1821586982
