On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and
>> AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsify sub-word gather with a high performance backend implementation
>> based on a hybrid algorithm which first partially unrolls the scalar loop to
>> accumulate values from gather indices into a quadword (64 bit) slice,
>> followed by a vector permutation to place the slice into the appropriate
>> vector lanes. This prevents code bloat and generates a compact JIT
>> sequence. Coupled with the savings from the expensive array allocation in
>> the existing Java implementation, this translates into significant
>> performance gains of 1.3-5x with the included micro.
>>
>>
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>>
>>
>> 2) The patch was also compared against a modified Java fallback
>> implementation that replaces the temporary array allocation with a zero
>> initialized vector and a scalar loop which inserts gathered values into the
>> vector. But a vector insert operation in the higher vector lanes is a three
>> step process which first extracts the upper 128 bit vector lane, updates it
>> with the gathered sub-word value, and then inserts the lane back into its
>> original position. This makes inserts into higher order lanes costly w.r.t.
>> the proposed solution. In addition, the generated JIT code for the modified
>> fallback implementation was very bulky. This may impact inlining decisions
>> into caller contexts.
>>
>> 3) Some minor adjustments in existing gather instruction patterns for
>> double/quad words.
>>
>>
>> Kindly review and share your feedback.
>>
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Fix incorrect comment

> I have not thought about this thoroughly so it may be incorrect. Can we
> delegate the gather to `int` and extract the result from it.
>
> ```
> vpand(xtmp1, idx, bt == T_SHORT ? -2 : -4); // align the index so that the address is aligned for int accesses
> vpgatherdd(xtmp2, mask,
>            Address(base, xtmp1, bt == T_SHORT ? times_2 : times_1, offset));
> vpand(xtmp1, idx, bt == T_SHORT ? 1 : 3); // Need to align the requested elements
> vpslld(xtmp1, xtmp1, bt == T_SHORT ? 4 : 3);
> vpsrlvd(xtmp1, xtmp2, xtmp1);
> vpmovdw(dst, xtmp1);
> ```

A double word gather will always try to access 4 contiguous bytes starting at
the normalized index, and it cannot prevent an access violation if the other
3 bytes of that double word are inaccessible.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16354#issuecomment-1821586982
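
To make the concern in the reply concrete, here is a minimal sketch of which bytes a single lane of the proposed double word gather would touch. It is a simplified model and not HotSpot code: `ByteRange`, `dword_lane_range` and `reads_past_end` are hypothetical names, the array header and the `offset` operand are ignored, and the payload is assumed to start on a 4 byte boundary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: inclusive range of payload byte offsets.
struct ByteRange {
    std::size_t first;
    std::size_t last;
};

// Bytes touched by one lane when a sub-word element is fetched through a
// 4 byte (double word) load. elem_size is 1 for byte, 2 for short.
inline ByteRange dword_lane_range(std::size_t index, std::size_t elem_size) {
    assert(elem_size == 1 || elem_size == 2);
    // Mirrors the quoted vpand: mask -4 for bytes, -2 for shorts, so the
    // scaled address falls on a 4 byte boundary.
    const std::size_t elems_per_dword = 4 / elem_size;
    const std::size_t aligned_index = index & ~(elems_per_dword - 1);
    const std::size_t first = aligned_index * elem_size;
    return ByteRange{first, first + 3};  // the load always spans 4 bytes
}

// True if the 4 byte load for `index` touches bytes beyond a payload that
// holds `length` elements of `elem_size` bytes each.
inline bool reads_past_end(std::size_t index, std::size_t elem_size,
                           std::size_t length) {
    return dword_lane_range(index, elem_size).last >= length * elem_size;
}
```

Under these assumptions, gathering index 4 from a 5 element byte array touches payload bytes 4 to 7, so three of the four bytes lie beyond the last element; `reads_past_end(4, 1, 5)` returns `true`. Whether those extra bytes are actually unmapped depends on what follows the data in memory, which is the case the reply says the double word gather cannot guard against.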