On Mon, 15 Jan 2024 13:49:06 GMT, Emanuel Peter <epe...@openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this gives >> additional 1.5-4x speedups over existing implementation. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 >> - Removing JDK-8321648 related changes. >> - Refined AVX3 implementation with integral gather. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 >> - Fix incorrect comment >> - Review comments resolutions. >> - Review comments resolutions. >> - Review comments resolutions. >> - Restricting masked sub-word gather to AVX512 target to align with >> integral gather support. >> - ... and 2 more: https://git.openjdk.org/jdk/compare/518ec971...de47076e > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627: > >> 1625: vpsrlvd(dst, dst, xtmp, vlen_enc); >> 1626: // Pack double word vector into byte vector. >> 1627: vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc); > > I would prefer if there was less code duplication here. I think there are > just a few values which you could set to variables, and then apply for both > versions. Meaty part of the algorithm accept different operands, line #1593, #1599 and #1601, keep two flows for SHORT and BYTE separate will be better maintainable. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634: > >> 1632: Register offset, >> XMMRegister offset_vec, XMMRegister idx_vec, >> 1633: XMMRegister xtmp1, >> XMMRegister xtmp2, XMMRegister xtmp3, KRegister mask, >> 1634: KRegister gmask, >> int vlen_enc, int vlen) { > > Would you mind giving a quick summary of what the input registers are and > what exactly this method does? > Why do we need to call `vgather_subword_avx3` so many times > (`lane_count_subwords`)? Method gathers sub-words from gather indices using integral gather instructions, because of the lane size mismatch b/w int and sub-words algorithm makes multiple calls to vgather_subword_avx3. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757: > >> 1755: for (int i = 0; i < 4; i++) { >> 1756: movl(rtmp, Address(idx_base, i * 4)); >> 1757: pinsrw(dst, Address(base, rtmp, Address::times_2), i); > > Do I understand this right that you are basically doing this? > `dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)` > But this does not look like a gather, rather like 4 adjacent loads that pack > the data together into a single 8*4 byte vector. > > Why can this not be done by a simple `32bit` load? Loop scans over integral index array and pick the work from computed address, indexes could be non-contiguous. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452964120 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452964077 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452964030