Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:28 GMT, Jatin Bhateja wrote: >> Or would that require too many registers? > >> Can the `offset` not be added to `idx_base` before the loop? > > Offset needs to be added to each index element, please refer to API > specification for details. >

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:35 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634: >> >>> 1632: Register offset, >>> XMMRegister offset_vec, XMMRegister idx_vec, >>> 1633:

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:40 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627: >> >>> 1625: vpsrlvd(dst, dst, xtmp, vlen_enc); >>> 1626: // Pack double word vector into byte vector. >>> 1627: vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc); >> >> I

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:31 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757: >> >>> 1755: for (int i = 0; i < 4; i++) { >>> 1756: movl(rtmp, Address(idx_base, i * 4)); >>> 1757: pinsrw(dst, Address(base, rtmp, Address::times_2), i); >> >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:17:43 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1900: >> >>> 1898: vgather8b(elem_ty, xtmp3, base, idx_base, rtmp, vlen_enc); >>> 1899: } else { >>> 1900: LP64_ONLY(vgather8b_masked(elem_ty, xtmp3, base, idx_base, >>>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 14:27:43 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this gives >> additional 1.5-4x

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 14:36:38 GMT, Emanuel Peter wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1776: >> >>> 1774: for (int i = 0; i < 4; i++) { >>> 1775: movl(rtmp, Address(idx_base, i * 4)); >>> 1776: addl(rtmp, offset); >> >> Can the `offset` not be added to

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 13:49:06 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this gives >> additional 1.5-4x

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Mon, 15 Jan 2024 14:25:28 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this gives >> additional 1.5-4x

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Mon, 1 Jan 2024 14:36:06 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-01 Thread Jatin Bhateja
> Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and > AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather using hybrid algorithm which initially > partially unrolls scalar loop to accumulates values from gather