On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> Hi!
>
> On Thu, Sep 24, 2020 at 04:55:21PM +0200, Richard Biener wrote:
> > Btw, on x86_64 the following produces sth reasonable:
> >
> > #define N 32
> > typedef int T;
> > typedef T V __attribute__((vector_size(N)));
> > V setg (V v, int idx, T val)
> > {
> >   V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
> >   V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
> >   v = (v & ~mask) | (valv & mask);
> >   return v;
> > }
> >
> >         vmovd   %edi, %xmm1
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm2
> >         vpblendvb       %ymm2, %ymm1, %ymm0, %ymm0
> >         ret
> >
> > I'm quite sure you could do sth similar on power?
>
> This only allows inserting aligned elements.  Which is probably fine
> of course (we don't allow elements that straddle vector boundaries
> either, anyway).
>
> And yes, we can do that :-)
>
> That should be
>
> #define N 32
> typedef int T;
> typedef T V __attribute__((vector_size(N)));
> V setg (V v, int idx, T val)
> {
>   V valv = (V){val, val, val, val, val, val, val, val};
>   V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
>   V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);
>   v = (v & ~mask) | (valv & mask);
>   return v;
> }

Whoops yeah, simplified it a bit too much ;)
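For the record, the compare-and-blend dance is just a branchless form
of a plain variable-index element store; a minimal reference version
(setg_ref is a name made up here for illustration), assuming
0 <= idx < 8 and the typedefs from above:

V setg_ref (V v, int idx, T val)
{
  v[idx] = val;   /* what the masked sequence computes */
  return v;
}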
> after which I get (-march=znver2)
>
> setg:
>         vmovd %edi, %xmm1
>         vmovd %esi, %xmm2
>         vpbroadcastd %xmm1, %ymm1
>         vpbroadcastd %xmm2, %ymm2
>         vpcmpeqd .LC0(%rip), %ymm1, %ymm1
>         vpandn %ymm0, %ymm1, %ymm0
>         vpand %ymm2, %ymm1, %ymm1
>         vpor %ymm0, %ymm1, %ymm0
>         ret

I get with -march=znver2 -O2

        vmovd %edi, %xmm1
        vmovd %esi, %xmm2
        vpbroadcastd %xmm1, %ymm1
        vpbroadcastd %xmm2, %ymm2
        vpcmpeqd .LC0(%rip), %ymm1, %ymm1
        vpblendvb %ymm1, %ymm2, %ymm0, %ymm0

and with -mavx512vl

        vpbroadcastd %edi, %ymm1
        vpcmpd $0, .LC0(%rip), %ymm1, %k1
        vpbroadcastd %esi, %ymm0{%k1}

broadcast-with-mask - heh, it would be interesting if we managed
to combine v[idx1] = val; v[idx2] = val; ;)

Now, with SSE4.2 the 16-byte case (the four-lane variant sketched
at the end of this mail) compiles to

setg:
.LFB0:
        .cfi_startproc
        movd %edi, %xmm3
        movdqa %xmm0, %xmm1
        movd %esi, %xmm4
        pshufd $0, %xmm3, %xmm0
        pcmpeqd .LC0(%rip), %xmm0
        movdqa %xmm0, %xmm2
        pandn %xmm1, %xmm2
        pshufd $0, %xmm4, %xmm1
        pand %xmm1, %xmm0
        por %xmm2, %xmm0
        ret

since there's no blend with a variable mask, IIRC.

With aarch64 and SVE it doesn't handle the 32-byte case at all;
the 16-byte case compiles to

setg:
.LFB0:
        .cfi_startproc
        adrp x2, .LC0
        dup v1.4s, w0
        dup v2.4s, w1
        ldr q3, [x2, #:lo12:.LC0]
        cmeq v1.4s, v1.4s, v3.4s
        bit v0.16b, v2.16b, v1.16b

which looks equivalent to the AVX2 code.

For all of those, varying the vector element type may also cause
"issues", I guess (a 16-bit-lane variant is sketched at the end
as well).

> .LC0:
>         .long 0
>         .long 1
>         .long 2
>         .long 3
>         .long 4
>         .long 5
>         .long 6
>         .long 7
>
> and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
>
> setg:
>         addis 9,2,.LC0@toc@ha
>         mtvsrws 32,5
>         mtvsrws 33,6
>         addi 9,9,.LC0@toc@l
>         lxv 45,0(9)
>         vcmpequw 0,0,13
>         xxsel 34,34,33,32
>         blr
>
> .LC0:
>         .long 0
>         .long 1
>         .long 2
>         .long 3
>
> (We can generate that 0..3 vector without doing loads; I guess x86 can
> do that as well?  But it takes more than one insn to do (of course we
> have to set up the memory address first *with* the load, heh).)
>
> For power8 it becomes (we need to splat in separate insns):
>
> setg:
>         addis 9,2,.LC0@toc@ha
>         mtvsrwz 32,5
>         mtvsrwz 33,6
>         addi 9,9,.LC0@toc@l
>         lxvw4x 45,0,9
>         xxspltw 32,32,1
>         xxspltw 33,33,1
>         vcmpequw 0,0,13
>         xxsel 34,34,33,32
>         blr
>
>
> Segher
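P.S. The 16-byte / four-lane variant referenced above is just the
obvious adaptation of the fixed testcase; a sketch:

#define N 16
typedef int T;
typedef T V __attribute__((vector_size(N)));
V setg (V v, int idx, T val)
{
  V valv = (V){val, val, val, val};     /* splat val into all lanes */
  V idxv = (V){idx, idx, idx, idx};     /* splat idx into all lanes */
  V mask = ((V){0, 1, 2, 3} == idxv);   /* all-ones in the selected lane */
  v = (v & ~mask) | (valv & mask);      /* merge val into that lane */
  return v;
}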
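And the 16-bit-lane variant: at the source level the idiom carries
over unchanged (a hypothetical sketch; only the per-target codegen
quality should differ):

#define N 16
typedef short T;
typedef T V __attribute__((vector_size(N)));   /* 8 x 16-bit lanes */
V setg (V v, int idx, T val)
{
  V valv = (V){val, val, val, val, val, val, val, val};
  V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
  V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);  /* 16-bit lane masks */
  v = (v & ~mask) | (valv & mask);
  return v;
}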