On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> Hi!
>
> On Thu, Sep 24, 2020 at 04:55:21PM +0200, Richard Biener wrote:
> > Btw, on x86_64 the following produces sth reasonable:
> >
> > #define N 32
> > typedef int T;
> > typedef T V __attribute__((vector_size(N)));
> > V setg (V v, int idx, T val)
> > {
> >   V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
> >   V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
> >   v = (v & ~mask) | (valv & mask);
> >   return v;
> > }
> >
> >         vmovd   %edi, %xmm1
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm2
> >         vpblendvb       %ymm2, %ymm1, %ymm0, %ymm0
> >         ret
> >
> > I'm quite sure you could do sth similar on power?
>
> This only allows inserting aligned elements.  Which is probably fine
> of course (we don't allow elements that straddle vector boundaries
> either, anyway).
>
> And yes, we can do that :-)
>
> That should be
>   #define N 32
>   typedef int T;
>   typedef T V __attribute__((vector_size(N)));
>   V setg (V v, int idx, T val)
>   {
>     V valv = (V){val, val, val, val, val, val, val, val};
>     V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
>     V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);
>     v = (v & ~mask) | (valv & mask);
>     return v;
>   }

Whoops, yeah, I simplified it a bit too much ;)

> after which I get (-march=znver2)
>
> setg:
>         vmovd   %edi, %xmm1
>         vmovd   %esi, %xmm2
>         vpbroadcastd    %xmm1, %ymm1
>         vpbroadcastd    %xmm2, %ymm2
>         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
>         vpandn  %ymm0, %ymm1, %ymm0
>         vpand   %ymm2, %ymm1, %ymm1
>         vpor    %ymm0, %ymm1, %ymm0
>         ret

I get with -march=znver2 -O2

        vmovd   %edi, %xmm1
        vmovd   %esi, %xmm2
        vpbroadcastd    %xmm1, %ymm1
        vpbroadcastd    %xmm2, %ymm2
        vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
        vpblendvb       %ymm1, %ymm2, %ymm0, %ymm0

and with -mavx512vl

        vpbroadcastd    %edi, %ymm1
        vpcmpd  $0, .LC0(%rip), %ymm1, %k1
        vpbroadcastd    %esi, %ymm0{%k1}

Broadcast-with-mask - heh, it would be interesting if we managed
to combine v[idx1] = val; v[idx2] = val; ;)
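
Something like the following (just a sketch in the same generic vector
style; setg2, idx1 and idx2 are made-up names, not from the testcase)
would express that combined insertion of the same value at two variable
indices - OR-ing the two compare masks leaves a single blend or masked
broadcast to do:

  #define N 32
  typedef int T;
  typedef T V __attribute__((vector_size(N)));

  V setg2 (V v, int idx1, int idx2, T val)
  {
    /* Splat the value and both indices, compare each splatted index
       against the 0..7 lane-number vector, and OR the two masks so a
       single blend handles both insertions.  */
    V valv = (V){val, val, val, val, val, val, val, val};
    V idx1v = (V){idx1, idx1, idx1, idx1, idx1, idx1, idx1, idx1};
    V idx2v = (V){idx2, idx2, idx2, idx2, idx2, idx2, idx2, idx2};
    V iota = (V){0, 1, 2, 3, 4, 5, 6, 7};
    V mask = (iota == idx1v) | (iota == idx2v);
    v = (v & ~mask) | (valv & mask);
    return v;
  }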

Now, with SSE4.2 the 16-byte case compiles to

setg:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm3
        movdqa  %xmm0, %xmm1
        movd    %esi, %xmm4
        pshufd  $0, %xmm3, %xmm0
        pcmpeqd .LC0(%rip), %xmm0
        movdqa  %xmm0, %xmm2
        pandn   %xmm1, %xmm2
        pshufd  $0, %xmm4, %xmm1
        pand    %xmm1, %xmm0
        por     %xmm2, %xmm0
        ret

since there's no blend with the mask in an arbitrary register IIRC
(pblendvb takes its mask implicitly in xmm0).

With aarch64 and SVE it doesn't handle the 32-byte case at all;
the 16-byte case compiles to

setg:
.LFB0:
        .cfi_startproc
        adrp    x2, .LC0
        dup     v1.4s, w0
        dup     v2.4s, w1
        ldr     q3, [x2, #:lo12:.LC0]
        cmeq    v1.4s, v1.4s, v3.4s
        bit     v0.16b, v2.16b, v1.16b

which looks equivalent to the AVX2 code.

For all of those, varying the vector element type may also
cause "issues", I guess.

> .LC0:
>         .long   0
>         .long   1
>         .long   2
>         .long   3
>         .long   4
>         .long   5
>         .long   6
>         .long   7
>
> and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
>
> setg:
>         addis 9,2,.LC0@toc@ha
>         mtvsrws 32,5
>         mtvsrws 33,6
>         addi 9,9,.LC0@toc@l
>         lxv 45,0(9)
>         vcmpequw 0,0,13
>         xxsel 34,34,33,32
>         blr
>
> .LC0:
>         .long   0
>         .long   1
>         .long   2
>         .long   3
>
> (We can generate that 0..3 vector without doing loads; I guess x86 can
> do that as well?  But it takes more than one insn to do (of course we
> have to set up the memory address first *with* the load, heh).)
>
> For power8 it becomes (we need to splat in separate insns):
>
> setg:
>         addis 9,2,.LC0@toc@ha
>         mtvsrwz 32,5
>         mtvsrwz 33,6
>         addi 9,9,.LC0@toc@l
>         lxvw4x 45,0,9
>         xxspltw 32,32,1
>         xxspltw 33,33,1
>         vcmpequw 0,0,13
>         xxsel 34,34,33,32
>         blr
>
>
> Segher
