On September 25, 2020 5:50:40 AM GMT+02:00, xionghu luo <luo...@linux.ibm.com> 
wrote:
>Hi,
>
>On 2020/9/24 21:27, Richard Biener wrote:
>> On Thu, Sep 24, 2020 at 10:21 AM xionghu luo <luo...@linux.ibm.com> wrote:
>> 
>> I'll just comment that
>> 
>>          xxperm 34,34,33
>>          xxinsertw 34,0,12
>>          xxperm 34,34,32
>> 
>> doesn't look like a variable-position insert instruction but
>> this is a variable whole-vector rotate plus an insert at index zero
>> followed by a variable whole-vector rotate.  I'm not fluent in
>> ppc assembly but
>> 
>>          rlwinm 6,6,2,28,29
>>          mtvsrwz 0,5
>>          lvsr 1,0,6
>>          lvsl 0,0,6
>> 
>> possibly computes the shift masks for r33/r32?  though
>> I do not see those registers mentioned...
>
>For V4SI:
>       rlwinm 6,6,2,28,29      // r6*4
>       mtvsrwz 0,5             // vs0   <- r5  (0xfe)
>       lvsr 1,0,6              // vs33  <- lvsr[r6]
>       lvsl 0,0,6              // vs32  <- lvsl[r6] 
>       xxperm 34,34,33       
>       xxinsertw 34,0,12
>       xxperm 34,34,32
>       blr
>
>
>idx is scaled by 4 (the byte offset fed to lvsr/lvsl):
>
>idx=0  idx*4=0   input vs34: 0x4000000030000000200000001
>  after first xxperm: 0x4000000030000000200000001
>  vs33 (lvsr mask):   0x101112131415161718191a1b1c1d1e1f
>  vs32 (lvsl mask):   0x102030405060708090a0b0c0d0e0f
>idx=1  idx*4=4   input vs34: 0x4000000030000000200000001
>  after first xxperm: 0x1000000040000000300000002
>  vs33 (lvsr mask):   0xc0d0e0f101112131415161718191a1b
>  vs32 (lvsl mask):   0x405060708090a0b0c0d0e0f10111213
>idx=2  idx*4=8   input vs34: 0x4000000030000000200000001
>  after first xxperm: 0x2000000010000000400000003
>  vs33 (lvsr mask):   0x8090a0b0c0d0e0f1011121314151617
>  vs32 (lvsl mask):   0x8090a0b0c0d0e0f1011121314151617
>idx=3  idx*4=12  input vs34: 0x4000000030000000200000001
>  after first xxperm: 0x3000000020000000100000004
>  vs33 (lvsr mask):   0x405060708090a0b0c0d0e0f10111213
>  vs32 (lvsl mask):   0xc0d0e0f101112131415161718191a1b
>
>Resulting vs34 for idx = 0, 1, 2, 3 (inserted value 0xfe):
> 0x40000000300000002000000fe
> 0x400000003000000fe00000001
> 0x4000000fe0000000200000001
> 0xfe000000030000000200000001
>
>
>"xxinsertw 34,0,12" always inserts the content of vs0[32:63] into the
>fourth word of the target vector, bits [96:127].  Then the second
>xxperm rotates the modified vector back.
>
>All the instructions are register-based operations.  As Segher
>replied, power9 supports only fixed-position inserts, so we need to do
>some trickery here to support it instead of generating a short store
>followed by a wide load.
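>
>As an illustration only (this is not the actual rs6000 expander; the
>function name vec_set_var and the mask construction below are made up
>for the example), the same rotate / insert-at-lane-0 / rotate-back
>idea can be sketched with GCC's generic vector extensions.  On power9
>the variable-mask __builtin_shuffle should map to xxperm and the
>constant-index element write to a fixed-position xxinsertw:
>
>typedef int v4si __attribute__ ((vector_size (16)));
>
>v4si
>vec_set_var (v4si v, int idx, int val)
>{
>  v4si lanes = { 0, 1, 2, 3 };
>  v4si rot   = lanes + idx;             /* bring element idx to lane 0 */
>  v4si unrot = lanes - idx;             /* inverse rotation */
>  v4si t = __builtin_shuffle (v, rot & 3);
>  t[0] = val;                           /* insert at constant lane 0 */
>  return __builtin_shuffle (t, unrot & 3);
>}
>
>Whether the whole sequence stays in vector registers is of course up
>to the target's vec_perm and vec_set patterns; the sketch only shows
>the shape of the expansion.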

OK, fair enough - I've heard Power10 does support those inserts. 

>
>> 
>> This might be a viable generic expansion strategy btw,
>> which is why I asked before whether the CPU supports
>> inserts at a variable position ...  the building blocks are
>> already there with vec_set at constant zero position
>> plus vec_perm_const for the rotates.
>> 
>> But well, I did ask this question.  Multiple times.
>> 
>> ppc does _not_ have a VSX instruction
>> like xxinsertw r34, r8, r12 where r8 denotes
>> the vector element (or byte position or whatever).
>> 
>> So I don't think vec_set with a variable index is the
>> best approach.
>> Xionghu - you said even without the patch the stack
>> storage is eventually elided but
>> 
>>          addi 9,1,-16
>>          rldic 6,6,2,60
>>          stxv 34,-16(1)
>>          stwx 5,9,6
>>          lxv 34,-16(1)
>> 
>> still shows stack(?) store/load with a bad STLF penalty.
>
>
>Sorry if I didn't describe it clearly or misunderstood you.  I meant
>that if many instructions are inserted (tested by inserting a loop)
>between "stwx 5,9,6" and "lxv 34,-16(1)", the store-hit-load
>performance issue can be hidden, but this is not the solution we want.
>
>I also changed your test as below and built it for x86; it seems it
>also generates inefficient code?  Maybe what my patch handles is a
>different usage from your pasted case?
>
>#define N 32
>typedef int T;
>typedef T V __attribute__((vector_size(N)));
>
>V setg3 (V v, int idx, T val)
>{
>  v[idx&31] = val;
>  return v;
>}
>
>-O2 -S -mavx -march=znver2:
>
>setg3:
>        push    rbp
>        and     edi, 31
>        mov     rbp, rsp
>        and     rsp, -32
>        vmovdqa YMMWORD PTR [rsp-32], ymm0
>        mov     DWORD PTR [rsp-32+rdi*4], esi
>        vmovdqa ymm0, YMMWORD PTR [rsp-32]
>        leave
>        ret
>
>
>When idx is constant: 
>
>setg3:
>        vpinsrd xmm1, xmm0, esi, 3
>        vinserti128     ymm0, ymm0, xmm1, 0x0
>        ret
>
>And ARM with -O2 -S -march=armv8.2-a+sve (with N changed to 16): 
>
>setg3:
>        sub     sp, sp, #16
>        and     x0, x0, 15
>        str     q0, [sp]
>        str     w1, [sp, x0, lsl 2]
>        ldr     q0, [sp]
>        add     sp, sp, 16
>        ret
>
>When idx is constant: 
>
>setg3:
>        ins     v0.s[3], w1
>        ret
>
>
>Though I've no idea how to optimize this on x86 and ARM with vector
>instructions to avoid the short store followed by a wide load on the
>stack.

Yes, all targets suffer from this. On x86 you can play similar tricks to those
on power9; see my attempt in the other mail (and Segher's fix). 
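
One generic way to write such a trick in source form (this is only a
sketch, not the attempt from the other mail; the name setg3_blend and
the lane masking are made up for the example) is to select the new
value with a compare-generated mask instead of spilling to the stack:

typedef int T;
typedef T V __attribute__((vector_size(32)));

V setg3_blend (V v, int idx, T val)
{
  V lanes = { 0, 1, 2, 3, 4, 5, 6, 7 };
  V idxsplat = { 0 };
  V valsplat = { 0 };
  idxsplat += (idx & 7);         /* broadcast the lane index */
  valsplat += val;               /* broadcast the new value */
  V mask = lanes == idxsplat;    /* all-ones in the selected lane */
  return (v & ~mask) | (valsplat & mask);
}

With AVX2 this can become a compare-equal plus a blend, and a 16-byte
variant can become cmeq plus bsl on AArch64, though whether a given
compiler version actually produces that is not guaranteed.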

Richard. 

>
>Thanks,
>Xionghu
