[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

crazylht at gmail dot com via Gcc-bugs Sat, 26 Sep 2020 20:07:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


--- Comment #23 from Hongtao.liu <crazylht at gmail dot com> ---
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 

The cost of vec_construct in the i386 backend is 64, calculated as 16 x 4.

cut from i386.c
---
/* N element inserts into SSE vectors.  */ 
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
---

From the perspective of pipeline latency, it seems OK, but from the perspective
of rtx_cost, it seems inaccurate, since the vector would be initialized as
---
        vmovd   %eax, %xmm0
        vpinsrb $1, 1(%rsi), %xmm0, %xmm0
        vmovd   %eax, %xmm7
        vpinsrb $1, 3(%rsi), %xmm7, %xmm7
        vmovd   %eax, %xmm3
        vpinsrb $1, 17(%rsi), %xmm3, %xmm3
        vmovd   %eax, %xmm6
        vpinsrb $1, 19(%rsi), %xmm6, %xmm6
        vmovd   %eax, %xmm1
        vpinsrb $1, 33(%rsi), %xmm1, %xmm1
        vmovd   %eax, %xmm5
        vpinsrb $1, 35(%rsi), %xmm5, %xmm5
        vmovd   %eax, %xmm2
        vpinsrb $1, 49(%rsi), %xmm2, %xmm2
        vmovd   %eax, %xmm4
        vpinsrb $1, 51(%rsi), %xmm4, %xmm4
        vpunpcklwd      %xmm6, %xmm3, %xmm3
        vpunpcklwd      %xmm4, %xmm2, %xmm2
        vpunpcklwd      %xmm7, %xmm0, %xmm0
        vpunpcklwd      %xmm5, %xmm1, %xmm1
        vpunpckldq      %xmm2, %xmm1, %xmm1
        vpunpckldq      %xmm3, %xmm0, %xmm0
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
---

That's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation" operations,
so the cost should be 92 (23 * 4).
