https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #23 from Hongtao.liu <crazylht at gmail dot com> ---
> _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179};

The cost of vec_construct in the i386 backend is 64, calculated as 16 x 4.

Cut from i386.c:
---
      /* N element inserts into SSE vectors.  */
      int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
---

From the perspective of pipeline latency, it seems OK, but from the perspective
of rtx_cost, it seems inaccurate, since the vector would be initialized as
---
        vmovd   %eax, %xmm0
        vpinsrb $1, 1(%rsi), %xmm0, %xmm0
        vmovd   %eax, %xmm7
        vpinsrb $1, 3(%rsi), %xmm7, %xmm7
        vmovd   %eax, %xmm3
        vpinsrb $1, 17(%rsi), %xmm3, %xmm3
        vmovd   %eax, %xmm6
        vpinsrb $1, 19(%rsi), %xmm6, %xmm6
        vmovd   %eax, %xmm1
        vpinsrb $1, 33(%rsi), %xmm1, %xmm1
        vmovd   %eax, %xmm5
        vpinsrb $1, 35(%rsi), %xmm5, %xmm5
        vmovd   %eax, %xmm2
        vpinsrb $1, 49(%rsi), %xmm2, %xmm2
        vmovd   %eax, %xmm4
        vpinsrb $1, 51(%rsi), %xmm4, %xmm4
        vpunpcklwd      %xmm6, %xmm3, %xmm3
        vpunpcklwd      %xmm4, %xmm2, %xmm2
        vpunpcklwd      %xmm7, %xmm0, %xmm0
        vpunpcklwd      %xmm5, %xmm1, %xmm1
        vpunpckldq      %xmm2, %xmm1, %xmm1
        vpunpckldq      %xmm3, %xmm0, %xmm0
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
---

That is 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation"
instructions, so the cost should be 92 (23 * 4).
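
For reference, the arithmetic above can be written out as a small standalone
sketch. This is not GCC code: the function names are hypothetical, and
SSE_OP_COST = 4 is only inferred from the "16 x 4" figure (the real value of
ix86_cost->sse_op depends on the selected tuning). The second function simply
counts the instructions in the sequence shown above, assuming pairs of
elements are built first and then merged in a binary tree of unpack steps.

```c
#include <assert.h>

/* Assumed per-instruction cost, inferred from "16 x 4" in the comment.  */
enum { SSE_OP_COST = 4 };

/* Formula quoted from i386.c: one sse_op per vector element.  */
static int
vec_construct_cost_current (int nelts)
{
  return nelts * SSE_OP_COST;
}

/* Hypothetical alternative: count every instruction in the emitted
   sequence.  NELTS must be a power of two and at least 2.  */
static int
vec_construct_cost_counted (int nelts)
{
  assert (nelts >= 2 && (nelts & (nelts - 1)) == 0);

  /* One vmovd or vpinsrb per element.  */
  int inserts = nelts;

  /* After the inserts there are nelts / 2 partial vectors; each unpack
     level halves that number until a single vector remains.  */
  int merges = 0;
  for (int parts = nelts / 2; parts > 1; parts /= 2)
    merges += parts / 2;

  return (inserts + merges) * SSE_OP_COST;
}
```

With nelts = 16 the first function gives 64 and the second gives
(16 + 4 + 2 + 1) * 4 = 92, matching the numbers in the comment.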