https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93919

--- Comment #4 from Matthias Kretz (Vir) <kretz at kde dot org> ---
Yes, this is the same issue.

FWIW, a vectorization with SSE4.1 could do:
  pxor xmm0, xmm0
  pinsrw xmm0, WORD PTR in[rip], 0
  pmovsxbw xmm0, xmm0
  movd DWORD PTR out[rip], xmm0

Whether that's faster than
  movsx eax, BYTE PTR in[rip]
  mov WORD PTR out[rip], ax
  movsx eax, BYTE PTR in[rip+1]
  mov WORD PTR out[rip+2], ax

probably depends on whether the load/store ports are the bottleneck in this
section of code: the SSE4.1 sequence needs one load and one store, the scalar
one needs two of each. Without SSE4.1 I don't think it's worth vectorizing
this conversion.
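
For reference, the two sequences above can be sketched in C with intrinsics.
This is only an illustration under assumptions: the function names
widen2_scalar and widen2_sse41 are made up here, the element types
(int8_t in, int16_t out) are inferred from the asm, and the target attribute
is GCC/Clang specific. A compiler need not emit exactly the instructions
shown above (for instance, it may use movd or movzx for the load instead of
pxor + pinsrw).

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics: _mm_cvtepi8_epi16 (pmovsxbw) */
#include <stdint.h>
#include <string.h>

/* Scalar variant: one sign-extending load and one 16-bit store per element. */
static void widen2_scalar(const int8_t *in, int16_t *out)
{
    out[0] = in[0];
    out[1] = in[1];
}

/* SSE4.1 variant (hypothetical helper): one 16-bit load, one pmovsxbw,
 * one 32-bit store. The target attribute lets this function use SSE4.1
 * even when the file is not compiled with -msse4.1. */
__attribute__((target("sse4.1")))
static void widen2_sse41(const int8_t *in, int16_t *out)
{
    uint16_t pair;
    memcpy(&pair, in, sizeof pair);         /* load both input bytes at once */

    __m128i v = _mm_cvtsi32_si128(pair);    /* bytes land in lanes 0 and 1, rest zeroed */
    v = _mm_cvtepi8_epi16(v);               /* sign-extend 8-bit lanes to 16-bit lanes */

    int32_t packed = _mm_cvtsi128_si32(v);  /* low 32 bits hold the two int16_t results */
    memcpy(out, &packed, sizeof packed);    /* single 32-bit store, like movd */
}
```

Both functions write exactly four bytes to out, which matches the correction
below that there is no out-of-bounds store.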

In any case, my analysis that there's an out-of-bounds store was wrong. Please
disregard.
