https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93919
--- Comment #4 from Matthias Kretz (Vir) <kretz at kde dot org> ---
Yes, this is the same issue. FWIW, a vectorization with SSE4.1 could do:

        pxor     xmm0, xmm0
        pinsrw   xmm0, WORD PTR in[rip], 0
        pmovsxbw xmm0, xmm0
        movd     DWORD PTR out[rip], xmm0

Whether that's faster than

        movsx    eax, BYTE PTR in[rip]
        mov      WORD PTR out[rip], ax
        movsx    eax, BYTE PTR in[rip+1]
        mov      WORD PTR out[rip+2], ax

probably depends on whether the load/store ports are limiting the performance
on this section of code. Without SSE4.1 I don't think it's worth vectorizing
this conversion.

In any case, my analysis that there's an out-of-bounds store was wrong. Please
disregard.
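For reference, here is a rough C sketch of the two sequences compared above. It assumes the conversion in question widens two signed 8-bit values to two 16-bit values; the function names widen2_scalar and widen2_sse41 are hypothetical and do not come from the bug report. The intrinsic path is only compiled when SSE4.1 is enabled (e.g. -msse4.1); otherwise it falls back to the scalar form.

```c
#include <stdint.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

/* Scalar form: two sign-extending byte loads and two 16-bit stores. */
static void widen2_scalar(const int8_t in[2], int16_t out[2])
{
    out[0] = in[0];
    out[1] = in[1];
}

/* SSE4.1 form (hypothetical helper): one 16-bit load, one 32-bit store. */
static void widen2_sse41(const int8_t in[2], int16_t out[2])
{
#ifdef __SSE4_1__
    uint16_t pair;
    int32_t packed;

    memcpy(&pair, in, sizeof pair);          /* load both bytes at once */
    __m128i v = _mm_insert_epi16(_mm_setzero_si128(), pair, 0); /* pxor + pinsrw */
    v = _mm_cvtepi8_epi16(v);                /* pmovsxbw: sign-extend bytes to words */
    packed = _mm_cvtsi128_si32(v);           /* movd: low 32 bits = two 16-bit results */
    memcpy(out, &packed, sizeof packed);     /* store both results at once */
#else
    widen2_scalar(in, out);                  /* no SSE4.1: keep the scalar sequence */
#endif
}
```

Both functions write exactly four bytes to out, which matches the observation that there is no out-of-bounds store. Which one is faster is not something this sketch can settle; that depends on the surrounding code, as noted above.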