https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101096
--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
For foo, after adding support for the down-convert instruction, the codegen
difference is shown below.

@@ -6,15 +6,17 @@ foo:
 .LFB0:
        .cfi_startproc
-       movl    $255, %eax
-       vpbroadcastw    %eax, %xmm0
-       vpand   16(%rsi), %xmm0, %xmm2
-       vpand   (%rsi), %xmm0, %xmm1
-       vpackuswb       %xmm2, %xmm1, %xmm1
-       vpand   (%rdi), %xmm0, %xmm2
-       vpand   16(%rdi), %xmm0, %xmm0
-       vpackuswb       %xmm0, %xmm2, %xmm0
-       vpaddb  %xmm0, %xmm1, %xmm0
+       vmovdqu16       (%rsi), %xmm1
+       vmovdqu16       16(%rsi), %xmm0
+       vmovdqu16       16(%rdi), %xmm2
+       vpmovwb %xmm0, %xmm0
+       vpmovwb %xmm1, %xmm1
+       vpunpcklqdq     %xmm0, %xmm1, %xmm1
+       vmovdqu16       (%rdi), %xmm0
+       vpmovwb %xmm2, %xmm2
+       vpmovwb %xmm0, %xmm0
+       vpunpcklqdq     %xmm2, %xmm0, %xmm0
+       vpaddb  %xmm1, %xmm0, %xmm0
        vmovdqu8        %xmm0, (%rdx)

If the GCC vectorizer supported different vector lengths (so that we would not
need to down-convert and then pack), vpmovwb might be better. Without that, the
instruction count is roughly the same, but vpmovwb is more expensive than
vpand.
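
For reference, here is a minimal scalar sketch in C of why the two sequences in the diff above should produce the same bytes. It is not the testcase from this PR. It models a single 16-bit lane, and the helper names (pack_usat, truncate_word, mask_then_pack, count_mismatches, check_equivalence) are made up for illustration. The assumption is the documented lane semantics: vpackuswb saturates a signed word to an unsigned byte, which is why the old code masks with 0x00ff first, while vpmovwb simply keeps the low 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane model of vpackuswb: interpret the 16-bit lane as signed and
   saturate it to the unsigned 8-bit range [0, 255]. */
static uint8_t pack_usat(uint16_t lane)
{
    /* Sign-extend by hand to avoid implementation-defined conversions. */
    int32_t s = lane >= 0x8000 ? (int32_t)lane - 0x10000 : (int32_t)lane;
    if (s < 0)
        return 0;
    if (s > 255)
        return 255;
    return (uint8_t)s;
}

/* Per-lane model of vpmovwb: keep the low 8 bits, no saturation. */
static uint8_t truncate_word(uint16_t lane)
{
    return (uint8_t)(lane & 0x00ff);
}

/* Old sequence: vpand with a broadcast 0x00ff, then vpackuswb.
   After the mask the lane is in [0, 255], so saturation never fires. */
static uint8_t mask_then_pack(uint16_t lane)
{
    return pack_usat((uint16_t)(lane & 0x00ff));
}

/* Count the 16-bit inputs on which the two models disagree. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    for (uint32_t v = 0; v <= 0xffff; v++)
        if (mask_then_pack((uint16_t)v) != truncate_word((uint16_t)v))
            mismatches++;
    return mismatches;
}

/* Hypothetical self-check: aborts if the two models ever differ. */
static void check_equivalence(void)
{
    assert(count_mismatches() == 0);
}
```

Calling check_equivalence() from a small main is enough to exercise all 65536 lane values. The sketch only covers the result, not the cost: it says nothing about the latency or throughput of vpmovwb relative to vpand.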