https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101096

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
For foo, after adding support for the down-convert instruction (vpmovwb),
below is the difference in codegen.

@@ -6,15 +6,17 @@
 foo:
 .LFB0:
        .cfi_startproc
-       movl    $255, %eax
-       vpbroadcastw    %eax, %xmm0
-       vpand   16(%rsi), %xmm0, %xmm2
-       vpand   (%rsi), %xmm0, %xmm1
-       vpackuswb       %xmm2, %xmm1, %xmm1
-       vpand   (%rdi), %xmm0, %xmm2
-       vpand   16(%rdi), %xmm0, %xmm0
-       vpackuswb       %xmm0, %xmm2, %xmm0
-       vpaddb  %xmm0, %xmm1, %xmm0
+       vmovdqu16       (%rsi), %xmm1
+       vmovdqu16       16(%rsi), %xmm0
+       vmovdqu16       16(%rdi), %xmm2
+       vpmovwb %xmm0, %xmm0
+       vpmovwb %xmm1, %xmm1
+       vpunpcklqdq     %xmm0, %xmm1, %xmm1
+       vmovdqu16       (%rdi), %xmm0
+       vpmovwb %xmm2, %xmm2
+       vpmovwb %xmm0, %xmm0
+       vpunpcklqdq     %xmm2, %xmm0, %xmm0
+       vpaddb  %xmm1, %xmm0, %xmm0
        vmovdqu8        %xmm0, (%rdx)

If the GCC vectorizer supported different vector lengths (then we wouldn't
need to down-convert and pack), vpmovwb might be better. As it stands, the
instruction count is about the same (11 versus 9 in the diff above), but
vpmovwb is more expensive than vpand.
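
To make the comparison concrete, here is a rough scalar model in C of the two
narrowing strategies in the diff: vpand with 0x00ff followed by vpackuswb,
versus a plain vpmovwb truncation. The source of foo is not quoted in this
comment, so foo_model and every other name below are hypothetical stand-ins
inferred from the assembly, not the actual testcase. The sketch only shows
that both sequences compute the same bytes; it says nothing about cost.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of "vpand 0x00ff" followed by "vpackuswb".
   vpackuswb reads each word as signed and saturates it to 0..255. */
static uint8_t narrow_mask_pack(uint16_t w)
{
    int16_t s = (int16_t)(w & 0x00ffu);   /* vpand: keep the low byte */
    if (s < 0)
        return 0;                         /* cannot happen after the mask */
    if (s > 255)
        return 255;                       /* cannot happen after the mask */
    return (uint8_t)s;
}

/* Scalar model of one lane of "vpmovwb": plain truncation to the low byte. */
static uint8_t narrow_truncate(uint16_t w)
{
    return (uint8_t)w;
}

/* Hypothetical stand-in for foo: narrow 16 words from each input to bytes
   and add them with byte wraparound, as vpaddb does. */
void foo_model(const uint16_t *a, const uint16_t *b, uint8_t *c)
{
    for (size_t i = 0; i < 16; i++)
        c[i] = (uint8_t)(narrow_truncate(a[i]) + narrow_truncate(b[i]));
}

/* Count the 16-bit inputs on which the two narrowing models disagree. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t w = 0; w <= 0xffffu; w++)
        if (narrow_mask_pack((uint16_t)w) != narrow_truncate((uint16_t)w))
            bad++;
    return bad;
}
```

Because the mask clears the high byte first, the saturation in vpackuswb never
triggers, so the two strategies agree for every input and the choice between
them is purely a question of instruction cost.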
