http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59835

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Untested patch for kunpckhi:
2014-01-16  Jakub Jelinek  <ja...@redhat.com>

    * config/i386/i386.md (kunpckhi): Add GPR alternative.

--- gcc/config/i386/i386.md.jj    2014-01-09 21:07:23.000000000 +0100
+++ gcc/config/i386/i386.md    2014-01-16 17:53:54.983352747 +0100
@@ -8486,14 +8486,16 @@ (define_insn "kortestchi"
    (set_attr "prefix" "vex")])

 (define_insn "kunpckhi"
-  [(set (match_operand:HI 0 "register_operand" "=Yk")
+  [(set (match_operand:HI 0 "register_operand" "=Yk,Q")
     (ior:HI
       (ashift:HI
-        (match_operand:HI 1 "register_operand" "Yk")
+        (match_operand:HI 1 "register_operand" "Yk,Q")
         (const_int 8))
-      (zero_extend:HI (match_operand:QI 2 "register_operand" "Yk"))))]
+      (zero_extend:HI (match_operand:QI 2 "register_operand" "Yk,0"))))]
   "TARGET_AVX512F"
-  "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
+  "@
+   kunpckbw\t{%2, %1, %0|%0, %1, %2}
+   mov{b}\t{%b1, %h0|%h0, %b1}"
   [(set_attr "mode" "HI")
    (set_attr "type" "msklog")
    (set_attr "prefix" "vex")])

Of course, no real performance testing has been done.  Perhaps the =Q, Q, 0
alternative should carry one or more ? modifiers, but while without any ? we
no longer ICE or endlessly consume memory, with a single ? we do again.
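
For reference, a minimal C sketch of why the GPR alternative can get away with
a single byte move.  It assumes the RTL above is read as
(op1 << 8) | zero_extend (op2), and that the "0" constraint puts operand 2 in
the low byte of operand 0.  The function names are hypothetical and exist only
for illustration; none of this is part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the RTL pattern:
   (ior:HI (ashift:HI op1 (const_int 8)) (zero_extend:HI op2:QI)).  */
static uint16_t
kunpck_rtl (uint16_t op1, uint8_t op2)
{
  return (uint16_t) ((uint16_t) (op1 << 8) | op2);
}

/* Model of the GPR alternative (=Q, Q, 0): operand 0 already holds
   operand 2 in its low byte, and "mov %b1, %h0" overwrites bits 8..15
   with the low byte of operand 1.  The 0xab00 is deliberate garbage
   (an assumption of this sketch) to show that the old high byte of
   operand 0 does not matter.  */
static uint16_t
kunpck_gpr (uint16_t op1, uint8_t op2)
{
  uint16_t op0 = (uint16_t) (0xab00u | op2);   /* tied operand */
  op0 = (uint16_t) ((op0 & 0x00ffu) | ((op1 & 0x00ffu) << 8));
  return op0;
}

/* Exhaustively check that both models agree.  */
static void
kunpck_self_check (void)
{
  for (uint32_t a = 0; a <= 0xffffu; a++)
    for (uint32_t b = 0; b <= 0xffu; b++)
      assert (kunpck_rtl ((uint16_t) a, (uint8_t) b)
              == kunpck_gpr ((uint16_t) a, (uint8_t) b));
}
```

Only the low byte of operand 1 survives the shift in HImode, which is why
moving %b1 into %h0 is enough in this model.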

With -O2 -march=k8 -mavx512f the patch changes the generated code as follows
(before r206638 vs. trunk + patch):
-    kmovw    %edi, %k1
-    kunpckbw    %k1, %k1, %k0
-    kmovw    %k0, -8(%rsp)
-    movd    -8(%rsp), %mm0
+    movl    %edi, %eax
+    movb    %al, %ah
+    movd    %eax, %mm0
Dunno of course how that compares performance-wise, but at least it is shorter.
For -O2 -mavx512f:
-    kmovw    %edi, %k1
-    kunpckbw    %k1, %k1, %k0
-    kmovw    %k0, -8(%rsp)
+    movl    %edi, %eax
+    movl    %edi, %edx
+    movb    %al, %dh
+    movw    %dx, -8(%rsp)
so in this case perhaps using mask registers is better, as we store the result
into memory anyway.
