http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59835
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Untested patch for kunpckhi:

2014-01-16  Jakub Jelinek  <ja...@redhat.com>

        * config/i386/i386.md (kunpckhi): Add GPR alternative.

--- gcc/config/i386/i386.md.jj  2014-01-09 21:07:23.000000000 +0100
+++ gcc/config/i386/i386.md     2014-01-16 17:53:54.983352747 +0100
@@ -8486,14 +8486,16 @@ (define_insn "kortestchi"
    (set_attr "prefix" "vex")])
 
 (define_insn "kunpckhi"
-  [(set (match_operand:HI 0 "register_operand" "=Yk")
+  [(set (match_operand:HI 0 "register_operand" "=Yk,Q")
         (ior:HI
           (ashift:HI
-            (match_operand:HI 1 "register_operand" "Yk")
+            (match_operand:HI 1 "register_operand" "Yk,Q")
             (const_int 8))
-          (zero_extend:HI (match_operand:QI 2 "register_operand" "Yk"))))]
+          (zero_extend:HI (match_operand:QI 2 "register_operand" "Yk,0"))))]
   "TARGET_AVX512F"
-  "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
+  "@
+   kunpckbw\t{%2, %1, %0|%0, %1, %2}
+   mov{b}\t{%b1, %h0|%h0, %b1}"
   [(set_attr "mode" "HI")
    (set_attr "type" "msklog")
    (set_attr "prefix" "vex")])

Of course, no real performance testing has been performed, perhaps there
should be one ? or more for the =Q, Q, 0 alternative.  Without any ?, we
don't ICE or endlessly consume memory anymore, with one ? we do again.

With -O2 -march=k8 -mavx512f the patch changes (from before r206638 to
trunk + patch):
-       kmovw   %edi, %k1
-       kunpckbw        %k1, %k1, %k0
-       kmovw   %k0, -8(%rsp)
-       movd    -8(%rsp), %mm0
+       movl    %edi, %eax
+       movb    %al, %ah
+       movd    %eax, %mm0
Dunno of course how that compares performance wise, but at least it is
shorter.

For -O2 -mavx512f:
-       kmovw   %edi, %k1
-       kunpckbw        %k1, %k1, %k0
-       kmovw   %k0, -8(%rsp)
+       movl    %edi, %eax
+       movl    %edi, %edx
+       movb    %al, %dh
+       movw    %dx, -8(%rsp)
so in this case perhaps using mask registers is better, as we store the
result into memory anyway.
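
To illustrate why a single movb into the high-byte register can stand in for kunpckbw in the =Q, Q, 0 alternative, here is a minimal C sketch of the two computations. It is only a model of the pattern's RTL and of the byte move, not code from GCC; the function names kunpck_rtl, kunpck_gpr and count_mismatches are hypothetical, and the model assumes that only the low byte of operand 2 is meaningful on input (it is a QI operand).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the pattern's RTL:
   (ior:HI (ashift:HI op1 8) (zero_extend:HI op2)), truncated to 16 bits. */
uint16_t kunpck_rtl(uint16_t op1, uint8_t op2)
{
    return (uint16_t)(((uint32_t)op1 << 8) | op2);
}

/* Hypothetical model of the GPR alternative: operand 0 is tied to
   operand 2 (constraint "0"), so it already holds op2 in its low byte.
   The byte move then copies the low byte of op1 into the high byte of
   operand 0, as movb %b1, %h0 would. */
uint16_t kunpck_gpr(uint16_t op1, uint8_t op2)
{
    uint16_t op0 = op2;                      /* op0 starts out as op2 */
    uint16_t high = (uint16_t)((op1 & 0xffu) << 8);
    op0 = (uint16_t)((op0 & 0x00ffu) | high);
    return op0;
}

/* Exhaustively compare both models over every 16-bit op1 and 8-bit op2
   and return the number of inputs on which they disagree. */
unsigned long count_mismatches(void)
{
    unsigned long bad = 0;
    for (uint32_t a = 0; a <= 0xffffu; a++)
        for (uint32_t b = 0; b <= 0xffu; b++)
            if (kunpck_rtl((uint16_t)a, (uint8_t)b)
                != kunpck_gpr((uint16_t)a, (uint8_t)b))
                bad++;
    return bad;
}
```

In the assembly above both inputs come from %edi, which corresponds to calling either function with op2 equal to the low byte of op1. The sketch says nothing about which form is faster; it only checks that the two forms compute the same 16-bit value.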