http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57233

            Bug ID: 57233
           Summary: Vector lowering of LROTATE_EXPR pessimizes code
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: glisse at gcc dot gnu.org

Hello,

when the vector lowering pass sees a rotate on a vector type for which rotate
is not a supported operation, it lowers it to scalar rotates. However, from a
quick look at the RTL expanders (untested), they know how to handle a vector
rotate as long as vector shifts and ior are supported, and that would yield
better code than the scalar ops. So I think the vector lowering pass should
check not only whether rotate is supported, but also whether shifts and ior
are, before splitting the operation.
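
To make the proposed check concrete, here is a minimal standalone sketch in C.
It is not GCC code: struct target_caps, enum lowering and
decide_rotate_lowering are hypothetical names invented for illustration,
standing in for whatever queries the lowering pass actually performs.

```c
#include <stdbool.h>

/* Hypothetical description of what a target supports for one vector mode.
   These fields are illustrative only and are not GCC data structures. */
struct target_caps {
  bool vec_rotate;  /* native vector rotate */
  bool vec_shl;     /* vector shift left */
  bool vec_shr;     /* vector logical shift right */
  bool vec_ior;     /* vector inclusive or */
};

enum lowering {
  KEEP_VECTOR_ROTATE,    /* rotate is supported directly */
  EXPAND_AS_SHIFTS_IOR,  /* leave it to the expander: (a << n) | (a >> (w - n)) */
  SPLIT_TO_SCALARS       /* piecewise scalar rotates */
};

/* Split the rotate only when neither the rotate itself nor the
   shift/shift/ior fallback is available on vectors. */
static enum lowering decide_rotate_lowering(struct target_caps t)
{
  if (t.vec_rotate)
    return KEEP_VECTOR_ROTATE;
  if (t.vec_shl && t.vec_shr && t.vec_ior)
    return EXPAND_AS_SHIFTS_IOR;
  return SPLIT_TO_SCALARS;
}
```

With a target that has vector shifts and ior but no vector rotate, this sketch
picks EXPAND_AS_SHIFTS_IOR rather than splitting into scalars.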

typedef unsigned vec __attribute__((vector_size(4*sizeof(int))));
vec f(vec a){
  return (a<<2)|(a>>30);
}

without rotate recognition (the expression stays as two shifts and an ior):
    vpsrld    $30, %xmm0, %xmm1
    vpslld    $2, %xmm0, %xmm0
    vpor    %xmm0, %xmm1, %xmm0

with a patch that recognizes rotate for vectors:
    vpextrd    $2, %xmm0, %edx
    vmovd    %xmm0, %eax
    rorx    $30, %eax, %eax
    movl    %eax, -16(%rsp)
    rorx    $30, %edx, %ecx
    vpextrd    $1, %xmm0, %eax
    movl    %ecx, -12(%rsp)
    vmovd    -16(%rsp), %xmm3
    vpextrd    $3, %xmm0, %edx
    vmovd    -12(%rsp), %xmm2
    rorx    $30, %eax, %eax
    rorx    $30, %edx, %edx
    vpinsrd    $1, %eax, %xmm3, %xmm1
    vpinsrd    $1, %edx, %xmm2, %xmm0
    vpunpcklqdq    %xmm0, %xmm1, %xmm0

(I am not sure all those extracts/inserts are optimal. I would have expected
one mov from xmm0 to memory, then the scalar rotates with their results
written back to memory, and one final mov back to the vector register, but my
intuition may be wrong.)
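
The sequence described in the parenthetical above can be written out by hand
in C, assuming GCC's vector extensions and a 32-bit unsigned int.
rotl2_via_memory and rotl2_shift_ior are names made up for this sketch. The
first spills the vector once, rotates each element as a scalar, and reloads
once; the second is the shift/ior form from the testcase. Whether a compiler
actually emits a single store and a single load for the first function is an
assumption and has not been checked.

```c
#include <string.h>

typedef unsigned vec __attribute__((vector_size(4*sizeof(int))));

/* Spill once, rotate each 32-bit element left by 2 as a scalar, reload once. */
vec rotl2_via_memory(vec a)
{
  unsigned tmp[4];
  vec r;
  memcpy(tmp, &a, sizeof tmp);                /* one store of the whole vector */
  for (int i = 0; i < 4; i++)
    tmp[i] = (tmp[i] << 2) | (tmp[i] >> 30);  /* scalar rotate left by 2 */
  memcpy(&r, tmp, sizeof r);                  /* one load back into a vector */
  return r;
}

/* Same result using only vector shifts and ior. */
vec rotl2_shift_ior(vec a)
{
  return (a << 2) | (a >> 30);
}
```

Both functions compute the same value for every element, so either form is a
valid lowering of the rotate; they differ only in the code they are likely to
produce.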
