http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46599

           Summary: Possible enhancement for inline stringops with -Os
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: other
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: gcc.h...@gmail.com
              Host: Fedora 14
            Target: Core i7
             Build: GCC 4.5.1 20100924


Compiled with GCC 4.5.1 20100924, "-Os -minline-all-stringops", on Core i7:

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* rand */
#include <string.h>  /* memcpy */

int
main( int argc, char *argv[] )
{
  int i, a[256], b[256];

  for( i = 0; i < 256; ++i )  // discourage optimization
    a[i] = rand();

  memcpy( b, a, argc * sizeof(int) );

  printf( "%d\n", b[rand()] );  // discourage optimization

  return 0;
}

I wonder if it's possible to improve the -Os code generation for inline
stringops when the length is known to be a multiple of 4 bytes?

That is, instead of:

    movsx   rcx, ebp    # argc
    sal rcx, 2
    rep movsb

it would be nice to see:

    movsx   rcx, ebp    # argc
    rep movsd
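
The equivalence behind this request can be sketched in C. This is only an
illustration, not GCC's implementation: the helper names copy_as_bytes and
copy_as_dwords are hypothetical, and a 4-byte element size is assumed. The
first helper models "sal rcx, 2; rep movsb", the second models "rep movsd"
with the element count used directly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scale the element count to bytes, then copy one
   byte per iteration (models "sal rcx, 2" followed by "rep movsb"). */
void copy_as_bytes(void *dst, const void *src, size_t n_elems)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t bytes = n_elems * 4;   /* assumes 4-byte elements */

    for (size_t i = 0; i < bytes; ++i)
        d[i] = s[i];
}

/* Hypothetical helper: use the element count as-is and copy 4 bytes per
   iteration (models "rep movsd"). */
void copy_as_dwords(void *dst, const void *src, size_t n_elems)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n_elems; ++i)
        memcpy(d + 4 * i, s + 4 * i, 4);
}
```

Both helpers write the same 4 * n_elems bytes, so when the length is known
to be a multiple of 4 the shift should not be needed.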

Note that memcpy( b, a, 1024 ) generates:

    mov ecx, 256
    rep movsd

This is for -Os which normally emits a movs, not a loop.  The same applies to
stos.

The reason I think this might be possible is as follows.

Using -mstringop-strategy=rep_4byte forces the use of movsd.

For memcpy( b, a, argc * sizeof(int) ) we get:

    movsx   rcx, ebp    # argc
    sal rcx, 2
    cmp rcx, 4
    jb  .L5 #,
    shr rcx, 2
    rep movsd
.L5:


For memcpy( b, a, argc ) we get:

    movsx   rax, ebp    # argc, argc
    mov rdi, rsp    # tmp76,
    lea rsi, [rsp+1024] # tmp77,
    cmp rax, 4  # argc,
    jb  .L3 #,
    mov rcx, rax    # tmp78, argc
    shr rcx, 2  # tmp78,
    rep movsd
.L3:
    xor edx, edx    # tmp80
    test    al, 2   # argc,
    je  .L4 #,
    mov dx, WORD PTR [rsi]  # tmp82,
    mov WORD PTR [rdi], dx  #, tmp82
    mov edx, 2  # tmp80,
.L4:
    test    al, 1   # argc,
    je  .L5 #,
    mov al, BYTE PTR [rsi+rdx]  # tmp85,
    mov BYTE PTR [rdi+rdx], al  #, tmp85
.L5:

In the former case, "memcpy(b, a, argc * sizeof(int))", gcc has omitted all the
code to deal with 1, 2, and 3 bytes, so the stringop code generation has
apparently spotted that the length is a multiple of 4 bytes.
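
As a rough C model of the rep_4byte listing for memcpy( b, a, argc ) above
(my reading of the assembly, not GCC source; the function name
copy_with_tail is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the rep_4byte expansion: whole dwords first,
   then a 2-byte tail if bit 1 of the length is set, then a 1-byte tail
   if bit 0 is set. */
void copy_with_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 4) {                      /* cmp rcx, 4 / jb */
        size_t bytes = (n >> 2) * 4;   /* shr rcx, 2 / rep movsd */
        memcpy(d, s, bytes);
        d += bytes;                    /* rep movsd advances rdi and rsi */
        s += bytes;
    }

    size_t off = 0;                    /* xor edx, edx */
    if (n & 2) {                       /* test al, 2 */
        memcpy(d, s, 2);
        off = 2;
    }
    if (n & 1)                         /* test al, 1 */
        d[off] = s[off];
}
```

When n is a multiple of 4, both (n & 2) and (n & 1) are zero, which matches
the two tail blocks being dropped in the argc * sizeof(int) case.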

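I can see that the code that evaluates the length expression is generated
separately from the stringop expansion, which is presumably why the "sal rcx, 2"
is not cancelled against the later "shr rcx, 2".  GCC does do the right thing
when the length is a literal, though.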

Incidentally, for the second case, memcpy( b, a, argc ), the Visual Studio
compiler generates code like this:

    mov eax, ecx
    shr ecx, 2
    rep movsd
    mov ecx, eax
    and ecx, 3
    rep movsb

which seems cleaner (no jumps) than the GCC code, though knowing GCC there is
probably a good reason for its choice as it generally seems to have a far more
sophisticated optimizer.
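
For comparison, the Visual Studio sequence above corresponds roughly to the
following C. This is a sketch of the strategy only; the function name
copy_dwords_then_bytes is hypothetical, and the loop stands in for
"rep movsb", which needs no explicit branch in the generated code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: copy n / 4 dwords, then the remaining n % 4 bytes. */
void copy_dwords_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t bulk = (n >> 2) * 4;   /* shr ecx, 2 / rep movsd */
    size_t tail = n & 3;          /* and ecx, 3 */

    memcpy(d, s, bulk);
    for (size_t i = 0; i < tail; ++i)   /* rep movsb */
        d[bulk + i] = s[bulk + i];
}
```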
