> gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]

> cat test.c
// file test.c: one-byte transfer

void f(char *a, char *b) {
    *b = *a;
}

void F(char *a, char *b) {
    /* Relies on the SysV AMD64 ABI: a arrives in %rdi, b in %rsi.
       No constraints or clobbers are declared, so this is only safe
       because the function does nothing else with a, b, or %al. */
    asm volatile("mov (%rdi),%al\nmov %al,(%rsi)");
}
...

> gcc -g -otest test.c -O2 -mtune=core2
> objdump -d test
....
00000000004004f0 <f>:
  4004f0:       0f b6 07                movzbl (%rdi),%eax
  4004f3:       88 06                   mov    %al,(%rsi)
  4004f5:       c3                      retq   
  4004f6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4004fd:       00 00 00 

0000000000400500 <F>:
  400500:       8a 07                   mov    (%rdi),%al
  400502:       88 06                   mov    %al,(%rsi)
  400504:       c3                      retq   

GCC uses movzbl (%rdi),%eax, but mov (%rdi),%al would be better, because the
latter instruction is one byte shorter (2 bytes vs. 3, as the objdump output
above shows). Execution time is the same (at least on Core 2 Duo and
Core 2 Solo).
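
Aside: F above hard-codes the ABI argument registers; it only works because
the SysV AMD64 calling convention passes a in %rdi and b in %rsi and the
function body does nothing else. Here is a minimal sketch of the same
two-instruction sequence with proper GCC operand constraints (the name F2 and
the constraint choices are our illustration, not part of the original test
case):

void F2(char *a, char *b) {
    char t;
    asm volatile("mov (%1), %0\n\t"   /* byte load into a scratch register */
                 "mov %0, (%2)"       /* byte store */
                 : "=&q"(t)           /* q: register with an 8-bit low part;
                                         &: written before inputs are consumed */
                 : "r"(a), "r"(b)
                 : "memory");         /* the asm reads and writes memory */
}

With constraints declared, GCC can allocate registers and schedule the asm
safely at any optimization level instead of depending on the ABI by accident.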

This is probably a result of Intel's recommendation to use movzx to avoid
partial register stalls. But the smaller instruction saves fetch bandwidth,
and on the Core microarchitecture that -mtune=core2 targets, the stall
penalty is small anyway.

Quote from: Intel® 64 and IA-32 Architectures Optimization Reference Manual
(order number 248966), section 3.5.2.3, "Partial Register Stalls":
"The delay of a partial register stall is small in processors based on Intel
Core and
NetBurst microarchitectures, and in Pentium M processor (with CPUID signature
family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M
processors (CPUID signature with family 6, model 9) and the P6 family incur a
large
penalty."


-- 
           Summary: Nonoptimal byte load. mov (%rdi),%al better than movzbl
                    (%rdi),%eax
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39549
