> gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
> cat test.c
// file test.c  One byte transfer
void f(char *a, char *b) { *b = *a; }
void F(char *a, char *b) { asm volatile("mov (%rdi),%al\nmov %al,(%rsi)"); }
...
> gcc -g -otest test.c -O2 -mtune=core2
> objdump -d test
....
00000000004004f0 <f>:
  4004f0:	0f b6 07             	movzbl (%rdi),%eax
  4004f3:	88 06                	mov    %al,(%rsi)
  4004f5:	c3                   	retq
  4004f6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  4004fd:	00 00 00

0000000000400500 <F>:
  400500:	8a 07                	mov    (%rdi),%al
  400502:	88 06                	mov    %al,(%rsi)
  400504:	c3                   	retq

GCC uses movzbl (%rdi),%eax, but it would be better to use mov (%rdi),%al, because that instruction is 1 byte shorter (2 bytes instead of 3). Execution time is the same (at least on Core 2 Duo and Core 2 Solo). The movzbl is probably a result of Intel's recommendation to use movz to avoid a partial register stall, but the smaller instruction reduces fetch bandwidth...

Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual (248966), 3.5.2.3 Partial Register Stalls:
"The delay of a partial register stall is small in processors based on Intel Core and NetBurst microarchitectures, and in Pentium M processor (with CPUID signature family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M processors (CPUID signature with family 6, model 9) and the P6 family incur a large penalty."

--
           Summary: Nonoptimal byte load: mov (%rdi),%al better than movzbl (%rdi),%eax
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39549
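For reference, a minimal timing sketch (not part of the original report) along the lines of the "execution time is the same" observation could look like the following. It assumes an x86-64 SysV target so that a and b arrive in %rdi/%rsi for the hand-written F(); the rdtsc helper, the noinline attributes, and the iteration count are arbitrary additions for illustration only.

/* bench.c -- rough comparison of the movzbl and mov %al byte-copy sequences.
 * Sketch only; compile e.g. with: gcc -O2 -mtune=core2 -o bench bench.c */
#include <stdio.h>

/* noinline keeps the calls (and the generated byte loads) in place at -O2. */
__attribute__((noinline)) void f(char *a, char *b) { *b = *a; }   /* GCC emits movzbl */

__attribute__((noinline)) void F(char *a, char *b)
{
    /* Relies on the SysV calling convention: a is in %rdi, b is in %rsi. */
    asm volatile("mov (%rdi),%al\nmov %al,(%rsi)");
}

/* Read the time-stamp counter via inline asm. */
static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    static char src = 'x', dst;
    unsigned long long t0, t1;
    long i;

    t0 = rdtsc();
    for (i = 0; i < 100000000L; i++)
        f(&src, &dst);
    t1 = rdtsc();
    printf("f (movzbl): %llu cycles\n", t1 - t0);

    t0 = rdtsc();
    for (i = 0; i < 100000000L; i++)
        F(&src, &dst);
    t1 = rdtsc();
    printf("F (mov %%al): %llu cycles\n", t1 - t0);

    return 0;
}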