I have made a curious performance observation with gcc under 64-bit Cygwin on a Core i7. I'm genuinely puzzled and couldn't find any information about it. Perhaps this is only indirectly a gcc question, so bear with me.

I have two trivial programs which assign a loop variable to a local variable 10^8 times. One does it the obvious way; the other accesses the variable through a pointer, which means it must dereference the pointer first. This is reflected nicely in the disassembly snippets of the respective loop bodies below. Funnily enough, the loop with the extra dereferencing runs considerably faster (by more than 10%) than the loop with the direct assignment. While the issue (indeed the whole program ;-) ) goes away with optimization, that may not be so in less trivial scenarios.

My first question is: What makes the smaller code slower?
The gcc question is: Should assignment always be performed through a pointer if it is faster? (Probably not, but why not?) A session transcript including the compilable source is below.

Here are the disassembled loop bodies:

Direct access
=====================================================
        localInt = i;
   1004010e6:   8b 45 fc                mov    -0x4(%rbp),%eax
   1004010e9:   89 45 f8                mov    %eax,-0x8(%rbp)


Pointer access
=====================================================
        *localP = i;
   1004010ee:   48 8b 45 f0             mov    -0x10(%rbp),%rax
   1004010f2:   8b 55 fc                mov    -0x4(%rbp),%edx
   1004010f5:   89 10                   mov    %edx,(%rax)

Note the first instruction, which moves the address into %rax. The other two are similar to the direct assignment above.

Here is a session transcript:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: /cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure --srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2 --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var --sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --disable-__cxa_atexit --with-dwarf2 --with-tune=generic --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite --enable-threads=posix --enable-libatomic --enable-libgomp --disable-libitm --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-libada --enable-libgcj-sublibs --disable-java-awt --disable-symvers --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib
Thread model: posix
gcc version 4.8.2 (GCC)

peter@peter-lap ~/src/test/obj_vs_ptr
$ cat ./t
#!/bin/bash

cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1


peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 100000000; ++i)
        localInt = i;
    return 0;
}
real    0m0.248s
user    0m0.234s
sys     0m0.015s

peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 100000000; ++i)
        *localP = i;
    return 0;
}

real    0m0.215s
user    0m0.203s
sys     0m0.000s
