I have made a curious performance observation with gcc under 64-bit
Cygwin on a Core i7. I'm genuinely puzzled and couldn't find any
information about it. Perhaps this is only indirectly a gcc question,
but bear with me.
I have two trivial programs which assign a loop variable to a local
variable 10^8 times. One does it the obvious way; the other accesses
the variable through a pointer, which means it must load the
pointer first. This is reflected nicely in the disassembly snippets of
the respective loop bodies below. Funnily enough, the loop with the extra
dereferencing runs considerably faster than the loop with the direct
assignment (by more than 10% in the timings below). While the issue
(indeed the whole program ;-) ) goes away with optimization, in less
trivial scenarios that may not be so.
My first question is: What makes the smaller code slower?
The gcc question is: Should assignment always be performed through a
pointer if it is faster? (Probably not, but why not?) A session
transcript including the compilable source is below.
Here are the disassembled loop bodies:
Direct access
=====================================================
    localInt = i;
   1004010e6:   8b 45 fc        mov    -0x4(%rbp),%eax
   1004010e9:   89 45 f8        mov    %eax,-0x8(%rbp)
Pointer access
=====================================================
    *localP = i;
   1004010ee:   48 8b 45 f0     mov    -0x10(%rbp),%rax
   1004010f2:   8b 55 fc        mov    -0x4(%rbp),%edx
   1004010f5:   89 10           mov    %edx,(%rax)
Note the first instruction, which moves the address into %rax. The other
two are similar to the direct assignment above.
Here is a session transcript:
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with:
/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure
--srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2
--prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin
--libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var
--sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share
--docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C
--build=x86_64-pc-cygwin --host=x86_64-pc-cygwin
--target=x86_64-pc-cygwin --without-libiconv-prefix
--without-libintl-prefix --enable-shared --enable-shared-libgcc
--enable-static --enable-version-specific-runtime-libs
--enable-bootstrap --disable-__cxa_atexit --with-dwarf2
--with-tune=generic
--enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite
--enable-threads=posix --enable-libatomic --enable-libgomp
--disable-libitm --enable-libquadmath --enable-libquadmath-support
--enable-libssp --enable-libada --enable-libgcj-sublibs
--disable-java-awt --disable-symvers
--with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as
--with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix
--without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib
Thread model: posix
gcc version 4.8.2 (GCC)
peter@peter-lap ~/src/test/obj_vs_ptr
$ cat ./t
#!/bin/bash
cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1
peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 100000000; ++i)
        localInt = i;
    return 0;
}
real 0m0.248s
user 0m0.234s
sys 0m0.015s
peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 100000000; ++i)
        *localP = i;
    return 0;
}
real 0m0.215s
user 0m0.203s
sys 0m0.000s
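
A single run under time of roughly 0.2 s is sensitive to scheduling and
start-up noise, so one way to double-check the difference would be to time
both loop bodies several times within one process and keep the fastest run.
The following is only a sketch under assumptions: the names direct_loop,
pointer_loop and best_of are made up for illustration, and moving the loops
into functions changes the stack layout and code alignment compared with
obj.c and ptr.c, so the numbers need not match the standalone programs. It is
meant to be built with -O0, like the originals.

```c
#include <time.h>

/* Same loop body as obj.c: store i directly into a local. */
int direct_loop(int n)
{
    int localInt = 0;
    for (int i = 0; i < n; ++i)
        localInt = i;
    return localInt;            /* last value stored, or 0 if n <= 0 */
}

/* Same loop body as ptr.c: store i through a pointer to the local. */
int pointer_loop(int n)
{
    int localInt = 0;
    int *localP = &localInt;
    for (int i = 0; i < n; ++i)
        *localP = i;
    return localInt;            /* last value stored, or 0 if n <= 0 */
}

/* Hypothetical helper: run loop(n) `runs` times and return the fastest
 * CPU time in seconds as measured by clock(); returns -1.0 if runs <= 0. */
double best_of(int (*loop)(int), int n, int runs)
{
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        clock_t start = clock();
        (void)loop(n);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

A main that calls best_of(direct_loop, 100000000, 5) and
best_of(pointer_loop, 100000000, 5) and prints both results would repeat the
comparison above; if the gap persists across runs, it is less likely to be
measurement noise.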