Version info:

$ powerpc-apple-darwin8.11.0-gcc-4.4.0 -v
Using built-in specs.
Target: powerpc-apple-darwin8.11.0
Configured with: ../gcc-4.4-20090116/configure --prefix=/opt/local --enable-languages=c,c++,objc,obj-c++ --libdir=/opt/local/lib/gcc44 --includedir=/opt/local/include/gcc44 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.4 --with-gxx-include-dir=/opt/local/include/gcc44/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local --disable-multilib
Thread model: posix
gcc version 4.4.0 20090116 (experimental) (GCC)
The above is a MacPorts (formerly DarwinPorts) build of GCC 4.4.0 on a Mac OS X 10.4.11 host with a PowerPC 7450 CPU. The following C++ function produces different code depending on which of the 'loop_assignment_ai' and 'flat_assignment_ai' snippets is used:

#include <stdio.h>

inline static void mmul(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    // iterate by product's rows
    for (unsigned i = 0; i < 4; i++)
    {
        register float ai[4][4];

        // swizzle each element of the i-th row of A into a full vector
        for (unsigned j = 0; j < 4; j++)
            // flat_assignment_ai:
            /* ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j]; */
            // loop_assignment_ai:
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];

        // multiply the first element of the i-th row of A by the first row of B
        for (unsigned k = 0; k < 4; k++)
        {
            c[i][k] = ai[0][k] * b[0][k];
        }

        // multiply-add all subsequent elements of the i-th row of A by the respective rows of B
        for (unsigned j = 1; j < 4; j++)
        {
            for (unsigned k = 0; k < 4; k++)
            {
                c[i][k] += ai[j][k] * b[j][k];
            }
        }
    }
}

// function invoked with following parameters (statics)
float a[4][4] __attribute__ ((aligned (16)));
float b[4][4] __attribute__ ((aligned (16)));
float c[4][4] __attribute__ ((aligned (16)));

int main(int argc, char * const argv[])
{
    // omitted here is assignment of sample test values to arguments a & b

    unsigned ndz; // non-deterministic zero

    printf("enter a zero: ");

    if (1 != scanf("%u", &ndz)) // user expected to punch in a zero here
        return -1;

    // non-deterministic const factor: it is meant to be zero, but the
    // compiler does not know that, so it can't declare our loop 'redundant'
    const unsigned ndf = ndz ? 1 : 0;

    unsigned r = 10000000;

    do
    {
        mmul(*(&c + ndf * r), *(&a + ndf * r), *(&b + ndf * r));
    }
    while (--r);

    return r;
}

I observe a performance degradation of roughly 10% when using 'loop_assignment_ai' instead of 'flat_assignment_ai'. The differences in the generated PPC code appear to be mainly in instruction scheduling.
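Since the two snippets are meant to be semantically identical, any timing difference should come from code generation alone. A minimal sketch of how that equivalence could be checked is given here. It is not part of the original test case: the names `mmul_variant` and `variants_agree` and the `UseLoop` template parameter are hypothetical, `register` is dropped because newer C++ standards no longer accept it, and the inputs are small integers so that every product and sum is exact in single precision.

```cpp
#include <cstring>

// Same arithmetic as mmul() in the report; the (hypothetical) template
// parameter selects how a[i][j] is replicated across a row of ai.
template <bool UseLoop>
static void mmul_variant(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    for (unsigned i = 0; i < 4; i++)
    {
        float ai[4][4];

        for (unsigned j = 0; j < 4; j++)
        {
            if (UseLoop)
            {
                // loop_assignment_ai
                for (unsigned k = 0; k < 4; k++)
                    ai[j][k] = a[i][j];
            }
            else
            {
                // flat_assignment_ai
                ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
            }
        }

        // first element of row i of A times first row of B
        for (unsigned k = 0; k < 4; k++)
            c[i][k] = ai[0][k] * b[0][k];

        // multiply-add the remaining elements by the respective rows of B
        for (unsigned j = 1; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                c[i][k] += ai[j][k] * b[j][k];
    }
}

// Hypothetical helper: runs both variants on the same small-integer
// inputs and compares the two products byte for byte.
static bool variants_agree()
{
    float a[4][4], b[4][4], c_loop[4][4], c_flat[4][4];

    for (unsigned i = 0; i < 4; i++)
    {
        for (unsigned j = 0; j < 4; j++)
        {
            a[i][j] = float(i * 4 + j + 1);
            b[i][j] = float((i + 1) * (j + 2));
        }
    }

    mmul_variant<true>(c_loop, a, b);
    mmul_variant<false>(c_flat, a, b);

    return std::memcmp(c_loop, c_flat, sizeof c_loop) == 0;
}
```

With the two variants confirmed to agree, each instantiation could be substituted for mmul() in the timing loop above to compare them within a single build.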
The following optimization-related compiler options were used for the test:

-fno-exceptions -fno-rtti -faltivec -maltivec -mtune=7450 -O3 -funroll-loops -ffast-math -fstrict-aliasing -ftree-vectorize -ftree-vectorizer-verbose=3 -fvisibility-inlines-hidden -fno-threadsafe-statics

For the record, the intended vectorization fails, so the resulting code is entirely scalar.

-martin

--
Summary: gcc 4.4.0 20090116 loop unrolling causes unaccountable performance degradation
Product: gcc
Version: 4.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: blu dot dark at gmail dot com
GCC build triplet: powerpc-apple-darwin8.11.0
GCC host triplet: powerpc-apple-darwin8.11.0
GCC target triplet: powerpc-apple-darwin8.11.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39046