http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59967

            Bug ID: 59967
           Summary: Performance regression from 4.7.x to 4.8.x (loop not
                    unrolled)
           Product: gcc
           Version: 4.8.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chbreitkopf at gmail dot com

Created attachment 31967
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31967&action=edit
preprocessed source of ray/src/rt/ambient.c

gcc 4.8.x generates code that is 10-15% slower than the code from 4.7.x for
Mark Stock's radiance benchmark (http://markjstock.org/pages/rad_bench.html).

I observed this regression on Linux x86_64 with several different CPUs (Ivy
Bridge, Haswell, AMD Phenom, Kaveri). I initially suspected the new register
allocator, but the actual cause is a difference in loop unrolling.

The hotspot is the pair of nested loops, with the recursive call, at the end
of the sumambient() function. With -Ofast, gcc 4.7.x unrolls the outer loop,
which opens up further optimization opportunities in the inner loop. gcc 4.8.x
does not unroll the outer loop, and adding -funroll-loops does not change that.
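
To illustrate the loop shape in question, here is a minimal sketch. It is NOT
the code from ambient.c (see the attachment for that): Node, sum_rolled(),
sum_unrolled() and OCTANT() are hypothetical names, and the bit test on the
outer index is only an assumed example of something the compiler can fold once
the outer loop is unrolled. The second function unrolls the outer loop by
hand, so each copy of the body sees a constant octant index.

```c
#include <stddef.h>

/* Hypothetical octree node, for illustration only. */
typedef struct Node {
    struct Node *kid[8];  /* child octants, NULL where absent */
    double weight;        /* value accumulated from this node */
} Node;

/* Rolled form: the octant index i is a run-time value, so the bit test
 * in the inner loop has to be evaluated on every iteration. */
double sum_rolled(const Node *n, const double p[3],
                  const double c0[3], double s)
{
    double sum;
    int i, j;

    if (n == NULL)
        return 0.0;
    sum = n->weight;
    s *= 0.5;                       /* child cubes have half the edge */
    for (i = 0; i < 8; i++) {
        double ck[3];               /* corner of child octant i */
        for (j = 0; j < 3; j++) {
            ck[j] = c0[j];
            if ((1 << j) & i)
                ck[j] += s;
            if (p[j] < ck[j] || p[j] > ck[j] + s)
                break;              /* p lies outside this octant */
        }
        if (j == 3)                 /* all three axes passed the test */
            sum += sum_rolled(n->kid[i], p, ck, s);
    }
    return sum;
}

/* One copy of the outer loop body for a constant octant index I.
 * Each ((I) & mask) test is now a compile-time constant. */
#define OCTANT(I)                                              \
    do {                                                       \
        double ck[3];                                          \
        ck[0] = c0[0] + (((I) & 1) ? s : 0.0);                 \
        ck[1] = c0[1] + (((I) & 2) ? s : 0.0);                 \
        ck[2] = c0[2] + (((I) & 4) ? s : 0.0);                 \
        if (p[0] >= ck[0] && p[0] <= ck[0] + s &&              \
            p[1] >= ck[1] && p[1] <= ck[1] + s &&              \
            p[2] >= ck[2] && p[2] <= ck[2] + s)                \
            sum += sum_unrolled(n->kid[I], p, ck, s);          \
    } while (0)

/* Hand-unrolled form: same traversal order and same result as
 * sum_rolled(), with the outer loop replaced by eight copies. */
double sum_unrolled(const Node *n, const double p[3],
                    const double c0[3], double s)
{
    double sum;

    if (n == NULL)
        return 0.0;
    sum = n->weight;
    s *= 0.5;
    OCTANT(0); OCTANT(1); OCTANT(2); OCTANT(3);
    OCTANT(4); OCTANT(5); OCTANT(6); OCTANT(7);
    return sum;
}
#undef OCTANT
```

Both functions visit the children in the same order and should return the same
sum. Only the second form exposes the constant index to the optimizer, which
is roughly the effect outer-loop unrolling would be expected to have here.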
