If you compile the attached code with optimization using GCC 4.1.x, the compiler generates a store into a stack temporary in the middle of the loop, and the stored value is never read. If you compile the code with -DUSE_MACRO, so that it uses macros instead of inline functions, the compiler generates the correct code without the extra store. The bug is still present in the 4.3 mainline, as of a compiler built on March 30th.
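The attachment is not reproduced here. As a rough illustration of the kind of code being described, the following is a minimal sketch, not the original test case: the function names, the loop, and the choice of the SSE2 intrinsic `_mm_add_pd` as a stand-in for the x86_64 builtin are all assumptions made for the example. It only shows the two code shapes selected by -DUSE_MACRO and is not claimed to trigger the extra store.

```cpp
// Hypothetical reduction of the pattern described above; NOT the attached test case.
#include <emmintrin.h>  // SSE2 intrinsics, always available on x86_64
#include <cassert>
#include <cstddef>

#ifdef USE_MACRO
// Macro variant: the intrinsic is expanded directly at the use site.
#define ADD_PD(a, b) _mm_add_pd((a), (b))
#else
// Inline-function variant: the intrinsic is wrapped in a user inline function.
static inline __m128d add_pd(__m128d a, __m128d b)
{
    return _mm_add_pd(a, b);
}
#define ADD_PD(a, b) add_pd((a), (b))
#endif

// Sums n doubles two at a time; n is assumed to be even.
double sum_pairs(const double *p, std::size_t n)
{
    assert(n % 2 == 0);
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2)
        acc = ADD_PD(acc, _mm_loadu_pd(p + i));  // loop body of interest

    double out[2];
    _mm_storeu_pd(out, acc);  // spill the two lanes once, after the loop
    return out[0] + out[1];
}
```

To compare the two variants, one could compile the file twice with `g++ -O2 -S`, once with and once without `-DUSE_MACRO`, and diff the loop bodies in the generated assembly for stores to the stack.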
--
Summary: Interaction between x86_64 builtin function and inline functions causes poor code
Product: gcc
Version: 4.1.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: michael dot meissner at amd dot com
GCC build triplet: x86_64-redhat-linux
GCC host triplet: x86_64-redhat-linux
GCC target triplet: x86_64-redhat-linux

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31307