I have the following code:

    struct bounding_box {
        pack4sf m_Mins;
        pack4sf m_Maxs;

        void set(__v4sf v_mins, __v4sf v_maxs) {
            m_Mins = v_mins;
            m_Maxs = v_maxs;
        }
    };

    struct bin {
        bounding_box m_Box[3];
        pack4si m_NL;
        pack4sf m_AL;
    };

    static const std::size_t bin_count = 16;
    bin aBins[bin_count];

    for(std::size_t i = 0; i != bin_count; ++i) {
        bin& b = aBins[i];
        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
        b.m_NL = __v4si{ 0, 0, 0, 0 };
    }

where pack4sf/pack4si are union-based wrappers for __v4sf/__v4si. GCC 4.5 on a Core i7 under Cygwin, with

    -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native -fomit-frame-pointer

completely unrolled the loop into 112 movdqa instructions, which is "a bit" too aggressive. Should I file a bug report? The processor has an 18-instruction prefetch queue and the loop is perfectly predictable by the built-in branch prediction circuitry, so translating it as is (i.e. keeping the loop) would greatly reduce the fetch/decode bandwidth consumed compared to the fully unrolled version. Is there something like "#pragma nounroll" to selectively disable this optimization?

Best regards
Piotr Wyderski
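To illustrate the kind of per-function opt-out I am after: if I read the GCC docs correctly, the optimize function attribute (available since 4.4) accepts individual flag names such as "no-unroll-loops", mirroring -fno-unroll-loops on the command line. A minimal sketch, with clear_bins as a stand-in for my initialization loop (I have not verified that this reliably suppresses the unrolling on this exact setup):

```cpp
#include <cstddef>

// Ask GCC to keep the loop rolled for this one function only.
// "no-unroll-loops" corresponds to the -fno-unroll-loops flag.
__attribute__((optimize("no-unroll-loops")))
void clear_bins(float* data, std::size_t n) {
    // Simplified stand-in for the bin-initialization loop above.
    for (std::size_t i = 0; i != n; ++i)
        data[i] = 0.0f;
}
```

But even if this compiles, is the attribute actually honored for loop unrolling decisions, or does -O3's unroller run regardless?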