I have the following code:

    struct bounding_box {

        pack4sf m_Mins;
        pack4sf m_Maxs;

        void set(__v4sf v_mins, __v4sf v_maxs) {

            m_Mins = v_mins;
            m_Maxs = v_maxs;
        }
    };

    struct bin {

        bounding_box m_Box[3];
        pack4si      m_NL;
        pack4sf      m_AL;
    };

    static const std::size_t bin_count = 16;
    bin aBins[bin_count];

    for(std::size_t i = 0; i != bin_count; ++i) {

        bin& b = aBins[i];

        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
        b.m_NL = __v4si{ 0, 0, 0, 0 };
    }

where pack4sf and pack4si are union-based wrappers around GCC's
__v4sf/__v4si vector types.
GCC 4.5 on Core i7/Cygwin with

-O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
-fomit-frame-pointer

completely unrolled the loop into 112 movdqa instructions
(16 bins x 7 stores each), which is "a bit" too aggressive.
Should I file a bug report? The processor has an 18-instruction
prefetch queue and the loop is perfectly predictable by the
built-in branch prediction circuitry, so translating it as is
would put far less pressure on fetch/decode bandwidth.
Is there something like "#pragma nounroll" to selectively
disable this optimization?

Best regards
Piotr Wyderski
