https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91201

--- Comment #6 from Joel Yliluoma <bisqwit at iki dot fi> ---
Maybe "horizontal checksum" is a bit of an obscure term. An 8-bit checksum is
what is being computed, nonetheless. Yes, there are simpler ways to do it…

But I tried a number of different approaches to get maximum-performance SIMD
code out of GCC, and I came upon this curious case, which is what this bug
report is about.

I reported a related bug to another compiler, concerning code that looks like
this:

    #include <string.h>

    unsigned char calculate_checksum(const void* ptr)
    {
        unsigned char bytes[16], result = 0;
        // memcpy is used here because, to my knowledge, it is the
        // only _safe_ way permitted by the standard to convert
        // between representations.
        // Unions, pointer casts, etc. are not safe.
        memcpy(bytes, ptr, 16);
        // Sum all 16 bytes; the unsigned char result wraps modulo 256.
        for(unsigned n=0; n<16; ++n) result += bytes[n];
        return result;
    }

After my report, their compiler now generates:

        vmovdqu xmm0, xmmword ptr [rdi]
        vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
        vpaddb xmm0, xmm0, xmm1
        vpxor xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

For the same code, GCC generates:

        vmovdqu xmm0, XMMWORD PTR [rdi]
        vpsrldq xmm1, xmm0, 8
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 2
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 1
        vpaddb  xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

So the bottom line is that (v)psadbw reductions should be added, as M. Glisse
correctly indicated.
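
For reference, here is a minimal sketch of what a psadbw-based reduction looks
like when written by hand with SSE2 intrinsics. It mirrors the shape of the
other compiler's output above (swap halves, add bytes, sum against zero), but
it is only an illustration, not what either compiler does internally. The
function name checksum_psadbw is made up for this example, and the final
extraction uses a 32-bit move plus a cast rather than vpextrb so that plain
SSE2 is enough.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>

/* Hypothetical helper: 8-bit checksum of 16 bytes via a psadbw reduction. */
unsigned char checksum_psadbw(const void *ptr)
{
    /* Unaligned 16-byte load (movdqu). */
    __m128i v = _mm_loadu_si128((const __m128i *)ptr);

    /* Swap the two 64-bit halves (pshufd with immediate 78). */
    __m128i swapped = _mm_shuffle_epi32(v, 78);

    /* Bytewise add, wrapping modulo 256 (paddb). Each of the low
       8 bytes now holds bytes[i] + bytes[i + 8]. */
    __m128i folded = _mm_add_epi8(v, swapped);

    /* Sum of absolute differences against zero (psadbw): the low
       64-bit lane receives the sum of the low 8 bytes. */
    __m128i sums = _mm_sad_epu8(folded, _mm_setzero_si128());

    /* Keep only the low 8 bits of that sum. */
    return (unsigned char)_mm_cvtsi128_si32(sums);
}
```

Because the checksum is only needed modulo 256, it does not matter that the
paddb step wraps before psadbw widens the partial sums.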
