[Bug target/39821] 120% slowdown with vectorizer

crazylht at gmail dot com via Gcc-bugs Tue, 27 Jul 2021 22:36:33 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821


--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #8)
> I've pushed the change that makes us run into ix86_multiplication_cost but
> as said that doesn't differentiate highpart or widening multiply yet and
> thus we're now missing optimizations because of too conservative costing.

For MULT_HIGHPART_EXPR, x86 only has pmulhw, so it's probably OK to go through
ix86_multiplication_cost.

For WIDEN_MULT_EXPR, we need a separate cost function that also accepts
sign info, since pmuludq is available under SSE2 but pmuldq only under SSE4.1.


I.e., we should vectorize udotproduct under SSE2, but sdotproduct under SSE4.1:

#include<stdint.h>
uint64_t udotproduct(uint32_t *v1, uint32_t *v2, int order)
{
    uint64_t accum = 0;
    while (order--)
        accum += (uint64_t) *v1++ * *v2++;
    return accum;
}

#include<stdint.h>
int64_t sdotproduct(int32_t *v1, int32_t *v2, int order)
{
    int64_t accum = 0;
    while (order--)
        accum += (int64_t) *v1++ * *v2++;
    return accum;
}
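
For illustration, here is a hand-written sketch of roughly what an SSE2
vectorization of udotproduct could look like using pmuludq (via the
_mm_mul_epu32 intrinsic). This is an assumption about the shape of the code,
not what GCC actually emits; the function name udotproduct_sse2 is
hypothetical, and the scalar fallback is there only so the sketch builds on
targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Illustrative only: hypothetical hand-vectorized unsigned dot product. */
uint64_t udotproduct_sse2(const uint32_t *v1, const uint32_t *v2, int order)
{
    uint64_t accum = 0;
    int i = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= order; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(v1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(v2 + i));
        /* pmuludq: unsigned 32x32 -> 64 multiply of elements 0 and 2. */
        __m128i even = _mm_mul_epu32(a, b);
        /* Shift elements 1 and 3 down so pmuludq can multiply them too. */
        __m128i odd = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                    _mm_srli_epi64(b, 32));
        acc = _mm_add_epi64(acc, _mm_add_epi64(even, odd));
    }
    /* Horizontal add of the two 64-bit partial sums. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    accum = lanes[0] + lanes[1];
#endif
    /* Scalar tail (and the whole loop when SSE2 is unavailable). */
    for (; i < order; i++)
        accum += (uint64_t)v1[i] * v2[i];
    return accum;
}
```

The signed variant has no direct SSE2 counterpart: the equivalent signed
widening multiply, pmuldq (_mm_mul_epi32), needs SSE4.1, which is why the cost
function would have to know the signedness.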
