https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

--- Comment #2 from Devin Hussey <husseydevin at gmail dot com> ---
While __builtin_convertvector would improve the situation, the main issue here
is that GCC is blind to some obvious patterns.

If I write this code, I want either pmovzxdq or vmovl. I don't want to waste
time bouncing scalar values through the stack.

typedef unsigned U32x2 __attribute__((vector_size(8)));
typedef unsigned long long U64x2 __attribute__((vector_size(16)));

U64x2 pmovzxdq(U32x2 v)
{
    return (U64x2) { v[0], v[1] };
}
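
For reference, this is roughly the code generation I have in mind, written with
intrinsics (the SSE4.1/NEON intrinsics below are just my illustration of the
expected lowering, not a request for these exact instructions):

#if defined(__SSE4_1__)
#include <smmintrin.h>
__m128i pmovzxdq_intrin(__m128i v)
{
    return _mm_cvtepu32_epi64(v);  /* zero-extend the low two 32-bit lanes */
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
uint64x2_t vmovl_intrin(uint32x2_t v)
{
    return vmovl_u32(v);           /* widen 2 x u32 to 2 x u64 */
}
#endif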

If I write this code, I want pmuludq or vmull when it can be optimized to them.
I don't want GCC to emit the mask and then a full 64-bit multiply.

U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}
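
Again, a rough sketch of the lowering I would hope for, using intrinsics (the
NEON variant narrows first since the inputs are 64-bit lanes; these are my
assumed equivalents, not necessarily what GCC has to pick):

#if defined(__SSE2__)
#include <emmintrin.h>
__m128i pmuludq_intrin(__m128i v1, __m128i v2)
{
    return _mm_mul_epu32(v1, v2);  /* low 32 bits of each 64-bit lane, full 64-bit products */
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
uint64x2_t vmull_intrin(uint64x2_t v1, uint64x2_t v2)
{
    return vmull_u32(vmovn_u64(v1), vmovn_u64(v2));  /* narrow, then widening multiply */
}
#endif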

If I do this, I don't want scalar code on NEON. I want vshl + vsri, or at the
very least, vshl + vshr + vorr.

U64x2 vrol64(U64x2 v, int N)
{
    return (v << N) | (v >> (64 - N));
}
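
When the rotate count is a compile-time constant (the vshl/vsri immediates have
to be constants), the NEON form I have in mind is something like this sketch,
with an arbitrary example count of 21:

#if defined(__ARM_NEON)
#include <arm_neon.h>
#define ROT 21  /* arbitrary example; must be a constant for the _n_ intrinsics */
uint64x2_t vrol64_neon(uint64x2_t v)
{
    return vsriq_n_u64(vshlq_n_u64(v, ROT), v, 64 - ROT);  /* vshl + vsri */
}
#endif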

Having a generic SIMD abstraction built into the compiler is awesome, but only
if it actually saves time.

If I can write one block of code that looks like normal C but is compiled to
optimized vector code running at even 80% of the speed of specialized
intrinsics, regardless of the platform (or whether the platform supports SIMD
at all), that saves a lot of time, especially when trying to remember the
difference between _mm_mullo and _mm_mul.
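
(As a concrete example of that confusion: _mm_mullo_epi32 and _mm_mul_epu32
look interchangeable but do very different things:)

#include <smmintrin.h>  /* SSE4.1, for _mm_mullo_epi32 */

__m128i mul_lo32(__m128i a, __m128i b)
{
    return _mm_mullo_epi32(a, b);  /* 4 lanes: 32x32, keep only the low 32 bits */
}

__m128i mul_wide(__m128i a, __m128i b)
{
    return _mm_mul_epu32(a, b);    /* 2 lanes: low 32 bits of each 64-bit lane, full 64-bit products */
}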

If you can write your code so you can do this

#ifdef __GNUC__
typedef unsigned U32x4 __attribute__((vector_size(16)));
#else
typedef unsigned U32x4[4];
#endif

and use the vector type interchangeably with plain ANSI C arrays without
worrying about GCC scalarizing the code, that saves even more time.
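
For instance, a helper written with plain indexing ought to compile to decent
vector code under either definition (a minimal sketch; add4 is just a made-up
name here):

/* [] indexing works for both GNU vector types and plain arrays,
   so this builds with either typedef of U32x4 above. */
void add4(U32x4 *dst, const U32x4 *a, const U32x4 *b)
{
    for (int i = 0; i < 4; i++)
        (*dst)[i] = (*a)[i] + (*b)[i];
}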

If you have to write your code like asm.js or mix intrinsics with normal code
just to get code that runs at half the speed of intrinsics, that is not
beneficial.
