https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

--- Comment #2 from Devin Hussey <husseydevin at gmail dot com> ---
While __builtin_convertvector would improve the situation, the main issue
here is the blindness to some obvious patterns.

If I write this code, I want either pmovzxdq or vmovl. I don't want to
waste time going through scalars on the stack.

U64x2 pmovzxdq(U32x2 v)
{
    return (U64x2) { v[0], v[1] };
}

If I write this code, I want pmuludq or vmull if it can be optimized to
them. I don't want to mask the operands and then do a full 64-bit
multiply.

U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

If I write this, I don't want scalar code on NEON. I want vshl + vsri,
or at the very least vshl + vshr + vorr.

U64x2 vrol64(U64x2 v, int N) /* assumes 0 < N < 64 */
{
    return (v << N) | (v >> (64 - N));
}

Having a generic SIMD overload library built in is awesome, but only if
it saves time. If I can write one block of code that looks like normal C
but is actually optimized vector code running at even 80% of the speed of
specialized intrinsics, regardless of the platform (or of whether the
platform supports SIMD at all), that saves a lot of time, especially when
trying to remember the difference between _mm_mullo and _mm_mul.

If you can write your code so you can do this

#ifdef __GNUC__
typedef unsigned U32x4 __attribute__((vector_size(16)));
#else
typedef unsigned U32x4[4];
#endif

and use them interchangeably with plain ANSI C arrays without worrying
about GCC scalarizing the code, that saves even more time.

If you have to write your code like asm.js, or mix intrinsics with normal
code, just to get something that runs at half the speed of intrinsics,
that is not beneficial.
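
For reference, here is a minimal sketch of how the first pattern could be
written once __builtin_convertvector is available (Clang already
implements it); the U32x2/U64x2 typedefs are my assumption, mirroring the
snippets above:

typedef unsigned U32x2 __attribute__((vector_size(8)));
typedef unsigned long long U64x2 __attribute__((vector_size(16)));

/* Widen each 32-bit lane to 64 bits; this is the operation that should
   lower to pmovzxdq on SSE4.1 or vmovl.u32 on NEON. */
U64x2 widen_u32x2(U32x2 v)
{
    return __builtin_convertvector(v, U64x2);
}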
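
And to illustrate the kind of interchangeable use I mean, a made-up
helper like sum4 compiles unchanged whether U32x4 is the vector type or
the plain array fallback:

#ifdef __GNUC__
typedef unsigned U32x4 __attribute__((vector_size(16)));
#else
typedef unsigned U32x4[4];
#endif

/* Works with both typedefs: v[i] indexes a vector lane under GNU C and
   an array element (via pointer decay) otherwise. */
unsigned sum4(U32x4 v)
{
    unsigned sum = 0;
    for (int i = 0; i < 4; i++)
        sum += v[i];
    return sum;
}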
--- Comment #2 from Devin Hussey <husseydevin at gmail dot com> --- While __builtin_convertvector would improve the situation, the main issue here is the blindness to some obvious patterns. If I write this code, I want either pmovzdq or vmovl. I don't want to waste time with scalar on the stack. U64x2 pmovzdq(U32x2 v) { return (U64x2) { v[0], v[1] }; } If I write this code, I want pmuludq or vmull if it can be optimized to it. I don't want to mask it and do an entire 64-bit multiply. U64x2 pmuludq(U64x2 v1, U64x2 v2) { return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF); } If I do this, I don't want scalar code on NEON. I want vshl + vsri, or at the very least, vshl + vshr + vorr. U64x2 vrol64(U64x2 v, int N) { return (v << N) | (v >> (64 - N)); } Having a generic SIMD overload library built-in is awesome, but only if it saves time. If I can write one block of code that looks like normal C code but it actually optimized vector code that runs at even 80% the speed of specialized intrinsics regardless of the platform (or even if the platform supports SIMD), that saves a lot of time especially when trying to remember the difference between _mm_mullo and _mm_mul. If you can write your code so you can do this #ifdef __GNUC__ typedef unsigned U32x4 __attribute__((vector_size(16))); #else typedef unsigned U32x4[4]; #endif and use them interchangeably with ANSI C arrays without worrying about GCC scalarizing the code, that saves even more time. If you have to write your code like asm.js or mix intrinsics with normal code just to get code that runs at half the speed of intrinsics, that is not beneficial.