[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

--- Comment #4 from Devin Hussey ---
I am deciding to refer to goodmul as ssemul from now on; I think it is a better name.

I am also wondering whether AArch64 gets a benefit from this vs. scalarizing if the value is already in a NEON register. I don't have an AArch64 device to test on. For reference, I use an LG G3 with a Snapdragon 801 (Cortex-A15) underclocked to 4 cores @ 1.7 GHz.

I also did some testing, and twomul is also fastest if a value can be set up outside of the loop (e.g. a constant). ssemul is only fastest if either both operands can be interleaved beforehand, or the high or low bits are known to be zero, in which case it can be simplified.

For example, the xxHash64 routine, which looks like this:

    const U8 *p;
    const U8 *limit = p + len - 31;
    U64x2 v[2];
    ...
    do { // Actually unrolled
        for (int i = 0; i < 2; i++) {
            // Load (U8 load because alignment is dumb)
            U64x2 inp = vreinterpretq_u64_u8(vld1q_u8(p));
            p += 16;
            v[i] += inp * PRIME64_2;
            v[i] = (v[i] << 31) | (v[i] >> (64 - 31));
            v[i] *= PRIME64_1;
        }
    } while (p < limit);

seems to be fastest when implemented like this:

    // Wordswap and separate low bits for twomul
    const U64x2 prime1Base = vdupq_n_u64(PRIME64_1);
    const U32x2 prime1Lo = vmovn_u64(prime1Base);
    const U32x4 prime1Rev = vrev64q_u32(vreinterpretq_u32_u64(prime1Base));
    // Interleave for ssemul
    _Alignas(16) const U64 PRIME2[2] = { PRIME64_2, PRIME64_2 };
    const U32x2x2 prime2 = vld2_u32((const U32 *)__builtin_assume_aligned(PRIME2, 16));
    U64x2 v[2];
    do { // actually unrolled
        for (int i = 0; i < 2; i++) {
            // Interleaved load
            U32x2x2 inp = vld2_u32((const U32 *)p);
            p += 16;
            // ssemul
            // val = (U64x2)inpLo * (U64x2)prime2Hi;
            U64x2 val = vmull_u32(inp.val[0], prime2.val[1]);
            // val += (U64x2)inpHi * (U64x2)prime2Lo;
            val = vmlal_u32(val, inp.val[1], prime2.val[0]);
            // val <<= 32;
            val = vshlq_n_u64(val, 32);
            // val += (U64x2)inpLo * (U64x2)prime2Lo;
            val = vmlal_u32(val, inp.val[0], prime2.val[0]);
            // end ssemul
            // Add
            v[i] = vaddq_u64(v[i], val);
            // Rotate left
            v[i] = vsriq_n_u64(vshlq_n_u64(v[i], 31), v[i], 33);
            // twomul
            // topLo = v[i] & 0xFFFFFFFF;
            U32x2 topLo = vmovn_u64(v[i]);
            // top = (U32x4)v[i];
            U32x4 top = vreinterpretq_u32_u64(v[i]);
            // prod = {
            //     topLo * prime1Hi,
            //     topHi * prime1Lo
            // };
            U32x4 prod = vmulq_u32(top, prime1Rev);
            // prod64 = (U64x2)prod[0] + (U64x2)prod[1];
            U64x2 prod64 = vpaddlq_u32(prod);
            // prod64 <<= 32;
            prod64 = vshlq_n_u64(prod64, 32);
            // prod64 += (U64x2)topLo * (U64x2)prime1Lo;
            prod64 = vmlal_u32(prod64, topLo, prime1Lo);
            // end twomul
        }
    } while (p < limit);

As you can see, since we can do an interleaved load on p, ssemul is fastest for that multiply; but since v is used for more than just multiplication, we use twomul for it.

On my G3 in Termux with the xxhsum 100 KB benchmark, this gets to 2.65 GB/s, compared to 0.8 GB/s scalar and 2.24 GB/s with both multiplies using ssemul. However, this was compiled with Clang. For some reason, even though I see no major differences in the assembly, GCC consistently produces code at roughly 80% of the performance of Clang. But that is mostly an algorithm thing and isn't important. Considering that this is 64-bit arithmetic on a 32-bit device, that is pretty good.
[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

Devin Hussey changed:

           What    |Removed |Added
           CC      |        |husseydevin at gmail dot com

--- Comment #4 from Devin Hussey ---
Strangely, this doesn't seem to affect the ARM or aarch64 backends, although I am on a December build (specifically Dec 29). 8.2 is also unaffected.

arm-none-eabi-gcc -mfloat-abi=hard -mfpu=neon -march=armv7-a -O3 -S test.c

test:
        vldmia   r1, {d0-d7}
        vldmia   r2, {d24-d31}
        vadd.i32 q8, q0, q12
        vadd.i32 q9, q1, q13
        vadd.i32 q10, q2, q14
        vadd.i32 q11, q3, q15
        vstmia   r0, {d16-d23}
        bx       lr

aarch64-none-eabi-gcc -O3 -S test.c

test:
        ld1     {v16.16b - v19.16b}, [x1]
        ld1     {v4.16b - v7.16b}, [x2]
        add     v0.4s, v16.4s, v4.4s
        add     v1.4s, v17.4s, v5.4s
        add     v2.4s, v18.4s, v6.4s
        add     v3.4s, v19.4s, v7.4s
        st1     {v0.16b - v3.16b}, [x0]
        ret

Amusingly, Clang trunk for ARMv7-a has a similar issue (aarch64 is fine).

test:
        .fnstart
        .save   {r11, lr}
        push    {r11, lr}
        add     r3, r1, #48
        mov     lr, r1
        mov     r12, r2
        vld1.64 {d20, d21}, [r3]
        add     r3, r2, #48
        add     r1, r1, #32
        vld1.32 {d16, d17}, [lr]!
        vld1.32 {d18, d19}, [r12]!
        vadd.i32 q8, q9, q8
        vld1.64 {d22, d23}, [r3]
        vadd.i32 q10, q11, q10
        vld1.64 {d26, d27}, [r1]
        add     r1, r2, #32
        vld1.64 {d28, d29}, [r1]
        add     r1, r0, #48
        vadd.i32 q11, q14, q13
        vld1.64 {d24, d25}, [lr]
        vld1.64 {d18, d19}, [r12]
        vadd.i32 q9, q9, q12
        vst1.64 {d20, d21}, [r1]
        add     r1, r0, #32
        vst1.32 {d16, d17}, [r0]!
        vst1.64 {d22, d23}, [r1]
        vst1.64 {d18, d19}, [r0]
        pop     {r11, pc}
[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

--- Comment #9 from Devin Hussey ---
(In reply to Andrew Pinski from comment #6)
> Try using 128 (or 256) and you might see that aarch64 falls down similarly.

yup. Oof.

test:
        sub     sp, sp, #560
        stp     x29, x30, [sp]
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        mov     x19, 128
        mov     x20, x0
        add     x0, sp, 176
        str     x21, [sp, 32]
        mov     x21, x2
        mov     x2, x19
        bl      memcpy
        mov     x2, x19
        mov     x1, x21
        add     x0, sp, 304
        bl      memcpy
        ldr     q7, [sp, 176]
        mov     x2, x19
        ldr     q6, [sp, 192]
        add     x1, sp, 48
        ldr     q5, [sp, 208]
        mov     x0, x20
        ldr     q4, [sp, 224]
        ldr     q3, [sp, 240]
        ldr     q2, [sp, 256]
        ldr     q1, [sp, 272]
        ldr     q0, [sp, 288]
        ldr     q23, [sp, 304]
        ldr     q22, [sp, 320]
        ldr     q21, [sp, 336]
        ldr     q20, [sp, 352]
        ldr     q19, [sp, 368]
        ldr     q18, [sp, 384]
        ldr     q17, [sp, 400]
        ldr     q16, [sp, 416]
        add     v7.4s, v7.4s, v23.4s
        add     v6.4s, v6.4s, v22.4s
        add     v5.4s, v5.4s, v21.4s
        add     v4.4s, v4.4s, v20.4s
        add     v3.4s, v3.4s, v19.4s
        str     q7, [sp, 48]
        add     v2.4s, v2.4s, v18.4s
        str     q6, [sp, 64]
        add     v1.4s, v1.4s, v17.4s
        str     q5, [sp, 80]
        add     v0.4s, v0.4s, v16.4s
        str     q4, [sp, 96]
        str     q3, [sp, 112]
        str     q2, [sp, 128]
        str     q1, [sp, 144]
        str     q0, [sp, 160]
        bl      memcpy
        ldp     x29, x30, [sp]
        ldp     x19, x20, [sp, 16]
        ldr     x21, [sp, 32]
        add     sp, sp, 560
        ret
[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

--- Comment #10 from Devin Hussey ---
I also want to add that aarch64 shouldn't even be spilling: it has 32 NEON registers, and with 128-byte vectors (eight q registers per operand) the two inputs plus the result should only use 24 of them.
[Bug regression/93418] New: GCC incorrectly constant propagates _mm_sllv/srlv/srav
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

            Bug ID: 93418
           Summary: GCC incorrectly constant propagates _mm_sllv/srlv/srav
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: regression
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Regression starting in GCC 9.

Currently, GCC constant propagates the AVX2 _mm_sllv family with constant shift amounts to only shift by the first element instead of shifting each element individually.

#include <immintrin.h>
#include <stdio.h>

// force -O0
__attribute__((__optimize__("-O0")))
void unoptimized()
{
    __m128i vals = _mm_set1_epi32(0x);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n",
           _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1),
           _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

// force -O3
__attribute__((__optimize__("-O3")))
void optimized()
{
    __m128i vals = _mm_set1_epi32(0x);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n",
           _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1),
           _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

int main()
{
    printf("Without optimizations (correct result):\t");
    unoptimized();
    printf("With optimizations (incorrect result):\t");
    optimized();
}

I would expect both calls to print the same values:

Without optimizations (correct result): 8000 fff8
With optimizations (incorrect result):  8000 fff8

Clang and GCC < 9 produce that output. However, on GCC 9.1 and later, the optimized call prints something different:

Without optimizations (correct result): 8000 fff8
With optimizations (incorrect result):

Godbolt link: https://gcc.godbolt.org/z/oC3Psp
[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418 --- Comment #3 from Devin Hussey --- I think I found the culprit commit. Haven't set up a GCC build tree yet, though. https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b
[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

Devin Hussey changed:

           What    |Removed |Added
           Build   |        |2020-01-24 0:00

--- Comment #5 from Devin Hussey ---
Finally got GCC to build after it was throwing a fit.

I can confirm that the regression is in that commit. g:28a8a768ebef5e31f950013f1b48b14c008b4b3b works correctly, g:6a03477e85e1b097ed6c0b86c76436de575aef04 does not.
[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

--- Comment #8 from Devin Hussey ---
Seems to work.

~ $ ~/gcc-test/bin/x86_64-pc-cygwin-gcc.exe -mavx2 -O3 _mm_sllv_bug.c
~ $ ./a.exe
Without optimizations (correct result): 8000 fff8
With optimizations (incorrect result):  8000 fff8
~ $

And checking the assembly, the shifts are constant propagated. The provided test file also passes.
[Bug target/88255] New: Thumb-1: GCC too aggressive on mul->lsl/sub/add optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88255

            Bug ID: 88255
           Summary: Thumb-1: GCC too aggressive on mul->lsl/sub/add optimization
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

I might be wrong, but it appears that GCC is too aggressive in its conversion from multiplication to shift+add when targeting Thumb-1.

It is true that, for example, the Cortex-M0 can have the small multiplier, and there a 16-cycle shift sequence would be faster. However, I was targeting arm7tdmi (-march=armv4t -mthumb -O3 -mtune=arm7tdmi) which, if I am not mistaken, takes one cycle for every 8 bits of the multiplier:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0234b/i102180.html

However, looking in the source code, I notice that the loop is dividing by 4. I think it might be a bug that is causing the otherwise 7 (I think) cycle sequence for the code below to be treated as having a weight of 18 cycles:
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm.c#L8959

I could be wrong, but one of the things I noticed is that very old versions of GCC (2.95) will not perform this many shifts, and that Clang, when given the transpiled output in C and targeted for the same platform, will actually convert it back into an ldr/mul. However, when targeting cortex-m0plus.small-multiply, it will still turn it into multiplication.

Code example:

unsigned MultiplyByPrime(unsigned val)
{
    return val * 2246822519U;
}

MultiplyByPrime:
        lsls    r3, r0, #7    @ unsigned ret = val << 7;
        subs    r3, r3, r0    @ ret -= val;
        lsls    r3, r3, #5    @ ret <<= 5;
        subs    r3, r3, r0    @ ret -= val;
        lsls    r3, r3, #2    @ ret <<= 2;
        adds    r3, r3, r0    @ ret += val;
        lsls    r2, r3, #3    @ unsigned tmp = ret << 3;
        adds    r3, r3, r2    @ ret += tmp;
        lsls    r3, r3, #1    @ ret <<= 1;
        adds    r3, r3, r0    @ ret += val;
        lsls    r3, r3, #6    @ ret <<= 6;
        adds    r3, r3, r0    @ ret += val;
        lsls    r2, r3, #4    @ tmp = ret << 4;
        subs    r3, r2, r3    @ ret = tmp - ret;
        lsls    r3, r3, #3    @ ret <<= 3;
        subs    r0, r3, r0    @ ret -= val;
        bx      lr            @ return ret;
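For comparison, here is a sketch of the multiply form I would expect; the cycle estimates are rough numbers taken from the ARM7TDMI documentation linked above, not measured output, and the comment only describes hoped-for codegen:

/* Hoped-for codegen sketch (not verified compiler output):
 *     MultiplyByPrime:
 *         ldr  r3, .Lprime   @ literal-pool load, about 3 cycles
 *         muls r0, r3, r0    @ early-terminating multiply, roughly 1 cycle per 8 significant multiplier bits
 *         bx   lr
 * which lands near the ~7 cycles estimated above, versus the 16-instruction
 * shift/add chain GCC currently emits. */
unsigned MultiplyByPrime_expected(unsigned val)
{
    return val * 2246822519U;
}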
[Bug target/88510] New: GCC generates inefficient U64x2 scalar multiply for NEON32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

            Bug ID: 88510
           Summary: GCC generates inefficient U64x2 scalar multiply for NEON32
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Note: I use these typedefs here for brevity.

typedef uint64x2_t U64x2;
typedef uint32x2_t U32x2;
typedef uint32x2x2_t U32x2x2;
typedef uint32x4_t U32x4;

GCC and Clang both have issues with this code on ARMv7-a NEON, and will switch to scalar:

U64x2 multiply(U64x2 top, U64x2 bot)
{
    return top * bot;
}

gcc-8 -mfloat-abi=hard -mfpu=neon -O3 -S -march=armv7-a

multiply:
        push    {r4, r5, r6, r7, lr}
        sub     sp, sp, #20
        vmov    r0, r1, d0  @ v2di
        vmov    r6, r7, d2  @ v2di
        vmov    r2, r3, d1  @ v2di
        vmov    r4, r5, d3  @ v2di
        mul     lr, r0, r7
        mla     lr, r6, r1, lr
        mul     ip, r2, r5
        umull   r0, r1, r0, r6
        mla     ip, r4, r3, ip
        add     r1, lr, r1
        umull   r2, r3, r2, r4
        strd    r0, [sp]
        add     r3, ip, r3
        strd    r2, [sp, #8]
        vld1.64 {d0-d1}, [sp:64]
        add     sp, sp, #20
        pop     {r4, r5, r6, r7, pc}

Clang's output is worse; you can compare it, as well as the i386 SSE4.1 code, here: https://godbolt.org/z/35owtL

Related LLVM bug 39967: https://bugs.llvm.org/show_bug.cgi?id=39967

I started the discussion in LLVM, as it had the worse problem, and we have come up with a few options for faster code that do not require going scalar. You can also find the benchmark file (with outdated tests) and results there. They are from Clang, but since they use intrinsics, results are similar.

While we don't have vmulq_u64, we do have faster ways to multiply without going scalar. I have benchmarked the code, and have found this option, based on the code emitted for SSE4.1:

U64x2 goodmul_sse(U64x2 top, U64x2 bot)
{
    U32x2 topHi = vshrn_n_u64(top, 32);      // U32x2 topHi = top >> 32;
    U32x2 topLo = vmovn_u64(top);            // U32x2 topLo = top & 0xFFFFFFFF;
    U32x2 botHi = vshrn_n_u64(bot, 32);      // U32x2 botHi = bot >> 32;
    U32x2 botLo = vmovn_u64(bot);            // U32x2 botLo = bot & 0xFFFFFFFF;

    U64x2 ret64 = vmull_u32(topHi, botLo);   // U64x2 ret64 = (U64x2)topHi * (U64x2)botLo;
    ret64 = vmlal_u32(ret64, topLo, botHi);  // ret64 += (U64x2)topLo * (U64x2)botHi;
    ret64 = vshlq_n_u64(ret64, 32);          // ret64 <<= 32;
    ret64 = vmlal_u32(ret64, topLo, botLo);  // ret64 += (U64x2)topLo * (U64x2)botLo;
    return ret64;
}

If GCC can figure out how to interleave one or two of the operands, for example, changing this:

    U64x2 inp1 = vld1q_u64(p);
    U64x2 inp2 = vld1q_u64(q);
    vec = goodmul_sse(inp1, inp2);

to this (if it knows inp1 and/or inp2 are only used for multiplication):

    U32x2x2 inp1 = vld2_u32(p);
    U32x2x2 inp2 = vld2_u32(q);
    vec = goodmul_sse_interleaved(inp1, inp2);

then we can do this and save 4 cycles:

U64x2 goodmul_sse_interleaved(const U32x2x2 top, const U32x2x2 bot)
{
    U64x2 ret64 = vmull_u32(top.val[1], bot.val[0]);  // U64x2 ret64 = (U64x2)topHi * (U64x2)botLo;
    ret64 = vmlal_u32(ret64, top.val[0], bot.val[1]); // ret64 += (U64x2)topLo * (U64x2)botHi;
    ret64 = vshlq_n_u64(ret64, 32);                   // ret64 <<= 32;
    ret64 = vmlal_u32(ret64, top.val[0], bot.val[0]); // ret64 += (U64x2)topLo * (U64x2)botLo;
    return ret64;
}

Another user posted this (typos fixed). It seems to use two fewer cycles when not interleaved (not 100% sure about it), but is two cycles slower when fully interleaved.
U64x2 twomul(U64x2 top, U64x2 bot)
{
    U32x2 top_low = vmovn_u64(top);
    U32x2 bot_low = vmovn_u64(bot);
    U32x4 top_re = vreinterpretq_u32_u64(top);
    U32x4 bot_re = vrev64q_u32(vreinterpretq_u32_u64(bot));
    U32x4 prod = vmulq_u32(top_re, bot_re);
    U64x2 paired = vpaddlq_u32(prod);
    U64x2 shifted = vshlq_n_u64(paired, 32);
    return vmlal_u32(shifted, top_low, bot_low);
}

Either one of these is faster than scalar.
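For clarity, this is the scalar identity that both goodmul_sse and twomul compute per 64-bit lane (a reference sketch of mine, not part of the original report; the function name is made up):

#include <stdint.h>

/* (hi*2^32 + lo) * (hi'*2^32 + lo') mod 2^64
       = ((hi*lo' + lo*hi') << 32) + lo*lo'    */
static uint64_t mul64_lane_reference(uint64_t a, uint64_t b)
{
    uint64_t aLo = a & 0xFFFFFFFF, aHi = a >> 32;
    uint64_t bLo = b & 0xFFFFFFFF, bHi = b >> 32;
    uint64_t cross = aHi * bLo + aLo * bHi; /* only the low 32 bits of this sum survive the shift */
    return (cross << 32) + aLo * bLo;       /* wraps mod 2^64, matching the vector lanes */
}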
[Bug tree-optimization/88605] New: vector extensions: Widening or conversion generates inefficient or scalar code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

            Bug ID: 88605
           Summary: vector extensions: Widening or conversion generates inefficient or scalar code.
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

If you want to, say, convert a u32x2 vector to a u64x2 while avoiding intrinsics, good luck. GCC doesn't have a builtin like __builtin_convertvector, and doing the conversion manually generates scalar code. This makes clean generic vector code difficult.

SSE and NEON both have plenty of conversion instructions, such as pmovzxdq or vmovl.32, but GCC will not emit them.

typedef unsigned long long U64;
typedef U64 U64x2 __attribute__((vector_size(16)));
typedef unsigned int U32;
typedef U32 U32x2 __attribute__((vector_size(8)));

U64x2 vconvert_u64_u32(U32x2 v)
{
    return (U64x2) { v[0], v[1] };
}

x86_32:
Flags: -O3 -m32 -msse4.1

Clang Trunk (revision 350063):
vconvert_u64_u32:
        pmovzxdq xmm0, qword ptr [esp + 4]  # xmm0 = mem[0],zero,mem[1],zero
        ret

GCC (GCC-Explorer-Build) 9.0.0 20181225 (experimental):
convert_u64_u32:
        push    ebx
        sub     esp, 40
        movq    QWORD PTR [esp+8], mm0
        mov     ecx, DWORD PTR [esp+8]
        mov     ebx, DWORD PTR [esp+12]
        mov     DWORD PTR [esp+8], ecx
        movd    xmm0, DWORD PTR [esp+8]
        mov     DWORD PTR [esp+20], ebx
        movd    xmm1, DWORD PTR [esp+20]
        mov     DWORD PTR [esp+16], ecx
        add     esp, 40
        punpcklqdq xmm0, xmm1
        pop     ebx
        ret

I can't even understand what is going on here, except that it is wasting 44 bytes of stack for no good reason.

x86_64:
Flags: -O3 -m64 -msse4.1

Clang:
vconvert_u64_u32:
        pmovzxdq xmm0, xmm0  # xmm0 = xmm0[0],zero,xmm0[1],zero
        ret

GCC:
vconvert_u64_u32:
        movq    rax, xmm0
        movd    DWORD PTR [rsp-28], xmm0
        movd    xmm0, DWORD PTR [rsp-28]
        shr     rax, 32
        pinsrq  xmm0, rax, 1
        ret

ARMv7 NEON:
Flags: -march=armv7-a -mfloat-abi=hard -mfpu=neon -O3

Clang (with --target=arm-none-eabi):
vconvert_u64_u32:
        vmovl.u32 q0, d0
        bx        lr

arm-unknown-linux-gnueabi-gcc (GCC) 8.2.0:
vconvert_u64_u32:
        mov     r3, #0
        sub     sp, sp, #16
        add     r2, sp, #8
        vst1.32 {d0[0]}, [sp]
        vst1.32 {d0[1]}, [r2]
        str     r3, [sp, #4]
        str     r3, [sp, #12]
        vld1.64 {d0-d1}, [sp:64]
        add     sp, sp, #16
        bx      lr

aarch64 NEON:
Flags: -O3

Clang (with --target=aarch64-none-eabi):
vconvert_u64_u32:
        ushll   v0.2d, v0.2s, #0
        ret

aarch64-unknown-linux-gnu-gcc 8.2.0:
vconvert_u64_u32:
        umov    w1, v0.s[0]
        umov    w0, v0.s[1]
        uxtw    x1, w1
        uxtw    x0, w0
        dup     v0.2d, x1
        ins     v0.d[1], x0
        ret

Another issue is getting a standalone pmuludq. In Clang, this always generates pmuludq:

U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

But GCC generates this:

pmuludq:
        movdqa  xmm2, XMMWORD PTR .LC0[rip]
        pand    xmm0, xmm2
        pand    xmm2, xmm1
        movdqa  xmm4, xmm2
        movdqa  xmm1, xmm0
        movdqa  xmm3, xmm0
        psrlq   xmm4, 32
        psrlq   xmm1, 32
        pmuludq xmm0, xmm4
        pmuludq xmm1, xmm2
        pmuludq xmm3, xmm2
        paddq   xmm1, xmm0
        psllq   xmm1, 32
        paddq   xmm3, xmm1
        movdqa  xmm0, xmm3
        ret
.LC0:
        .quad 4294967295
        .quad 4294967295

and that is the best code it generates. Much worse code is generated depending on how you write it.

Meanwhile, while it has some struggles with SSE2 and x86_64, there is a reliable way to get Clang to generate pmuludq, and the NEON equivalent, vmull.u32: https://godbolt.org/z/H_tOi1
[Bug tree-optimization/88605] vector extensions: Widening or conversion generates inefficient or scalar code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

--- Comment #2 from Devin Hussey ---
While __builtin_convertvector would improve the situation, the main issue here is the blindness to some obvious patterns.

If I write this code, I want either pmovzxdq or vmovl. I don't want to waste time with scalar on the stack.

U64x2 pmovzxdq(U32x2 v)
{
    return (U64x2) { v[0], v[1] };
}

If I write this code, I want pmuludq or vmull if it can be optimized to it. I don't want to mask it and do an entire 64-bit multiply.

U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

If I do this, I don't want scalar code on NEON. I want vshl + vsri, or at the very least, vshl + vshr + vorr.

U64x2 vrol64(U64x2 v, int N)
{
    return (v << N) | (v >> (64 - N));
}

Having a generic SIMD overload library built in is awesome, but only if it saves time. If I can write one block of code that looks like normal C code but is actually optimized vector code that runs at even 80% of the speed of specialized intrinsics regardless of the platform (or even whether the platform supports SIMD at all), that saves a lot of time, especially when trying to remember the difference between _mm_mullo and _mm_mul.

If you can write your code so you can do this

#ifdef __GNUC__
typedef unsigned U32x4 __attribute__((vector_size(16)));
#else
typedef unsigned U32x4[4];
#endif

and use them interchangeably with ANSI C arrays without worrying about GCC scalarizing the code, that saves even more time. If you have to write your code like asm.js, or mix intrinsics with normal code just to get code that runs at half the speed of intrinsics, that is not beneficial.
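As a concrete illustration of the rotate case, the vshl + vsri form I have in mind looks like this with intrinsics when the rotate amount is a compile-time constant (a sketch of mine using the same idiom as the xxHash code in bug 88510 comment #4; it is not code GCC currently produces for the generic version):

#include <arm_neon.h>

/* Rotate each 64-bit lane left by 31 (the constant used in xxHash64). */
uint64x2_t vrol64_31(uint64x2_t v)
{
    return vsriq_n_u64(vshlq_n_u64(v, 31), v, 64 - 31); /* (v << 31) | (v >> 33) */
}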
[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510 Devin Hussey changed: What|Removed |Added Summary|GCC generates inefficient |GCC generates inefficient |U64x2 scalar multiply for |U64x2/v2di scalar multiply |NEON32 |for NEON32 --- Comment #1 from Devin Hussey --- I noticed that the scalarization is performed in the veclower21 stage. In making a patch for LLVM, I found that the x86 code could basically be copy-pasted over, just adding truncates and replacing the SSE instructions with NEON instructions. I would add it if someone told me where the SSE code is and where to put the NEON code. That is what helped me with the LLVM patch.
[Bug tree-optimization/88605] vector extensions: Widening or conversion generates inefficient or scalar code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

--- Comment #4 from Devin Hussey ---
I also want to note that LLVM is probably a good place to look. They have been pushing to remove as many intrinsic builtins as they can in favor of idiomatic code.

This has multiple advantages:
1. You can open the header and see what a given intrinsic really does (many SIMD instructions have inadequate documentation).
2. Platform-independent intrinsic headers.
3. More useful vector extensions.

Should we make a metabug for this? Such as "Improve vector extension pattern recognition" or something?
[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510 --- Comment #2 from Devin Hussey --- Update: I did the calculations, and twomul has the same cycle count as goodmul_sse. vmul.i32 with 128-bit operands takes 4 cycles (I assumed it was two), so just like goodmul_sse, it takes 11 cycles.
[Bug c/88698] New: Relax generic vector conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

            Bug ID: 88698
           Summary: Relax generic vector conversions
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

GCC is far too strict about vector conversions. Currently, mixing generic vector extensions and platform-specific intrinsics almost always requires either a cast or -flax-vector-conversions, which is annoying and breaks a lot of code that Clang happily accepts.

Here is my proposal:
* x86's __mNi should implicitly convert to and from any N-bit vector. This matches the void pointer-like behavior of SSE's vector types.
* Any vectors with equivalent lane types and the same number of lanes should convert without an issue. For example, a uint32_t vector_size(16) and NEON's uint32x4_t have no reason not to be compatible.
* Signed <-> unsigned should act like other implicit signed <-> unsigned conversions: a -Wextra warning in C and a warning in C++.
* Implicit conversions between other, different vector types of the same size should emit an error.
[Bug c/88698] Relax generic vector conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #2 from Devin Hussey ---
What I am saying is that I think -flax-vector-conversions should be the default, or we should only have minimal warnings instead of errors. That would make generic vectors much easier to use.

It is to be noted that Clang has -Wvector-conversion, which is the equivalent of -fno-lax-vector-conversions; however, it is a warning that only occurs with -Weverything. Not even -Wall -Wextra -Wpedantic in C++ mode will enable it. If Clang thinks this is such a minor issue that it won't even warn with -Wall -Wextra -Wpedantic, why does GCC consider it an error?

However, if you want examples, here:

Example 1 (SSE2)

Here, we are trying to use an intrinsic which accepts and returns an __m128i (defined as "long long __attribute__((vector_size(16)))") with a u32x4 (defined as "uint32_t __attribute__((vector_size(16)))"):

#include <stdint.h>
#include <immintrin.h>

typedef uint32_t u32x4 __attribute__((vector_size(16)));

u32x4 shift(u32x4 val)
{
    return _mm_srli_epi32(val, 15);
}

Clang will happily accept that, only complaining with -Wvector-conversion. GCC will fail to compile. There are three ways around that:

1. Typedef u32x4 to __m128i. This is unreasonable, because it makes the operator overloads and constructors operate on 64-bit integers instead of 32-bit ones.
2. Add -flax-vector-conversions. Requiring someone to add a warning suppression flag to compile your code is often seen as a code smell.
3. Cast. Good lord, if you thought intrinsics were ugly, this will change your mind:

    return (u32x4)_mm_srli_epi32((__m128i)val, 15);

or C++-style:

    return static_cast<u32x4>(_mm_srli_epi32(static_cast<__m128i>(val), 15));

Example 2 (ARMv7-a + NEON):

#include <arm_neon.h>

_Static_assert(sizeof(unsigned long) == sizeof(unsigned int), "use 32-bit please");

typedef unsigned long u32x4 __attribute__((vector_size(16)));

u32x4 shift(u32x4 val)
{
    return vshrq_n_u32(val, 15);
}

This is the second issue: unsigned long and unsigned int are the same size here and should have no issues converting between each other. This often comes up when uint32_t is defined as unsigned long.

Example 3 (Generic):

typedef unsigned u32x4 __attribute__((vector_size(16)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cast(u32x4 val)
{
    return val;
}

This should emit a warning without a cast. I would recommend an error, but Clang without -Wvector-conversion accepts this without any complaining.

Example 4 (Generic):

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cast(u32x2 val)
{
    return val;
}

This is clearly an error. There should be a __builtin_convertvector for this, which is being tracked in a different bug, but that is not the point here.
[Bug c/88698] Relax generic vector conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #5 from Devin Hussey ---
Well, if we are aiming for strict compliance, we might as well throw out every GCC extension in existence (including vector extensions); those aren't strictly compliant with the C/C++ standard either. /s

The whole point of an extension is to be an extension that goes beyond the standard.

#include <arm_neon.h>

uint64x2_t mult(uint64x2_t top, uint64x2_t bot)
{
    return top * bot;
}

I am breaking two rules here:
1. Using operator overloads on intrinsic types, which are not part of the standard.
2. Implying a nonexistent instruction, as there is no vmul.i64. (It is scalarized at the moment, but I explained in bug 88510 that there are better options.)

Clang even allows this:

#include <arm_neon.h>

uint32x4_t mult(uint16x8_t top, uint32x4_t bot)
{
    return top * bot;
}

in which case it reinterprets everything to the widest lane type.
[Bug target/88705] New: [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

            Bug ID: 88705
           Summary: [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

For some reason, GCC scalarizes float32x4_t and float64x2_t on ARM32 NEON when using vector extensions.

typedef float f32x4 __attribute__((vector_size(16)));
typedef double f64x2 __attribute__((vector_size(16)));

f32x4 fmul(f32x4 v1, f32x4 v2)
{
    return v1 * v2;
}

f64x2 dmul(f64x2 v1, f64x2 v2)
{
    return v1 * v2;
}

arm-none-eabi-gcc (git commit 640647d4, not the latest) -O3 -S -march=armv7-a -mfloat-abi=hard -mfpu=neon

Expected output:

fmul:
        vmul.f32 q0, q0, q1
        bx       lr
dmul:
        vmul.f64 d1, d1, d3
        vmul.f64 d0, d0, d2
        bx       lr

Actual output:

fmul:
        vmov.32  r3, d0[0]
        sub      sp, sp, #16
        vmov     s12, r3
        vmov.32  r3, d2[0]
        vmov     s9, r3
        vmov.32  r3, d0[1]
        vmul.f32 s12, s12, s9
        vstr.32  s12, [sp]
        vmov     s13, r3
        vmov.32  r3, d2[1]
        vmov     s10, r3
        vmov.32  r3, d1[0]
        vmul.f32 s13, s13, s10
        vstr.32  s13, [sp, #4]
        vmov     s14, r3
        vmov.32  r3, d1[1]
        vmov     s15, r3
        vmov.32  r3, d3[0]
        vmov     s11, r3
        vmov.32  r3, d3[1]
        vmul.f32 s14, s14, s11
        vstr.32  s14, [sp, #8]
        vmov     s0, r3
        vmul.f32 s0, s15, s0
        vstr.32  s0, [sp, #12]
        vld1.64  {d0-d1}, [sp:64]
        add      sp, sp, #16
        bx       lr

dmul:
        push     {r4, r5, r6, r7}
        sub      sp, sp, #96
        vstr     d0, [sp, #64]
        vstr     d1, [sp, #72]
        vstr     d2, [sp, #48]
        vstr     d3, [sp, #56]
        vldr.64  d17, [sp, #64]
        vldr.64  d19, [sp, #48]
        vldr.64  d16, [sp, #72]
        vldr.64  d18, [sp, #56]
        vmul.f64 d17, d17, d19
        vmul.f64 d16, d16, d18
        vstr.64  d17, [sp, #32]
        ldrd     r0, [sp, #32]
        mov      r4, r0
        mov      r5, r1
        strd     r4, [sp]
        vstr.64  d16, [sp, #40]
        ldr      r2, [sp, #40]
        ldr      ip, [sp, #44]
        str      r2, [sp, #8]
        str      ip, [sp, #12]
        vld1.64  {d0-d1}, [sp:64]
        add      sp, sp, #96
        pop      {r4, r5, r6, r7}
        bx       lr

The same thing happens for other operators. Oddly, according to Godbolt, GCC 4.5 actually did 32-bit float vectors properly, but regressed more and more each release starting in 4.6.
[Bug target/88705] [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

Devin Hussey changed:

           What       |Removed  |Added
           Status     |RESOLVED |UNCONFIRMED
           Resolution |INVALID  |---

--- Comment #3 from Devin Hussey ---
Well, it is still not as efficient as it should be. This would be the code that only uses VFP:

fmul:
        vadd.f32 s0, s0, s4
        vadd.f32 s1, s1, s5
        vadd.f32 s2, s2, s6
        vadd.f32 s3, s3, s7
        bx       lr
dmul:
        vadd.f64 d0, d0, d2
        vadd.f64 d1, d1, d3
        bx       lr

There is no need to keep swapping in and out of NEON registers.
[Bug middle-end/88670] [meta-bug] generic vector extension issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88670

Bug 88670 depends on bug 88705, which changed state.

Bug 88705 Summary: [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

           What       |Removed  |Added
           Status     |RESOLVED |UNCONFIRMED
           Resolution |INVALID  |---
[Bug c/88698] Relax generic vector conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #7 from Devin Hussey ---
I mean, sure, but how about this? What about meeting in the middle?

-fno-lax-vector-conversions generates errors like it does now.
-flax-vector-conversions shuts GCC up.
With neither flag, conversions only warn under -Wpedantic or -Wvector-conversion.

If we really want to enforce the standard, we should also add a pedantic warning for when we use overloads on intrinsic types without -std=gnu*. -Wgnu-vector-extensions or something:

warning: { arithmetic operators | logical operators | array subscripts | initializer lists } on vector types are a GNU extension

I feel that the weird promotion rules Clang uses should be an error, and assignment to different types should warn without a cast.
[Bug c/88698] Relax generic vector conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698 --- Comment #10 from Devin Hussey --- Well what about a special type attribute or some kind of transparent_union like thing for Intel's types? It seems that Intel's intrinsics are the main (only) platform that uses generic types.
[Bug c++/85052] Implement support for clang's __builtin_convertvector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #6 from Devin Hussey ---
The patch seems to be working.

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
    return __builtin_convertvector(in, u64x2);
}

It doesn't generate the best code, but it isn't bad.

x86_64, SSE4.1:
cvt:
        movq    %xmm0, %rax
        movd    %eax, %xmm0
        shrq    $32, %rax
        pinsrq  $1, %rax, %xmm0
        ret

x86_64, SSE2:
cvt:
        movq    %xmm0, %rax
        movd    %eax, %xmm0
        shrq    $32, %rax
        movq    %rax, %xmm1
        punpcklqdq %xmm1, %xmm0
        ret

ARMv7a NEON:
cvt:
        sub     sp, sp, #16
        mov     r3, #0
        str     r3, [sp, #4]
        str     r3, [sp, #12]
        add     r3, sp, #8
        vst1.32 {d0[0]}, [sp]
        vst1.32 {d0[1]}, [r3]
        vld1.64 {d0-d1}, [sp:64]
        add     sp, sp, #16
        bx      lr

I haven't built the others yet. The correct code would be this ([signed|unsigned]):

cvt:
        vmovl.[s|u]32 q0, d0
        bx            lr

I am testing other targets now. For reference, this is what Clang generates for other targets:

aarch64:
cvt:
        [s|u]shll v0.2d, v0.2s, #0
        ret

sse4.1/avx:
cvt:
        [v]pmov[s|z]xdq xmm0, xmm0
        ret

sse2:
signed_cvt:
        pxor      xmm1, xmm1
        pcmpgtd   xmm1, xmm0
        punpckldq xmm0, xmm1  # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        ret
unsigned_cvt:
        xorps     xmm1, xmm1
        unpcklps  xmm0, xmm1  # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        ret
[Bug c++/85052] Implement support for clang's __builtin_convertvector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #7 from Devin Hussey ---
Wait, silly me, this isn't about optimizations, this is about patterns. It does the same thing it was doing for this code:

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
    return (u64x2) { (unsigned long long)in[0], (unsigned long long)in[1] };
}
[Bug target/85048] [missed optimization] vector conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Devin Hussey changed:

           What    |Removed |Added
           CC      |        |husseydevin at gmail dot com

--- Comment #5 from Devin Hussey ---
ARM/AArch64 NEON use these:

From          -> To            | Intrinsic     | ARMv7-a         | AArch64
intXxY_t      -> int2XxY_t     | vmovl_sX      | vmovl.sX        | sshll #0?
uintXxY_t     -> uint2XxY_t    | vmovl_uX      | vmovl.uX        | ushll #0?
[u]int2XxY_t  -> [u]intXxY_t   | vmovn_[us]X   | vmovn.iX        | xtn
floatXxY_t    -> intXxY_t      | vcvt[q]_sX_fX | vcvt.sX.fX      | fcvtzs
floatXxY_t    -> uintXxY_t     | vcvt[q]_uX_fX | vcvt.uX.fX      | fcvtzu
intXxY_t      -> floatXxY_t    | vcvt[q]_fX_sX | vcvt.fX.sX      | scvtf
uintXxY_t     -> floatXxY_t    | vcvt[q]_fX_uX | vcvt.fX.uX      | ucvtf
float32x2_t   -> float64x2_t   | vcvt_f32_f64  | 2x vcvt.f64.f32 | fcvtl
float64x2_t   -> float32x2_t   | vcvt_f64_f32  | 2x vcvt.f32.f64 | fcvtn

Clang optimizes vmovl to vshll by zero for some reason. float32x2_t <-> float64x2_t requires 2 VFP instructions on ARMv7-a.
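For instance, the widening case from bug 88605 written with the corresponding intrinsic from the table (a small sketch of mine; ideally GCC would emit the same instruction for the equivalent generic-vector code):

#include <arm_neon.h>

/* uint32x2_t -> uint64x2_t widening, which should lower to vmovl.u32 on
   ARMv7-a and ushll #0 on AArch64. */
uint64x2_t widen_u32x2(uint32x2_t v)
{
    return vmovl_u32(v);
}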
[Bug rtl-optimization/103641] New: [aarch64][11 regression] Severe compile time regression in SLP vectorize step
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

            Bug ID: 103641
           Summary: [aarch64][11 regression] Severe compile time regression in SLP vectorize step
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Created attachment 51966
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51966&action=edit
aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c -ftime-report -ftime-report-details

While GCC 11.2 has been noticeably better at NEON64 code, with some files it hangs for more than 15-30 seconds in the SLP vectorization step. I haven't narrowed this down to a specific thing yet because I don't know much about the GCC internals, but it is *extremely* noticeable in the xxHash library (https://github.com/Cyan4973/xxHash).

This is a test compiling xxhash.c from Git revision a17161efb1d2de151857277628678b0e0b486155.

This was done on a Core i5-430M with 8 GB RAM and an SSD on Debian Bullseye amd64. GCC 10 (10.2.1-6) was from the repos, GCC 11 (11.2.0) was built from the tarball with similar flags. While this may cause bias, the two compilers get very similar times when the SLP vectorizer is off.

$ time aarch64-linux-gnu-gcc-10 -O3 -c xxhash.c

real    0m3.596s
user    0m3.270s
sys     0m0.149s

$ time aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c

real    0m31.579s
user    0m31.314s
sys     0m0.112s

When disabling the NEON intrinsics with `-DXXH_VECTOR=0`, it only takes ~21 seconds.

Time variable                         usr          sys          wall            GGC
 phase opt and generate        : 31.46 ( 97%)  0.24 ( 32%)  31.80 ( 96%)    54M ( 63%)
 callgraph functions expansion : 31.01 ( 96%)  0.18 ( 24%)  31.29 ( 94%)    42M ( 49%)
 tree slp vectorization        : 28.35 ( 88%)  0.03 (  4%)  28.37 ( 85%)  9941k ( 11%)
 TOTAL                         : 32.34         0.75         33.20           86M

This is significantly worse on my Pi 4B, where an ARMv7->AArch64 build took 3 minutes, although I presume that is mostly due to being 32-bit and the CPU being much slower.
[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641 --- Comment #19 from Devin Hussey --- > The new costs on AArch64 have a vector multiplication cost of 4, which is > very reasonable. Would this include multv2di3 by any chance? Because another thing I noticed is that GCC is also trying to multiply 64-bit numbers like it's free but it just ends up scalarizing.
[Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

            Bug ID: 103781
           Summary: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

As of GCC 11, the AArch64 backend is very greedy in trying to vectorize mulv2di3. However, there is no mulv2di3 routine, so it extracts the elements from the vector. The bad codegen should be obvious.

#include <stdint.h>

void fma_u64(uint64_t *restrict acc, const uint64_t *restrict x, const uint64_t *restrict y)
{
    for (int i = 0; i < 16384; i++) {
        acc[0] += *x++ * *y++;
        acc[1] += *x++ * *y++;
    }
}

gcc-11 -O3

fma_u64:
.LFB0:
        .cfi_startproc
        ldr     q1, [x0]
        add     x6, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x4, [x1], 16
        ldr     x5, [x2], 16
        ldr     x3, [x1, -8]
        mul     x4, x4, x5
        ldr     x5, [x2, -8]
        fmov    d0, x4
        ins     v0.d[1], x5
        mul     x3, x3, x5
        ins     v0.d[1], x3
        add     v1.2d, v1.2d, v0.2d
        cmp     x1, x6
        bne     .L2
        str     q1, [x0]
        ret
        .cfi_endproc

GCC 10.2.1 emits better code:

fma_u64:
.LFB0:
        .cfi_startproc
        ldp     x4, x3, [x0]
        add     x9, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x8, [x1], 16
        ldr     x7, [x2], 16
        ldr     x6, [x1, -8]
        ldr     x5, [x2, -8]
        madd    x4, x8, x7, x4
        madd    x3, x6, x5, x3
        cmp     x9, x1
        bne     .L2
        stp     x4, x3, [x0]
        ret
        .cfi_endproc

However, the ideal code would be a two-iteration unroll.

Side note: why not ldp in the loop?
[Bug target/103781] Cost model for SLP for aarch64 is not so good still
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781 --- Comment #2 from Devin Hussey --- Yeah my bad, I meant SLP, I get them mixed up all the time.
[Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781 --- Comment #4 from Devin Hussey --- Makes sense because the multiplier is what, 5 cycles on an A53?
[Bug target/110013] New: [i386] vector_size(8) on 32-bit ABI
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

            Bug ID: 110013
           Summary: [i386] vector_size(8) on 32-bit ABI
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Closely related to bug 86541, which was fixed on x64 only.

On 32-bit, GCC passes any vector_size(8) vectors to external functions in MMX registers, similar to how it passes 16-byte vectors in SSE registers. This appears to be the only time that GCC will ever naturally generate an MMX instruction.

This is only good if you are using MMX intrinsics and are manually handling _mm_empty(). Otherwise, if, say, you are porting over NEON code (where I found this issue) using the vector_size extension, this can cause some sneaky issues if your function fails to inline:

1. Things will likely break because GCC doesn't handle MMX and x87 properly.
   - Example of broken code (works with -mno-mmx): https://godbolt.org/z/xafWPohKb
2. You will pay a nasty performance toll, more than just a cdecl call, as GCC doesn't actually know what to do with an MMX register and just spills it into memory.
   - This can especially be seen when v2sf is used and it places the floats into MMX registers.

There are two options. The first is to use the weird ABI that Clang seems to use:

| Type             | SIMD | Params | Return  |
| float            | base | stack  | ST0:ST1 |
| float            | SSE  | XMM0-2 | XMM0    |
| double           | all  | stack  | ST0     |
| long long/__m64  | all  | stack  | EAX:EDX |
| int, short, char | base | stack  | stack   |
| int, short, char | SSE2 | stack  | XMM0    |

However, since the current ABIs aren't 100% compatible anyways, I think that a much simpler solution is to just convert to SSE like x64 does, falling back to the stack if SSE is not available. Changing the ABI to this also allows us to port MMX with SSE (bug 86541) to 32-bit mode. If you REALLY need MMX intrinsics, you can't inline, and you don't have SSE2, you can cope with a stack spill.
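To illustrate the porting hazard, here is a sketch of the kind of code that can go wrong (my own minimal example, not the one from the Godbolt link above; function names are made up):

#include <stdint.h>

typedef float f32x2 __attribute__((vector_size(8)));

/* On 32-bit x86, the v2sf argument travels through an MMX register when the
   call is not inlined, and MMX state aliases the x87 register stack. */
__attribute__((noinline)) f32x2 scale2(f32x2 v)
{
    return v + v;
}

double mix(f32x2 v, double d)
{
    f32x2 r = scale2(v);
    /* x87 arithmetic around the call; without proper emms handling this can
       be silently corrupted, and at best it is slow due to the spills. */
    return (double)r[0] + d;
}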
[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #1 from Devin Hussey ---
As a side note, the official psABI does say that function call parameters use MM0-MM2. If Clang follows its own rules instead, then the supposed stability of the ABI is meaningless anyway.
[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #2 from Devin Hussey ---
Scratch that. There is a somewhat easy way to fix this following the psABI AND using MMX with SSE.

Upon calling a function, we can have the following sequence:

func:
        movdq2q mm0, xmm0
        movq    mm1, [esp + n]
        call    mmx_func
        movq2dq xmm0, mm0
        emms

Then, this prologue:

mmx_func:
        movq2dq xmm0, mm0
        movq2dq xmm1, mm1
        emms
        ...
        movdq2q mm0, xmm0
        ret