[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32

2019-01-14 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

--- Comment #4 from Devin Hussey  ---
I am deciding to refer to goodmul as ssemul from now on. I think it is a better
name.

I am also wondering if Aarch64 gets a benefit from this vs. scalarizing if the
value is already in a NEON register. I don't have an Aarch64 device to test on.
For reference, I use an LG G3 with a Snapdragon 801 (Cortex-A15) underclocked to
4 cores @ 1.7 GHz.

I also did some testing, and twomul is also fastest if a value can be
interleaved outside of the loop (e.g. a constant). ssemul is only fastest if
either both operands can be interleaved beforehand, or the high or low bits are
known to be zero, in which case it can be simplified.

For example, the xxHash64 routine, which looks like this:

const U8 *p;
const U8 *limit = p + len - 31;
U64x2 v[2];
...
do {
    // Actually unrolled
    for (int i = 0; i < 2; i++) {
        // Load (U8 load because alignment is dumb)
        U64x2 inp = vreinterpretq_u64_u8(vld1q_u8(p));
        p += 16;
        v[i] += inp * PRIME64_2;
        v[i]  = (v[i] << 31) | (v[i] >> (64 - 31));
        v[i] *= PRIME64_1;
    }
} while (p < limit);

seems to be the fastest when implemented like this:

// Wordswap and separate low bits for twomul
const U64x2 prime1Base = vdupq_n_u64(PRIME64_1);
const U32x2 prime1Lo = vmovn_u64(prime1Base);
const U32x4 prime1Rev = vrev64q_u32(vreinterpretq_u32_u64(prime1Base));

// Interleave for ssemul
_Alignas(16) const U64 PRIME2[2] = { PRIME64_2, PRIME64_2 };
const U32x2x2 prime2 = vld2_u32((const U32 *)__builtin_assume_aligned(PRIME2, 16));

U64x2 v[2];
do {
    // actually unrolled
    for (int i = 0; i < 2; i++) {
        // Interleaved load
        U32x2x2 inp = vld2_u32((const U32 *)p);
        p += 16;

        // ssemul
        // val = (U64x2)inpLo * (U64x2)prime2Hi;
        U64x2 val = vmull_u32(inp.val[0], prime2.val[1]);

        // val += (U64x2)inpHi * (U64x2)prime2Lo;
        val = vmlal_u32(val, inp.val[1], prime2.val[0]);

        // val <<= 32;
        val = vshlq_n_u64(val, 32);

        // val += (U64x2)inpLo * (U64x2)prime2Lo;
        val = vmlal_u32(val, inp.val[0], prime2.val[0]);
        // end ssemul

        // Add
        v[i] = vaddq_u64(v[i], val);

        // Rotate left
        v[i] = vsriq_n_u64(vshlq_n_u64(v[i], 31), v[i], 33);

        // twomul
        // topLo = v[i] & 0xFFFFFFFF;
        U32x2 topLo = vmovn_u64(v[i]);

        // top = (U32x4)v[i];
        U32x4 top = vreinterpretq_u32_u64(v[i]);

        // prod = {
        //   topLo * prime1Hi,
        //   topHi * prime1Lo
        // };
        U32x4 prod = vmulq_u32(top, prime1Rev);

        // prod64 = (U64x2)prod[0] + (U64x2)prod[1];
        U64x2 prod64 = vpaddlq_u32(prod);

        // prod64 <<= 32;
        prod64 = vshlq_n_u64(prod64, 32);

        // v[i] = prod64 + (U64x2)topLo * (U64x2)prime1Lo;
        v[i] = vmlal_u32(prod64, topLo, prime1Lo);
        // end twomul
    }
} while (p < limit);

As you can see, since we can do an interleaved load on p, it is fastest to use
ssemul for the input; however, since v is used for more than just multiplication,
we use twomul for it.

On my G3 in Termux with the xxhsum 100 KB benchmark, this gets 2.65 GB/s,
compared to 0.8 GB/s scalar and 2.24 GB/s when both multiplies use ssemul.
However, this was compiled with Clang. For some reason, even though I see no
major differences in the assembly, GCC consistently produces code at roughly
80% of Clang's performance. But that is mostly an algorithm-level issue and isn't
important here.

Considering that this is 64-bit arithmetic on a 32-bit device, that is pretty
good.

[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-22 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

Devin Hussey  changed:

   What|Removed |Added

 CC||husseydevin at gmail dot com

--- Comment #4 from Devin Hussey  ---
Strangely, this doesn't seem to affect the ARM or aarch64 backends, although I
am on a December build (specifically Dec 29). 8.2 is also unaffected.
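
test.c is not quoted here, but a minimal sketch consistent with the assembly
below (element-wise addition of 64-byte vectors of u32; the typedef name is mine)
would be:

typedef unsigned int u32x16 __attribute__((vector_size(64)));

u32x16 test(u32x16 a, u32x16 b)
{
    return a + b;
}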

arm-none-eabi-gcc -mfloat-abi=hard -mfpu=neon -march=armv7-a -O3 -S test.c

test:
vldmia  r1, {d0-d7}
vldmia  r2, {d24-d31}
vadd.i32    q8, q0, q12
vadd.i32    q9, q1, q13
vadd.i32    q10, q2, q14
vadd.i32    q11, q3, q15
vstmia  r0, {d16-d23}
bx  lr

aarch64-none-eabi-gcc -O3 -S test.c

test:
ld1 {v16.16b - v19.16b}, [x1]
ld1 {v4.16b - v7.16b}, [x2]
add v0.4s, v16.4s, v4.4s
add v1.4s, v17.4s, v5.4s
add v2.4s, v18.4s, v6.4s
add v3.4s, v19.4s, v7.4s
st1 {v0.16b - v3.16b}, [x0]
ret

Amusingly, Clang trunk for ARMv7-a has a similar issue (aarch64 is fine).

test:
.fnstart
.save   {r11, lr}
push    {r11, lr}
add r3, r1, #48
mov lr, r1
mov r12, r2
vld1.64 {d20, d21}, [r3]
add r3, r2, #48
add r1, r1, #32
vld1.32 {d16, d17}, [lr]!
vld1.32 {d18, d19}, [r12]!
vadd.i32    q8, q9, q8
vld1.64 {d22, d23}, [r3]
vadd.i32    q10, q11, q10
vld1.64 {d26, d27}, [r1]
add r1, r2, #32
vld1.64 {d28, d29}, [r1]
add r1, r0, #48
vadd.i32    q11, q14, q13
vld1.64 {d24, d25}, [lr]
vld1.64 {d18, d19}, [r12]
vadd.i32    q9, q9, q12
vst1.64 {d20, d21}, [r1]
add r1, r0, #32
vst1.32 {d16, d17}, [r0]!
vst1.64 {d22, d23}, [r1]
vst1.64 {d18, d19}, [r0]
pop {r11, pc}

[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-22 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

--- Comment #9 from Devin Hussey  ---
(In reply to Andrew Pinski from comment #6)
> Try using 128 (or 256) and you might see that aarch64 falls down similarly.

yup. Oof.
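
The widened test is not quoted either; my reconstruction of it, following the
suggestion of going to vector_size(128), is simply:

typedef unsigned int u32x32 __attribute__((vector_size(128)));

u32x32 test(u32x32 a, u32x32 b)
{
    return a + b;
}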

test:
sub sp, sp, #560
stp x29, x30, [sp]
mov x29, sp
stp x19, x20, [sp, 16]
mov x19, 128
mov x20, x0
add x0, sp, 176
str x21, [sp, 32]
mov x21, x2
mov x2, x19
bl  memcpy
mov x2, x19
mov x1, x21
add x0, sp, 304
bl  memcpy
ldr q7, [sp, 176]
mov x2, x19
ldr q6, [sp, 192]
add x1, sp, 48
ldr q5, [sp, 208]
mov x0, x20
ldr q4, [sp, 224]
ldr q3, [sp, 240]
ldr q2, [sp, 256]
ldr q1, [sp, 272]
ldr q0, [sp, 288]
ldr q23, [sp, 304]
ldr q22, [sp, 320]
ldr q21, [sp, 336]
ldr q20, [sp, 352]
ldr q19, [sp, 368]
ldr q18, [sp, 384]
ldr q17, [sp, 400]
ldr q16, [sp, 416]
add v7.4s, v7.4s, v23.4s
add v6.4s, v6.4s, v22.4s
add v5.4s, v5.4s, v21.4s
add v4.4s, v4.4s, v20.4s
add v3.4s, v3.4s, v19.4s
str q7, [sp, 48]
add v2.4s, v2.4s, v18.4s
str q6, [sp, 64]
add v1.4s, v1.4s, v17.4s
str q5, [sp, 80]
add v0.4s, v0.4s, v16.4s
str q4, [sp, 96]
str q3, [sp, 112]
str q2, [sp, 128]
str q1, [sp, 144]
str q0, [sp, 160]
bl  memcpy
ldp x29, x30, [sp]
ldp x19, x20, [sp, 16]
ldr x21, [sp, 32]
add sp, sp, 560
ret

[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-22 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

--- Comment #10 from Devin Hussey  ---
I also want to add that aarch64 shouldn't even be spilling; it has 32 NEON
registers, and with 128-byte vectors this only needs 24 (8 q registers for each
of the two inputs plus 8 for the result).

[Bug regression/93418] New: GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-24 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

Bug ID: 93418
   Summary: GCC incorrectly constant propagates _mm_sllv/srlv/srav
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: regression
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Regression starting in GCC 9

Currently, when the shift amounts are constant, GCC constant-propagates the AVX2
_mm_sllv family as if every element were shifted by the first shift amount,
instead of shifting each element individually.

#include <immintrin.h>
#include <stdio.h>

// force -O0
__attribute__((__optimize__("-O0")))
void unoptimized() {
    __m128i vals = _mm_set1_epi32(0xFFFFFFFF);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n", _mm_extract_epi32(shifted, 0),
           _mm_extract_epi32(shifted, 1), _mm_extract_epi32(shifted, 2),
           _mm_extract_epi32(shifted, 3));
}

// force -O3
__attribute__((__optimize__("-O3")))
void optimized() {
    __m128i vals = _mm_set1_epi32(0xFFFFFFFF);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n", _mm_extract_epi32(shifted, 0),
           _mm_extract_epi32(shifted, 1), _mm_extract_epi32(shifted, 2),
           _mm_extract_epi32(shifted, 3));
}

int main() {
    printf("Without optimizations (correct result):\t");
    unoptimized();
    printf("With optimizations (incorrect result):\t");
    optimized();
}


I would expect this code to emit the following:

Without optimizations (correct result):   ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):    ffff0000 80000000 00000000 fffffff8

Clang and GCC < 9 exhibit the first output, but 9.1 and later do not.

However, I get this output on GCC 9 and later:

Without optimizations (correct result):   ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):    ffff0000 ffff0000 ffff0000 ffff0000

Godbolt link: https://gcc.godbolt.org/z/oC3Psp

[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-24 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

--- Comment #3 from Devin Hussey  ---
I think I found the culprit commit.

Haven't set up a GCC build tree yet, though. 

https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b

[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-24 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

Devin Hussey  changed:

   What|Removed |Added

  Build||2020-01-24 0:00

--- Comment #5 from Devin Hussey  ---
Finally got GCC to build after it was throwing a fit.

I can confirm that the regression is in that commit.

g:28a8a768ebef5e31f950013f1b48b14c008b4b3b works correctly, 
g:6a03477e85e1b097ed6c0b86c76436de575aef04 does not.

[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-27 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418

--- Comment #8 from Devin Hussey  ---
Seems to work.

~ $ ~/gcc-test/bin/x86_64-pc-cygwin-gcc.exe -mavx2 -O3 _mm_sllv_bug.c

~ $ ./a.exe
Without optimizations (correct result):  ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):   ffff0000 80000000 00000000 fffffff8

~ $

And checking the assembly, the shifts are constant propagated.

The provided test file also passes.

[Bug target/88255] New: Thumb-1: GCC too aggressive on mul->lsl/sub/add optimization

2018-11-28 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88255

Bug ID: 88255
   Summary: Thumb-1: GCC too aggressive on mul->lsl/sub/add
optimization
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

I might be wrong, but it appears that GCC is too aggressive in its conversion
from multiplication to shift+add when targeting Thumb-1.

It is true that, for example, the Cortex-M0 can have the small multiplier, in
which case a 16-cycle shift sequence would be faster. However, I was targeting
arm7tdmi (-march=armv4t -mthumb -O3 -mtune=arm7tdmi), which, if I am not
mistaken, takes one multiplier cycle for every 8 bits of the multiplier value.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0234b/i102180.html
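
For reference, a rough model of the early-termination rule described there, as I
read it (a sketch; one internal cycle per 8 significant bits of the multiplier,
ignoring the all-ones early termination for signed multiplies):

/* Sketch: approximate ARM7TDMI MUL cost = 1 issue cycle + m internal cycles,
   where m grows by one for each 8 significant bits of the multiplier. */
static unsigned arm7tdmi_mul_cycles(unsigned multiplier)
{
    unsigned m = 1;
    while (m < 4 && (multiplier >> (8 * m)) != 0)
        m++;
    return 1 + m;
}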

However, looking in the source code, I notice that the cost loop is dividing by 4.
I think that might be a bug, and it causes the otherwise roughly 7-cycle (I think)
sequence in the code below to be considered as having a weight of 18 cycles.

https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm.c#L8959

I could be wrong, but one of the things I noticed is that very old versions of
GCC (2.95) will not perform this many shifts, and that Clang, when given the
transpiled output in C and targeting the same platform, will actually
convert it back into an ldr/mul.

However, when targeting cortex-m0plus.small-multiply, Clang will still turn it
into a multiplication.

Code example: 

  unsigned MultiplyByPrime(unsigned val)
  {
  return val * 2246822519U;
  }

  MultiplyByPrime:
 lsls    r3, r0, #7 @ unsigned ret = val << 7;
 subs    r3, r3, r0 @ ret -= val;
 lsls    r3, r3, #5 @ ret <<= 5;
 subs    r3, r3, r0 @ ret -= val;
 lsls    r3, r3, #2 @ ret <<= 2;
 adds    r3, r3, r0 @ ret += val;
 lsls    r2, r3, #3 @ unsigned tmp = ret << 3;
 adds    r3, r3, r2 @ ret += tmp;
 lsls    r3, r3, #1 @ ret <<= 1;
 adds    r3, r3, r0 @ ret += val;
 lsls    r3, r3, #6 @ ret <<= 6;
 adds    r3, r3, r0 @ ret += val;
 lsls    r2, r3, #4 @ tmp = ret << 4;
 subs    r3, r2, r3 @ ret = tmp - ret;
 lsls    r3, r3, #3 @ ret <<= 3;
 subs    r0, r3, r0 @ ret -= val;
 bx  lr @ return ret;

[Bug target/88510] New: GCC generates inefficient U64x2 scalar multiply for NEON32

2018-12-14 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

Bug ID: 88510
   Summary: GCC generates inefficient U64x2 scalar multiply for
NEON32
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Note: I use these typedefs here for brevity.

typedef uint64x2_t U64x2;
typedef uint32x2_t U32x2;
typedef uint32x2x2_t U32x2x2;
typedef uint32x4_t U32x4;

GCC and Clang both have issues with this code on ARMv7a NEON, and will switch
to scalar:

U64x2 multiply(U64x2 top, U64x2 bot)
{
return top * bot;
}

gcc-8 -mfloat-abi=hard -mfpu=neon -O3 -S -march=armv7-a 

multiply:
push    {r4, r5, r6, r7, lr}
sub sp, sp, #20
vmov    r0, r1, d0  @ v2di
vmov    r6, r7, d2  @ v2di
vmov    r2, r3, d1  @ v2di
vmov    r4, r5, d3  @ v2di
mul lr, r0, r7
mla lr, r6, r1, lr
mul ip, r2, r5
umull   r0, r1, r0, r6
mla ip, r4, r3, ip
add r1, lr, r1
umull   r2, r3, r2, r4
strd    r0, [sp]
add r3, ip, r3
strd    r2, [sp, #8]
vld1.64 {d0-d1}, [sp:64]
add sp, sp, #20
pop {r4, r5, r6, r7, pc}

Clang's is worse, and you can compare the output, as well as the i386 SSE4.1
code here: https://godbolt.org/z/35owtL

Related LLVM bug 39967: https://bugs.llvm.org/show_bug.cgi?id=39967

I started the discussion in LLVM, as it had the worse problem, and we have come
up with a few options for faster code that do not require going scalar. You can
also find the benchmark file (with outdated tests) and results there. They
are from Clang, but since they use intrinsics, the results are similar.

While we don't have vmulq_u64, we do have faster ways to multiply without going
scalar.

I have benchmarked the code, and have found this option, based on the code
emitted for SSE4.1:

U64x2 goodmul_sse(U64x2 top, U64x2 bot)
{
    U32x2 topHi = vshrn_n_u64(top, 32);     // U32x2 topHi = top >> 32;
    U32x2 topLo = vmovn_u64(top);           // U32x2 topLo = top & 0xFFFFFFFF;
    U32x2 botHi = vshrn_n_u64(bot, 32);     // U32x2 botHi = bot >> 32;
    U32x2 botLo = vmovn_u64(bot);           // U32x2 botLo = bot & 0xFFFFFFFF;

    U64x2 ret64 = vmull_u32(topHi, botLo);  // U64x2 ret64   = (U64x2)topHi * (U64x2)botLo;
    ret64 = vmlal_u32(ret64, topLo, botHi); //       ret64  += (U64x2)topLo * (U64x2)botHi;
    ret64 = vshlq_n_u64(ret64, 32);         //       ret64 <<= 32;
    ret64 = vmlal_u32(ret64, topLo, botLo); //       ret64  += (U64x2)topLo * (U64x2)botLo;
    return ret64;
}

If GCC can figure out how to interleave one or two of the operands, for
example, changing this:

U64x2 inp1 = vld1q_u64(p);
U64x2 inp2 = vld1q_u64(q);
vec = goodmul_sse(inp1, inp2);

to this (if it knows inp1 and/or inp2 are only used for multiplication):

U32x2x2 inp1 = vld2_u32(p);
U32x2x2 inp2 = vld2_u32(q);
vec = goodmul_sse_interleaved(inp1, inp2);

then we can do this and save 4 cycles:

U64x2 goodmul_sse_interleaved(const U32x2x2 top, const U32x2x2 bot)
{
    U64x2 ret64 = vmull_u32(top.val[1], bot.val[0]);  // U64x2 ret64   = (U64x2)topHi * (U64x2)botLo;
    ret64 = vmlal_u32(ret64, top.val[0], bot.val[1]); //       ret64  += (U64x2)topLo * (U64x2)botHi;
    ret64 = vshlq_n_u64(ret64, 32);                   //       ret64 <<= 32;
    ret64 = vmlal_u32(ret64, top.val[0], bot.val[0]); //       ret64  += (U64x2)topLo * (U64x2)botLo;
    return ret64;
}

Another user posted this (typos fixed).

It seems to use two fewer cycles when not interleaved (not 100% sure about it),
but to be two cycles slower when everything is fully interleaved.

U64x2 twomul(U64x2 top, U64x2 bot)
{
    U32x2 top_low = vmovn_u64(top);
    U32x2 bot_low = vmovn_u64(bot);
    U32x4 top_re = vreinterpretq_u32_u64(top);
    U32x4 bot_re = vrev64q_u32(vreinterpretq_u32_u64(bot));
    U32x4 prod = vmulq_u32(top_re, bot_re);
    U64x2 paired = vpaddlq_u32(prod);
    U64x2 shifted = vshlq_n_u64(paired, 32);
    return vmlal_u32(shifted, top_low, bot_low);
}

Either one of these is faster than scalar.

[Bug tree-optimization/88605] New: vector extensions: Widening or conversion generates inefficient or scalar code.

2018-12-26 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

Bug ID: 88605
   Summary: vector extensions: Widening or conversion generates
inefficient or scalar code.
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

If you want to, say, convert a u32x2 vector to a u64x2 while avoiding
intrinsics, good luck.

GCC doesn't have a builtin like __builtin_convertvector, and doing the
conversion manually generates scalar code. This makes clean generic vector code
difficult.

SSE and NEON both have plenty of conversion instructions, such as pmovzxdq or
vmovl.32, but GCC will not emit them. 

typedef unsigned long long U64;
typedef U64 U64x2 __attribute__((vector_size(16)));
typedef unsigned int U32;
typedef U32 U32x2 __attribute__((vector_size(8)));

U64x2 vconvert_u64_u32(U32x2 v)
{
return (U64x2) { v[0], v[1] };
}

x86_32:

Flags: -O3 -m32 -msse4.1
Clang Trunk (revision 350063)

vconvert_u64_u32:
pmovzxdq    xmm0, qword ptr [esp + 4]   # xmm0 = mem[0],zero,mem[1],zero
ret

GCC (GCC-Explorer-Build) 9.0.0 20181225 (experimental)
convert_u64_u32:
push    ebx
sub esp, 40
movq    QWORD PTR [esp+8], mm0
mov ecx, DWORD PTR [esp+8]
mov ebx, DWORD PTR [esp+12]
mov DWORD PTR [esp+8], ecx
movd    xmm0, DWORD PTR [esp+8]
mov DWORD PTR [esp+20], ebx
movd    xmm1, DWORD PTR [esp+20]
mov DWORD PTR [esp+16], ecx
add esp, 40
punpcklqdq  xmm0, xmm1
pop ebx
ret
I can't even understand what is going on here, except it is wasting 44 bytes of
stack for no good reason.

x86_64: 

Flags: -O3 -m64 -msse4.1

Clang:

vconvert_u64_u32:
pmovzxdq    xmm0, xmm0  # xmm0 = xmm0[0],zero,xmm0[1],zero
ret

GCC:
vconvert_u64_u32:
movq    rax, xmm0
movd    DWORD PTR [rsp-28], xmm0
movd    xmm0, DWORD PTR [rsp-28]
shr rax, 32
pinsrq  xmm0, rax, 1
ret

ARMv7 NEON:
Flags: -march=armv7-a -mfloat-abi=hard -mfpu=neon -O3

Clang (with --target=arm-none-eabi):
vconvert_u64_u32:
vmovl.u32   q0, d0
bx  lr

arm-unknown-linux-gnueabi-gcc (GCC) 8.2.0:
vconvert_u64_u32:
mov r3, #0
sub sp, sp, #16
add r2, sp, #8
vst1.32 {d0[0]}, [sp]
vst1.32 {d0[1]}, [r2]
str r3, [sp, #4]
str r3, [sp, #12]
vld1.64 {d0-d1}, [sp:64]
add sp, sp, #16
bx  lr

aarch64 NEON:
Flags: -O3

Clang (with --target=aarch64-none-eabi):
vconvert_u64_u32:
ushll   v0.2d, v0.2s, #0
ret

aarch64-unknown-linux-gnu-gcc 8.2.0:

vconvert_u64_u32:
umov    w1, v0.s[0]
umov    w0, v0.s[1]
uxtw    x1, w1
uxtw    x0, w0
dup v0.2d, x1
ins v0.d[1], x0
ret
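
For contrast, the intrinsic forms that the conversion above could ideally lower
to (a sketch I am adding for illustration; the function names are mine):

#if defined(__SSE4_1__)
#include <smmintrin.h>
/* Zero-extend the low two 32-bit lanes to 64 bits: pmovzxdq. */
__m128i widen_sse41(__m128i v)
{
    return _mm_cvtepu32_epi64(v);
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
/* Widen u32x2 to u64x2: vmovl.u32 on ARMv7, ushll #0 on AArch64. */
uint64x2_t widen_neon(uint32x2_t v)
{
    return vmovl_u32(v);
}
#endif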

Another example is getting a standalone pmuludq.

In Clang, this always generates pmuludq:
U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

But GCC generates this:
pmuludq:
movdqa  xmm2, XMMWORD PTR .LC0[rip]
pand    xmm0, xmm2
pand    xmm2, xmm1
movdqa  xmm4, xmm2
movdqa  xmm1, xmm0
movdqa  xmm3, xmm0
psrlq   xmm4, 32
psrlq   xmm1, 32
pmuludq xmm0, xmm4
pmuludq xmm1, xmm2
pmuludq xmm3, xmm2
paddq   xmm1, xmm0
psllq   xmm1, 32
paddq   xmm3, xmm1
movdqa  xmm0, xmm3
ret
.LC0:
.quad   4294967295
.quad   4294967295

and that is the best code it generates. Much worse code is generated depending
on how you write it.

Meanwhile, while Clang has some struggles with SSE2 and x86_64, there is a
reliable way to get it to generate pmuludq, and the NEON equivalent,
vmull.u32:
https://godbolt.org/z/H_tOi1

[Bug tree-optimization/88605] vector extensions: Widening or conversion generates inefficient or scalar code.

2018-12-27 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

--- Comment #2 from Devin Hussey  ---
While __builtin_convertvector would improve the situation, the main issue here
is the blindness to some obvious patterns.

If I write this code, I want either pmovzxdq or vmovl. I don't want to waste
time going through scalar code on the stack.

U64x2 pmovzxdq(U32x2 v)
{
    return (U64x2) { v[0], v[1] };
}

If I write this code, I want pmuludq or vmull if it can be optimized to that. I
don't want it to mask and then do an entire 64-bit multiply.

U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

If I do this, I don't want scalar code on NEON. I want vshl + vsri, or at the
very least, vshl + vshr + vorr.

U64x2 vrol64(U64x2 v, int N)
{
    return (v << N) | (v >> (64 - N));
}
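
For reference, the vshl + vsri sequence I mean, written with intrinsics (a
sketch; the rotate amount must be an immediate for the _n_ intrinsics, so 31 is
hard-coded here):

#include <arm_neon.h>

/* Rotate each 64-bit lane left by 31: shift left, then shift-right-insert. */
uint64x2_t vrol64_by31(uint64x2_t v)
{
    return vsriq_n_u64(vshlq_n_u64(v, 31), v, 64 - 31);
}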

Having a generic SIMD overload library built in is awesome, but only if it
saves time.

If I can write one block of code that looks like normal C code but is actually
optimized vector code that runs at even 80% of the speed of specialized
intrinsics regardless of the platform (or of whether the platform supports SIMD
at all), that saves a lot of time, especially when trying to remember the
difference between _mm_mullo and _mm_mul.

If you can write your code so you can do this

#ifdef __GNUC__
typedef unsigned U32x4 __attribute__((vector_size(16)));
#else
typedef unsigned U32x4[4];
#endif

and use them interchangeably with ANSI C arrays without worrying about GCC
scalarizing the code, that saves even more time.
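
As a sketch of what I mean (my own example), the same function body works
whether U32x4 is a true vector type or a plain array, as long as GCC vectorizes
the latter:

void add4(U32x4 *dst, const U32x4 *a, const U32x4 *b)
{
    // GCC allows subscripting vector values, so this compiles either way.
    for (int i = 0; i < 4; i++)
        (*dst)[i] = (*a)[i] + (*b)[i];
}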

If you have to write your code like asm.js or mix intrinsics with normal code
just to get code that runs at half the speed of intrinsics, that is not
beneficial.

[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32

2018-12-31 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

Devin Hussey  changed:

   What|Removed |Added

Summary|GCC generates inefficient   |GCC generates inefficient
   |U64x2 scalar multiply for   |U64x2/v2di scalar multiply
   |NEON32  |for NEON32

--- Comment #1 from Devin Hussey  ---
I noticed that the scalarization is performed in the veclower21 stage. 

In making a patch for LLVM, I found that the x86 code could basically be
copy-pasted over, just adding truncates and replacing the SSE instructions with
NEON instructions; that is what helped me with the LLVM patch. I would add the
same here if someone told me where the SSE code is and where to put the NEON
code.

[Bug tree-optimization/88605] vector extensions: Widening or conversion generates inefficient or scalar code.

2019-01-02 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605

--- Comment #4 from Devin Hussey  ---
I also want to note that LLVM is probably a good place to look. They have been
pushing to remove as many intrinsic builtins as they can in favor of idiomatic
code.

This has multiple advantages:
 1. You can open up the intrinsic header and see what an intrinsic really does
(many SIMD instructions have inadequate documentation)
 2. Platform-independent intrinsic headers
 3. More useful vector extensions

Should we make a metabug for this?
Such as "Improve vector extension pattern recognition" or something?

[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32

2019-01-03 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

--- Comment #2 from Devin Hussey  ---
Update: I did the calculations, and twomul has the same cycle count as
goodmul_sse. vmul.i32 with 128-bit operands takes 4 cycles (I assumed it was
two), so just like goodmul_sse, it takes 11 cycles.

[Bug c/88698] New: Relax generic vector conversions

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

Bug ID: 88698
   Summary: Relax generic vector conversions
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

GCC is far too strict about vector conversions. 

Currently, mixing generic vector extensions and platform-specific intrinsics
almost always requires either a cast or -flax-vector-extensions, which is
annoying and breaks a lot of things Clang happily accepts.

Here is my proposal:
* x86's __mNi should implicitly convert between any N-bit vector. This matches
the void pointer-like behavior of SSE's vectors.
* Any vector with equivalent lane types and number of lanes should convert
without an issue. For example, uint32_t vector_size(16) and NEON's uint32x4_t
have no reason not to be compatible. 
* Signed <-> unsigned should act like other implicit signed <-> unsigned
conversions (a -Wextra warning in C, a warning in C++).
* Implicit conversions between different vectors of the same size should emit
an error.

[Bug c/88698] Relax generic vector conversions

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #2 from Devin Hussey  ---
What I am saying is that I think -flax-vector-conversions should be default, or
we should only have minimal warnings instead of errors.

That will make generic vectors much easier to use.

It is to be noted that Clang has -Wvector-conversion, which is the equivalent
of -fno-lax-vector-conversions; however, it is a warning that is only enabled by
-Weverything. Not even -Wall -Wextra -Wpedantic in C++ mode will enable it.

If Clang thinks this is such a minor issue that it won't even warn with -Wall
-Wextra -Wpedantic, why does GCC consider it an error?



However, if you want examples, here:

Example 1 (SSE2)

Here, we are trying to use an intrinsic which accepts and returns an __m128i
(defined as "long long __attribute__((vector_size(16)))") with a u32x4 (defined
as "uint32_t __attribute__((vector_size(16)))").


#include <immintrin.h>
#include <stdint.h>

typedef uint32_t u32x4 __attribute__((vector_size(16)));

u32x4 shift(u32x4 val)
{
return _mm_srli_epi32(val, 15);
}


On Clang, it will happily accept that, only complaining on -Wvector-conversion.

GCC will fail to compile. 

There are three ways around that:
1. Typedef u32x4 to __m128i. This is unreasonable, because that causes the
operator overloads and constructors to operate on 64-bit integers instead of
32-bit.
2. Add -flax-vector-conversions. Requiring someone to add a warning suppression
flag to compile your code is often seen as a code smell.
3. Cast. Good lord, if you thought intrinsics were ugly, this will change your
mind:

return (u32x4)_mm_srli_epi32((__m128i)val, 15);

or C++-style:

return static_cast<u32x4>(_mm_srli_epi32(static_cast<__m128i>(val), 15));


Example 2 (ARMv7-a + NEON):


#include <arm_neon.h>
_Static_assert(sizeof(unsigned long) == sizeof(unsigned int), "use 32-bit please");

typedef unsigned long u32x4 __attribute__((vector_size(16)));

u32x4 shift(u32x4 val)
{
return vshrq_n_u32(val, 15);
}


This is the second issue: unsigned long and unsigned int are the same size and
should have no issues converting between each other.

This often comes from a situation where uint32_t is set to unsigned long.


Example 3 (Generic):


typedef unsigned u32x4 __attribute__((vector_size(16)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cast(u32x4 val)
{
return val;
}


This should emit a warning without a cast. I would recommend an error, but
Clang without -Wvector-conversion accepts this without any complaining.


Example 4 (Generic): 


typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));


u64x2 cast(u32x2 val)
{
return val;
}


This is clearly an error. There should be a __builtin_convertvector for this,
which is being tracked in a different bug, but that is not the point.

[Bug c/88698] Relax generic vector conversions

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #5 from Devin Hussey  ---
Well, if we are aiming for strict compliance, we might as well throw out every
GCC extension in existence (including vector extensions); those aren't strictly
compliant with the C/C++ standard either. /s

The whole point of an extension is to go beyond what the standard specifies.


#include <arm_neon.h>

uint64x2_t mult(uint64x2_t top, uint64x2_t bot)
{
return top * bot;
}


I am breaking two rules here:
1. Using operator overloads, which are not part of the standard.
2. Implying a nonexistent instruction, as there is no vmul.i64. (it is
scalarized at the moment, but I explained in bug 88510 that there are better
options)


Clang even allows this:


#include <arm_neon.h>

uint32x4_t mult(uint16x8_t top, uint32x4_t bot)
{
return top * bot;
}


In which case it will reinterpret everything as the widest lane type.

[Bug target/88705] New: [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

Bug ID: 88705
   Summary: [ARM][Generic Vector Extensions] float32x4/float64x2
vector operator overloads scalarize on NEON
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

For some reason, GCC scalarizes float32x4_t and float64x2_t on ARM32 NEON when
using vector extensions. 

typedef float f32x4 __attribute__((vector_size(16)));
typedef double f64x2 __attribute__((vector_size(16)));

f32x4 fmul (f32x4 v1, f32x4 v2)
{
   return v1 * v2;
}
f64x2 dmul (f64x2 v1, f64x2 v2)
{
   return v1 * v2;
}

Expected output:

arm-none-eabi-gcc (git commit 640647d4, not the latest) -O3 -S -march=armv7-a
-mfloat-abi=hard -mfpu=neon

fmul:
vmul.f32 q0, q0, q1
bx lr
dmul:
vmul.f64 d1, d1, d3
vmul.f64 d0, d0, d2
bx lr

Actual output:

fmul:
vmov.32 r3, d0[0]
sub sp, sp, #16
vmov    s12, r3
vmov.32 r3, d2[0]
vmov    s9, r3
vmov.32 r3, d0[1]
vmul.f32    s12, s12, s9
vstr.32 s12, [sp]
vmov    s13, r3
vmov.32 r3, d2[1]
vmov    s10, r3
vmov.32 r3, d1[0]
vmul.f32    s13, s13, s10
vstr.32 s13, [sp, #4]
vmov    s14, r3
vmov.32 r3, d1[1]
vmov    s15, r3
vmov.32 r3, d3[0]
vmov    s11, r3
vmov.32 r3, d3[1]
vmul.f32    s14, s14, s11
vstr.32 s14, [sp, #8]
vmov    s0, r3
vmul.f32    s0, s15, s0
vstr.32 s0, [sp, #12]
vld1.64 {d0-d1}, [sp:64]
add sp, sp, #16
bx  lr
dmul:
push    {r4, r5, r6, r7}
sub sp, sp, #96
vstr    d0, [sp, #64]
vstr    d1, [sp, #72]
vstr    d2, [sp, #48]
vstr    d3, [sp, #56]
vldr.64 d17, [sp, #64]
vldr.64 d19, [sp, #48]
vldr.64 d16, [sp, #72]
vldr.64 d18, [sp, #56]
vmul.f64    d17, d17, d19
vmul.f64    d16, d16, d18
vstr.64 d17, [sp, #32]
ldrd    r0, [sp, #32]
mov r4, r0
mov r5, r1
strd    r4, [sp]
vstr.64 d16, [sp, #40]
ldr r2, [sp, #40]
ldr ip, [sp, #44]
str r2, [sp, #8]
str ip, [sp, #12]
vld1.64 {d0-d1}, [sp:64]
add sp, sp, #96
pop {r4, r5, r6, r7}
bx  lr

The same thing happens for other operators.

Oddly, according to Godbolt, GCC 4.5 actually did 32-bit float vectors
properly, but regressed more and more each release starting in 4.6.

[Bug target/88705] [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

Devin Hussey  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

--- Comment #3 from Devin Hussey  ---
Well, it is still not as efficient as it should be.

This would be the code that only uses VFP:

fmul:
vadd.f32    s0, s0, s4
vadd.f32    s1, s1, s5
vadd.f32    s2, s2, s6
vadd.f32    s3, s3, s7
bx  lr

dmul:
vadd.f64    d0, d0, d2
vadd.f64    d1, d1, d3
bx  lr

There is no need to keep swapping in and out of NEON registers.

[Bug middle-end/88670] [meta-bug] generic vector extension issues

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88670
Bug 88670 depends on bug 88705, which changed state.

Bug 88705 Summary: [ARM][Generic Vector Extensions] float32x4/float64x2 vector 
operator overloads scalarize on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

[Bug c/88698] Relax generic vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #7 from Devin Hussey  ---
I mean, sure, but how about this?

What about meeting in the middle?

-fno-lax-vector-conversions generates errors like it does now.
-flax-vector-conversions shuts GCC up.
No flag causes warnings with -Wpedantic or -Wvector-conversion.

If we really want to enforce the standard, we should also add a pedantic
warning for when we use overloads on intrinsic types without -std=gnu*.
-Wgnu-vector-extensions or something:

warning:
{
   arithmetic operators |
   logical operators |
   array subscripts |
   initializer lists
}
on vector types are a GNU extension

I feel that the weird promotion rules Clang uses should be an error, and that
assignment between different vector types without a cast should warn.

[Bug c/88698] Relax generic vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698

--- Comment #10 from Devin Hussey  ---
Well, what about a special type attribute or some kind of transparent_union-like
thing for Intel's types? It seems that Intel's intrinsics are the main (only)
platform that uses generic vector types this way.

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #6 from Devin Hussey  ---
The patch seems to be working.

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
return __builtin_convertvector(in, u64x2);
}

It doesn't generate the best code, but it isn't bad.

x86_64, SSE4.1:

cvt:
movq    %xmm0, %rax
movd    %eax, %xmm0
shrq    $32, %rax
pinsrq  $1, %rax, %xmm0
ret

x86_64, SSE2:

cvt:
movq    %xmm0, %rax
movd    %eax, %xmm0
shrq    $32, %rax
movq    %rax, %xmm1
punpcklqdq  %xmm1, %xmm0
ret

ARMv7a NEON:

cvt:
sub sp, sp, #16
mov r3, #0
str r3, [sp, #4]
str r3, [sp, #12]
add r3, sp, #8
vst1.32 {d0[0]}, [sp]
vst1.32 {d0[1]}, [r3]
vld1.64 {d0-d1}, [sp:64]
add sp, sp, #16
bx  lr

I haven't built the others yet.

The correct code would be this ([signed|unsigned]):

cvt:
vmovl.[s|u]32   q0, d0
bx lr

I am testing other targets now. 

For reference, this is what Clang generates for other targets:

aarch64:

cvt:
[s|u]shll   v0.2d, v0.2s, #0
ret

sse4.1/avx:

cvt:
[v]pmov[s|z]xdq xmm0, xmm0
ret

sse2:

signed_cvt:
pxor    xmm1, xmm1
pcmpgtd xmm1, xmm0
punpckldq   xmm0, xmm1  # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
ret

unsigned_cvt:
xorps   xmm1, xmm1
unpcklps    xmm0, xmm1  # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
ret

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052

--- Comment #7 from Devin Hussey  ---
Wait, silly me, this isn't about optimizations, this is about patterns.

It does the same thing it was doing for this code:

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in)
{
return (u64x2) { (unsigned long long)in[0], (unsigned long long)in[1] };
}

[Bug target/85048] [missed optimization] vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Devin Hussey  changed:

   What|Removed |Added

 CC||husseydevin at gmail dot com

--- Comment #5 from Devin Hussey  ---
ARM/AArch64 NEON use these:

From            To             Intrinsic       ARMv7-a          AArch64
intXxY_t     -> int2XxY_t      vmovl_sX        vmovl.sX         sshll #0?
uintXxY_t    -> uint2XxY_t     vmovl_uX        vmovl.uX         ushll #0?
[u]int2XxY_t -> [u]intXxY_t    vmovn_[us]X     vmovn.iX         xtn
floatXxY_t   -> intXxY_t       vcvt[q]_sX_fX   vcvt.sX.fX       fcvtzs
floatXxY_t   -> uintXxY_t      vcvt[q]_uX_fX   vcvt.uX.fX       fcvtzu
intXxY_t     -> floatXxY_t     vcvt[q]_fX_sX   vcvt.fX.sX       scvtf
uintXxY_t    -> floatXxY_t     vcvt[q]_fX_uX   vcvt.fX.uX       ucvtf
float32x2_t  -> float64x2_t    vcvt_f64_f32    2x vcvt.f64.f32  fcvtl
float64x2_t  -> float32x2_t    vcvt_f32_f64    2x vcvt.f32.f64  fcvtn

Clang optimizes vmovl to vshll by zero for some reason. 

float32x2_t <-> float64x2_t requires 2 VFP instructions on ARMv7-a.
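
As a usage sketch of the widening row above (my illustration; the generic
version is what I would like GCC to pattern-match to the same instruction):

#include <arm_neon.h>

uint64x2_t widen_intrinsic(uint32x2_t v)
{
    return vmovl_u32(v);           /* vmovl.u32 / ushll #0 */
}

typedef unsigned int       u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 widen_generic(u32x2 v)
{
    return (u64x2) { v[0], v[1] }; /* should become the same instruction */
}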

[Bug rtl-optimization/103641] New: [aarch64][11 regression] Severe compile time regression in SLP vectorize step

2021-12-10 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

Bug ID: 103641
   Summary: [aarch64][11 regression] Severe compile time
regression in SLP vectorize step
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Created attachment 51966
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51966&action=edit
aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c -ftime-report -ftime-report-details

While GCC 11.2 has been noticeably better at NEON64 code, with some files it
hangs for 15-30 seconds or more in the SLP vectorization step.

I haven't narrowed this down to a specific thing yet because I don't know much
about the GCC internals, but it is *extremely* noticeable in the xxHash
library. (https://github.com/Cyan4973/xxHash).

This is a test compiling xxhash.c from Git revision
a17161efb1d2de151857277628678b0e0b486155.

This was done on a Core i5-430m with 8GB RAM and an SSD on Debian Bullseye
amd64. GCC 10 (10.2.1-6) was from the repos, GCC 11 (11.2.0) was built from the
tarball with similar flags. While this may cause bias, the two compilers get
very similar times when the SLP vectorizer is off.

$ time aarch64-linux-gnu-gcc-10 -O3 -c xxhash.c

real0m3.596s
user0m3.270s
sys 0m0.149s
$ time aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c

real0m31.579s
user0m31.314s
sys 0m0.112s

When disabling the NEON intrinsics with `-DXXH_VECTOR=0`, it only takes ~21
seconds. 

Time variable                          usr           sys          wall           GGC
 phase opt and generate        :  31.46 ( 97%)   0.24 ( 32%)  31.80 ( 96%)    54M ( 63%)
 callgraph functions expansion :  31.01 ( 96%)   0.18 ( 24%)  31.29 ( 94%)    42M ( 49%)
 tree slp vectorization        :  28.35 ( 88%)   0.03 (  4%)  28.37 ( 85%)  9941k ( 11%)

 TOTAL                         :  32.34          0.75         33.20            86M

This is significantly worse on my Pi 4B, where an ARMv7->AArch64 build took 3
minutes, although I presume that is mostly due to being 32-bit and the CPU
being much slower.

[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step

2021-12-10 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #19 from Devin Hussey  ---
> The new costs on AArch64 have a vector multiplication cost of 4, which is 
> very reasonable.

Would this include multv2di3 by any chance?

Because another thing I noticed is that GCC is also trying to vectorize 64-bit
multiplies as if they were free, but it just ends up scalarizing them.

[Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

Bug ID: 103781
   Summary: [AArch64, 11 regr.] Failed partial vectorization of
mulv2di3
   Product: gcc
   Version: 11.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

As of GCC 11, the AArch64 backend is very greedy in trying to vectorize
mulv2di3. However, there is no mulv2di3 instruction pattern, so it ends up
extracting the elements from the vector and inserting the products back in.

The bad codegen should be obvious.

#include <stdint.h>

void fma_u64(uint64_t *restrict acc, const uint64_t *restrict x,
             const uint64_t *restrict y)
{
    for (int i = 0; i < 16384; i++) {
        acc[0] += *x++ * *y++;
        acc[1] += *x++ * *y++;
    }
}

gcc-11 -O3

fma_u64:
.LFB0:
.cfi_startproc
ldr q1, [x0]
add x6, x1, 262144
.p2align 3,,7
.L2:
ldr x4, [x1], 16
ldr x5, [x2], 16
ldr x3, [x1, -8]
mul x4, x4, x5
ldr x5, [x2, -8]
fmov    d0, x4
ins v0.d[1], x5
mul x3, x3, x5
ins v0.d[1], x3
add v1.2d, v1.2d, v0.2d
cmp x1, x6
bne .L2
str q1, [x0]
ret
.cfi_endproc

GCC 10.2.1 emits better code.

fma_u64:
.LFB0:
.cfi_startproc
ldp x4, x3, [x0]
add x9, x1, 262144
.p2align 3,,7
.L2:
ldr x8, [x1], 16
ldr x7, [x2], 16
ldr x6, [x1, -8]
ldr x5, [x2, -8]
madd    x4, x8, x7, x4
madd    x3, x6, x5, x3
cmp x9, x1
bne .L2
stp x4, x3, [x0]
ret
.cfi_endproc

However, the ideal code would be a 2-iteration unroll.

Side note: why not ldp in the loop?
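
Roughly, the unroll I have in mind (a C sketch, not compiler output; the helper
name is mine), which consumes four elements per pass so the loads can pair into
ldp:

#include <stdint.h>

void fma_u64_unrolled(uint64_t *restrict acc, const uint64_t *restrict x,
                      const uint64_t *restrict y)
{
    uint64_t a0 = acc[0], a1 = acc[1];
    /* Two iterations of the original loop per pass. */
    for (int i = 0; i < 16384; i += 2) {
        a0 += x[0] * y[0];
        a1 += x[1] * y[1];
        a0 += x[2] * y[2];
        a1 += x[3] * y[3];
        x += 4;
        y += 4;
    }
    acc[0] = a0;
    acc[1] = a1;
}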

[Bug target/103781] Cost model for SLP for aarch64 is not so good still

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #2 from Devin Hussey  ---
Yeah my bad, I meant SLP, I get them mixed up all the time.

[Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #4 from Devin Hussey  ---
Makes sense because the multiplier is what, 5 cycles on an A53?

[Bug target/110013] New: [i386] vector_size(8) on 32-bit ABI

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

Bug ID: 110013
   Summary: [i386] vector_size(8) on 32-bit ABI
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: husseydevin at gmail dot com
  Target Milestone: ---

Closely related to bug 86541, which was fixed on x64 only.

On 32-bit, GCC passes any vector_size(8) vectors to external functions in MMX
registers, similar to how it passes 16 byte vectors in SSE registers. 

This appears to be the only time that GCC will ever naturally generate an MMX
instruction.

This is only fine if you are using MMX intrinsics and are manually handling
_mm_empty().

Otherwise, if, say, you are porting over NEON code (where I found this issue)
using the vector_size intrinsics, this can cause some sneaky issues if your
function fails to inline:
1. Things will likely break because GCC doesn't handle MMX and x87 properly
   - Example of broken code (works with -mno-mmx):
https://godbolt.org/z/xafWPohKb
2. You will have a nasty performance toll, more than just a cdecl call, as GCC
doesn't actually know what to do with an MMX register and just spills it into
memory.
   - This especially can be seen when v2sf is used and it places the floats
into MMX registers.
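
A minimal sketch of the kind of function where this bites (my own example; the
godbolt link above has the actual reproducer):

typedef float f32x2 __attribute__((vector_size(8)));

/* Per the above: if this does not inline on -m32, the arguments arrive in MMX
   registers, GCC spills them to memory, and no emms is issued before any
   following x87 code. */
__attribute__((noinline))
f32x2 add2(f32x2 a, f32x2 b)
{
    return a + b;
}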

There are two options. The first is to use the weird ABI that Clang seems to
use:

| Type | SIMD | Params | Return  |
| float| base | stack  | ST0:ST1 |
| float| SSE  | XMM0-2 | XMM0|
| double   | all  | stack  | ST0 |
| long long/__m64  | all  | stack  | EAX:EDX |
| int, short, char | base | stack  | stack   |
| int, short, char | SSE2 | stack  | XMM0|

However, since the current ABIs aren't 100% compatible anyways, I think that a
much simpler solution is to just convert to SSE like x64 does, falling back to
the stack if SSE is not available.

Changing the ABI to this also allows us to port MMX with SSE (bug 86541) to
32-bit mode. If you REALLY need MMX intrinsics, you can't inline, and you don't
have SSE2, you can cope with a stack spill.

[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #1 from Devin Hussey  ---
As a side note, the official psABI does say that function call parameters use
MM0-MM2; if Clang follows its own rules instead, then the supposed stability of
the ABI is meaningless.

[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013

--- Comment #2 from Devin Hussey  ---
Scratch that. There is a somewhat easy way to fix this that follows the psABI
AND uses MMX with SSE.

Upon calling a function, we can have the following sequence:

func:
movdq2q  mm0, xmm0
movq mm1, [esp + n]
call mmx_func
movq2dq  xmm0, mm0
emms

Then, in the callee, this prologue (with a matching epilogue before the return):

mmx_func:
movq2dq   xmm0, mm0
movq2dq   xmm1, mm1
emms
...
movdq2q   mm0, xmm0
ret