Re: [PATCH 0/6] aarch64: Implement TImode comparisons

2020-03-19 Thread Wilco Dijkstra
Hi Richard,

> Any compare can be done in at most 2 instructions:
> 
> void doit(void);
> void f(long long a)
> {
>     if (a <= 1)
>         doit();
> }
> 
> f:
> cmp r0, #2
> sbcs    r3, r1, #0
> blt .L4

> Well, this one requires that you be able to add 1 to an input and for that
> input to not overflow.  But you're right that I should be using this sequence
> for LT (not LE).

And for GE. For LE and GT swap operands and condition. You can safely
increment the immediate since small immediates that fit CMP have no
chance of overflowing and large immediates have to be split off anyway
(and then are treated like register variants).

Cheers,
Wilco


Re: [PATCH 0/6] aarch64: Implement TImode comparisons

2020-03-19 Thread Richard Henderson via Gcc-patches
On 3/19/20 8:47 AM, Wilco Dijkstra wrote:
> Hi Richard,
> 
> Thanks for these patches - yes TI mode expansions can certainly be improved!
> So looking at your expansions for signed compares, why not copy the optimal
> sequence from 32-bit Arm?
> 
> Any compare can be done in at most 2 instructions:
> 
> void doit(void);
> void f(long long a)
> {
>     if (a <= 1)
>         doit();
> }
> 
> f:
> cmp r0, #2
> sbcs    r3, r1, #0
> blt .L4

Well, this one requires that you be able to add 1 to an input and for that
input to not overflow.  But you're right that I should be using this sequence
for LT (not LE).

I'll have another look.


r~


Re: [PATCH 0/6] aarch64: Implement TImode comparisons

2020-03-19 Thread Wilco Dijkstra
Hi Richard,

Thanks for these patches - yes TI mode expansions can certainly be improved!
So looking at your expansions for signed compares, why not copy the optimal
sequence from 32-bit Arm?

Any compare can be done in at most 2 instructions:

void doit(void);
void f(long long a)
{
    if (a <= 1)
        doit();
}

f:
cmp r0, #2
sbcs    r3, r1, #0
blt .L4
bx  lr
.L4:
b   doit

Cheers,
Wilco

[PATCH 0/6] aarch64: Implement TImode comparisons

2020-03-18 Thread Richard Henderson via Gcc-patches
This is attacking case 3 of PR 94174.

The existing ccmp optimization happens at the gimple level,
which means that rtl expansion of TImode stuff cannot take
advantage of it.  But we can do even better than the existing
ccmp optimization.

This expansion is of similar size to our current branchful
expansion, but is all straight-line code.  I will assume in
general that the branch predictor will work better with
fewer branches.

E.g.

-  10:  b7f800a3  tbnz  x3, #63, 24 <__subvti3+0x24>
-  14:  eb02003f  cmp   x1, x2
-  18:  5400010c  b.gt  38 <__subvti3+0x38>
-  1c:  54000140  b.eq  44 <__subvti3+0x44>  // b.none
-  20:  d65f03c0  ret
-  24:  eb01005f  cmp   x2, x1
-  28:  548c  b.gt  38 <__subvti3+0x38>
-  2c:  54a1  b.ne  20 <__subvti3+0x20>  // b.any
-  30:  eb9f  cmp   x4, x0
-  34:  5469  b.ls  20 <__subvti3+0x20>  // b.plast
-  38:  a9bf7bfd  stp   x29, x30, [sp, #-16]!
-  3c:  910003fd  mov   x29, sp
-  40:  9400  bl    0
-  44:  eb04001f  cmp   x0, x4
-  48:  5488  b.hi  38 <__subvti3+0x38>  // b.pmore
-  4c:  d65f03c0  ret

+  10:  b7f800e3  tbnz  x3, #63, 2c <__subvti3+0x2c>
+  14:  eb01005f  cmp   x2, x1
+  18:  1a9fb7e2  cset  w2, ge  // ge = tcont
+  1c:  fa400080  ccmp  x4, x0, #0x0, eq  // eq = none
+  20:  7a40a844  ccmp  w2, #0x0, #0x4, ge  // ge = tcont
+  24:  54e0  b.eq  40 <__subvti3+0x40>  // b.none
+  28:  d65f03c0  ret
+  2c:  eb01005f  cmp   x2, x1
+  30:  1a9fc7e2  cset  w2, le
+  34:  fa400081  ccmp  x4, x0, #0x1, eq  // eq = none
+  38:  7a40d844  ccmp  w2, #0x0, #0x4, le
+  3c:  5460  b.eq  28 <__subvti3+0x28>  // b.none
+  40:  a9bf7bfd  stp   x29, x30, [sp, #-16]!
+  44:  910003fd  mov   x29, sp
+  48:  9400  bl    0

So one less insn, but 2 branches instead of 6.

As for the specific case of the PR,

void test_int128(__int128 a, uint64_t l)
{
    if ((__int128_t)a - l <= 1)
        doit();
}

0:  eb02  subs  x0, x0, x2
4:  da1f0021  sbc   x1, x1, xzr
8:  f13f  cmp   x1, #0x0
-   c:  544d  b.le  14
-  10:  d65f03c0  ret
-  14:  5461  b.ne  20   // b.any
-  18:  f100041f  cmp   x0, #0x1
-  1c:  54a8  b.hi  10   // b.pmore
+   c:  1a9fc7e1  cset  w1, le
+  10:  fa410801  ccmp  x0, #0x1, #0x1, eq  // eq = none
+  14:  7a40d824  ccmp  w1, #0x0, #0x4, le
+  18:  5441  b.ne  20   // b.any
+  1c:  d65f03c0  ret
   20:  1400  b     0


r~


Richard Henderson (6):
  aarch64: Add ucmp_*_carryinC patterns for all usub_*_carryinC
  aarch64: Adjust result of aarch64_gen_compare_reg
  aarch64: Accept 0 as first argument to compares
  aarch64: Simplify @ccmp operands
  aarch64: Improve nzcv argument to ccmp
  aarch64: Implement TImode comparisons

 gcc/config/aarch64/aarch64.c  | 304 --
 gcc/config/aarch64/aarch64-simd.md|  18 +-
 gcc/config/aarch64/aarch64-speculation.cc |   5 +-
 gcc/config/aarch64/aarch64.md | 280 ++--
 4 files changed, 429 insertions(+), 178 deletions(-)

-- 
2.20.1