[Bug target/95650] aarch64: Missed optimization storing addition of two shorts

2020-06-12 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95650

--- Comment #1 from Andrew Pinski  ---
I think clang is wrong here.  The ABI says the bits above the low 16 are
undefined, so they could be set and you could get a wraparound.

2020-06-12 Thread pinskia at gcc dot gnu.org

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|ABI                         |

--- Comment #2 from Andrew Pinski  ---
Or maybe it does not matter after all, but only for the add case.

The generic optimization here is:
subreg:QI((A & 0xff) + (B & 0xff))
can be optimized to subreg:QI(A + B).
Likewise for HI/0xffff, etc.
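
As a rough illustration of the QImode identity above, here is a minimal C sketch that checks it exhaustively over all low-byte pairs. The function names and the idea of modelling a register as a `uint32_t` with arbitrary upper bits are assumptions made for this example; they are not taken from GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Models subreg:QI((A & 0xff) + (B & 0xff)): mask first, then take the low byte. */
uint8_t add_masked_qi(uint32_t a, uint32_t b)
{
    return (uint8_t)((a & 0xffu) + (b & 0xffu));
}

/* Models subreg:QI(A + B): add the full registers, then take the low byte. */
uint8_t add_plain_qi(uint32_t a, uint32_t b)
{
    return (uint8_t)(a + b);
}

/* Counts disagreements between the two forms over every pair of low bytes,
   with a caller-chosen (hypothetical) garbage pattern in the upper bits. */
unsigned count_qi_mismatches(uint32_t garbage_a, uint32_t garbage_b)
{
    unsigned mismatches = 0;
    for (uint32_t lo_a = 0; lo_a < 256; lo_a++) {
        for (uint32_t lo_b = 0; lo_b < 256; lo_b++) {
            uint32_t a = (garbage_a & ~0xffu) | lo_a;
            uint32_t b = (garbage_b & ~0xffu) | lo_b;
            if (add_masked_qi(a, b) != add_plain_qi(a, b))
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling `count_qi_mismatches` with any two garbage patterns should return 0, since the masks only remove multiples of 256 that the final truncation discards anyway.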

Let me see if I can get this.

2020-06-12 Thread acoplan at gcc dot gnu.org

--- Comment #3 from Alex Coplan  ---
I think clang's optimisation is sound here.

C says that we add two shorts as int and then truncate to short (i.e. reduce
mod 2^16).

The question is whether the top bits being set (which the ABI allows) can
influence the result. I don't think it can.

The observation is that the "top bits being set" are just extra multiples of
2^16 in the addition, which just disappear when we reduce mod 2^16. That is:

(x_1 + x_2 + y_1 + y_2) % 2^16 = (x_1 + x_2) % 2^16

where x_1,x_2 are arbitrary integers and y_1,y_2 are multiples of 2^16 (the top
bits).
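
To make the modular argument concrete, here is a small C sketch that compares the two computations on register values whose upper 16 bits hold junk. The function names and the use of `uint32_t` to stand in for a 32-bit register are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* The sequence with explicit narrowing: zero-extend each operand from
   16 bits, add, then truncate the sum to 16 bits. */
uint16_t add_with_extends(uint32_t reg_a, uint32_t reg_b)
{
    uint32_t a = reg_a & 0xffffu;   /* drop whatever is in bits 16..31 */
    uint32_t b = reg_b & 0xffffu;
    return (uint16_t)(a + b);
}

/* The sequence without narrowing: add the whole registers and keep only
   the low 16 bits of the sum. */
uint16_t add_without_extends(uint32_t reg_a, uint32_t reg_b)
{
    return (uint16_t)(reg_a + reg_b);
}
```

For example, with `reg_a = 0xabcd1234` and `reg_b = 0x5678ff00`, both functions return `0x1134`: the upper halves only contribute multiples of 2^16, which the truncation removes.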

2020-06-12 Thread wilco at gcc dot gnu.org

Wilco  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-06-12
     Ever confirmed|0                           |1
                 CC|                            |wilco at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW

--- Comment #4 from Wilco  ---
(In reply to Alex Coplan from comment #3)
> I think clang's optimisation is sound here.
> 
> C says that we add two shorts as int and then truncate to short (i.e. reduce
> mod 2^16).
> 
> The question is whether the top bits being set (which the ABI allows) can
> influence the result. I don't think it can.
> 
> The observation is that the "top bits being set" are just extra multiples of
> 2^16 in the addition, which just disappear when we reduce mod 2^16. That is:
> 
> (x_1 + x_2 + y_1 + y_2) % 2^16 = (x_1 + x_2) % 2^16
> 
> where x_1,x_2 are arbitrary integers and y_1,y_2 are multiples of 2^16 (the
> top bits).

Confirmed. It works for signed as well and any operator except right shift and
division. Basically the store requires only the bottom 16 bits to be valid, and
a backwards dataflow can propagate this to remove unnecessary zero and sign
extends.
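
The claim about which operators tolerate junk in the upper bits can be spot-checked with a sketch like the following. It uses unsigned arithmetic only (to stay clear of signed-overflow undefined behaviour), and the enum, the function names, and the choice to take the shift count from the low four bits of the second operand are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

enum op { OP_ADD, OP_SUB, OP_MUL, OP_AND, OP_OR, OP_XOR, OP_SHL, OP_SHR, OP_DIV };

/* Applies one operator on full 32-bit values. */
uint32_t apply(enum op op, uint32_t a, uint32_t b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    case OP_XOR: return a ^ b;
    case OP_SHL: return a << (b & 15u);
    case OP_SHR: return a >> (b & 15u);
    case OP_DIV: return b != 0 ? a / b : 0;
    }
    return 0;
}

/* Returns 1 if the low 16 bits of the result are the same with and without
   junk placed in bits 16..31 of both operands, 0 otherwise. */
int low16_unaffected(enum op op, uint16_t lo_a, uint16_t lo_b, uint16_t garbage)
{
    uint32_t dirty_a = ((uint32_t)garbage << 16) | lo_a;
    uint32_t dirty_b = ((uint32_t)garbage << 16) | lo_b;
    return (uint16_t)apply(op, lo_a, lo_b) == (uint16_t)apply(op, dirty_a, dirty_b);
}
```

With `lo_a = 0x1234`, `lo_b = 5` and `garbage = 0xbeef`, the check passes for add, subtract, multiply, the bitwise operators and left shift, and fails for right shift and division, where upper bits flow downwards into the low half of the result.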

2021-05-30 Thread pinskia at gcc dot gnu.org via Gcc-bugs

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
              Build|aarch64-none-linux-gnu      |
               Host|aarch64-none-linux-gnu      |
   Last reconfirmed|2020-06-12 00:00:00         |2021-5-30

--- Comment #5 from Andrew Pinski  ---
The problem is when combine comes in we get:
Trying 3, 8 -> 10:
    3: r94:SI=zero_extend(x1:HI)
      REG_DEAD x1:HI
    8: r96:SI=zero_extend(x0:HI)+r94:SI
      REG_DEAD x0:HI
      REG_DEAD r94:SI
   10: [r98:DI]=r96:SI#0
      REG_DEAD r98:DI
      REG_DEAD r96:SI
Failed to match this instruction:
(set (mem:HI (reg:DI 98) [1 *ptr_5(D)+0 S2 A16])
    (plus:HI (reg:HI 1 x1 [ b ])
        (reg:HI 0 x0 [ a ])))


There is no plus:HI pattern for aarch64, so nothing matches if we try this as
a 3->2 combination.
I don't know whether combine could be enhanced here to widen the operation to
SImode when no HImode plus pattern exists.
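
For readers without the original report to hand, a source function along the following lines would plausibly produce a dump like the one above. This is a hypothetical reconstruction: the names `a`, `b` and `ptr` appear in the dump, but the function name `store_sum`, the parameter order, and the `unsigned short` types (guessed from the `zero_extend`) are assumptions.

```c
#include <assert.h>

/* Hypothetical test case: a arrives in x0, b in x1, and the sum is
   stored as a 16-bit value through ptr. */
void store_sum(unsigned short a, unsigned short b, unsigned short *ptr)
{
    /* a and b are promoted to int, added, and the result is truncated
       back to 16 bits by the store. */
    *ptr = a + b;
}
```

Since only the low 16 bits reach memory, the explicit `zero_extend` of each argument before the add is, in principle, unnecessary here.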

2021-06-01 Thread rearnsha at gcc dot gnu.org via Gcc-bugs

--- Comment #6 from Richard Earnshaw  ---
AArch32 is able to produce the optimal sequence because the ABI specifies
caller widening of parameters.  For safety reasons AArch64 takes the opposite
approach and requires the callee to narrow arguments.  

Sadly, because this isn't handled at the gimple level, it has to be detected
during RTL optimization.

Yes, the optimization is sound because bits 16 and above in the input
values cannot affect the lower 16 bits of the result of an addition.

It's also likely that the expanders cannot see enough of what is going on to
transform this efficiently either.