[Bug target/91598] [8/9 regression] 60% speed drop on neon intrinsic loop

2021-05-05 Thread clyon at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598

--- Comment #8 from Christophe Lyon  ---
All intrinsics have been re-implemented, and I can see no asm version in
arm_neon.h as of r12-513.

[Bug target/91598] [8/9 regression] 60% speed drop on neon intrinsic loop

2021-01-11 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #7 from Wilco  ---
Fixed in GCC9 and trunk. Given the regression is so large, it is worth
backporting.

The other issue is that assembler statements (and likely any RTL instruction
without specified latencies) are badly scheduled. Honoring all known latencies
and assuming a reasonable default latency otherwise would be a better approach.

[Bug target/91598] [8/9 regression] 60% speed drop on neon intrinsic loop

2020-03-11 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598

Martin Liška  changed:

   What|Removed |Added

 CC||marxin at gcc dot gnu.org

--- Comment #6 from Martin Liška  ---
commit r10-7073-g0b8393221177617f19e7c5c5c692b8c59f85fffb
Author: Wilco Dijkstra 
Date:   Fri Mar 6 18:29:02 2020 +

[AArch64] Use intrinsics for widening multiplies (PR91598)

Inline assembler instructions don't have latency info and the scheduler
does
not attempt to schedule them at all - it does not even honor latencies of
asm source operands.  As a result, SIMD intrinsics which are implemented
using
inline assembler perform very poorly, particularly on in-order cores.
Add new patterns and intrinsics for widening multiplies, which results in a
63% speedup for the example in the PR, thus fixing the reported regression.

gcc/
PR target/91598
* config/aarch64/aarch64-builtins.c (TYPES_TERNOPU_LANE): Add
define.
* config/aarch64/aarch64-simd.md
(aarch64_vec_mult_lane): Add new insn for widening lane
mul.
(aarch64_vec_mlal_lane): Likewise.
* config/aarch64/aarch64-simd-builtins.def: Add intrinsics.
* config/aarch64/arm_neon.h:
(vmlal_lane_s16): Expand using intrinsics rather than inline asm.
(vmlal_lane_u16): Likewise.
(vmlal_lane_s32): Likewise.
(vmlal_lane_u32): Likewise.
(vmlal_laneq_s16): Likewise.
(vmlal_laneq_u16): Likewise.
(vmlal_laneq_s32): Likewise.
(vmlal_laneq_u32): Likewise.
(vmull_lane_s16): Likewise.
(vmull_lane_u16): Likewise.
(vmull_lane_s32): Likewise.
(vmull_lane_u32): Likewise.
(vmull_laneq_s16): Likewise.
(vmull_laneq_u16): Likewise.
(vmull_laneq_s32): Likewise.
(vmull_laneq_u32): Likewise.
* config/aarch64/iterators.md (Vcondtype): New iterator for lane
mul.
(Qlane): Likewise.