Hi all,

The scheduling automata sizes are getting out of control (as the PR
complains), and the Cortex-A8 one is among the largest offenders.
Some low-hanging fruit here are the FP/NEON operations that have very
large reservation durations specified for them.  They bloat the state
space considerably, and it's unlikely that the program has enough
parallelism to fill, for example, the 64 cycles that are modelled for
double-precision division.  In the past we've dealt with this by
decreasing the modelled reservation duration to keep the state space
down.

This patch does that for the cortex_a8_neon automaton and caps the
modelled cortex_a8_vfplite reservation at 15 cycles in each affected
reservation.  The instruction latencies themselves are unchanged, so
this should still be plenty to tell the scheduler that these are
high-latency instructions.
With this patch the number of NDFA states drops by more than 70%
(26796 -> 6020).
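
For anyone who wants to check the numbers, here is a small Python sketch
of the capping and of the state-count arithmetic.  It is only an
illustration: the helper names (cap_reservation, state_reduction) are
made up for this mail and are not part of GCC or genautomata, and the
regex assumes the simple "unit*N" form used by the reservations touched
here.

```python
import re

CAP = 15  # cap applied by the patch to the repeated-unit count


def cap_reservation(reservation: str, cap: int = CAP) -> str:
    """Clamp every `unit*N` repeat count in a reservation string to `cap`.

    Hypothetical helper: it only handles the plain `unit*N` form, not the
    full reservation regexp syntax accepted in machine descriptions.
    """
    def clamp(match: re.Match) -> str:
        unit, count = match.group(1), int(match.group(2))
        return f"{unit}*{min(count, cap)}"

    return re.sub(r"(\w+)\*(\d+)", clamp, reservation)


def state_reduction(before: int, after: int) -> float:
    """Return the percentage reduction in automaton states."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Reservation string of cortex_a8_vfp_divd before the patch.
    print(cap_reservation("cortex_a8_vfp,cortex_a8_vfplite*64"))
    # NDFA state counts quoted above.
    print(round(state_reduction(26796, 6020), 1))
```

Reservations that are already at or below the cap (for example
cortex_a8_vfp_muls with cortex_a8_vfplite*11) come out unchanged, which
matches the patch leaving them alone.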

As I don't have access to reasonable Cortex-A8 hardware, I benchmarked
the patch with SPEC2000 on a Cortex-A15.  The idea (from Ramana) is
that, since Cortex-A8 tuning is the default tuning for armv7-a, the
patch shouldn't hurt the more widely accessible Cortex-A15 targets.
There were no performance regressions there.

Bootstrapped and tested on arm-none-linux-gnueabihf.
Ok for trunk?

Thanks,
Kyrill

2016-08-26  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

    PR target/70473
    * config/arm/cortex-a8-neon.md (cortex_a8_vfp_muld): Reduce
    reservation duration to 15 cycles.
    (cortex_a8_vfp_macs): Likewise.
    (cortex_a8_vfp_macd): Likewise.
    (cortex_a8_vfp_divs): Likewise.
    (cortex_a8_vfp_divd): Likewise.
diff --git a/gcc/config/arm/cortex-a8-neon.md b/gcc/config/arm/cortex-a8-neon.md
index 45f861f6c6f840bd113e468eeec5373e06058f6d..b16c29974a7278e70d64dc83b5b388aebb51718b 100644
--- a/gcc/config/arm/cortex-a8-neon.md
+++ b/gcc/config/arm/cortex-a8-neon.md
@@ -357,30 +357,34 @@ (define_insn_reservation "cortex_a8_vfp_muls" 12
        (eq_attr "type" "fmuls"))
   "cortex_a8_vfp,cortex_a8_vfplite*11")
 
+;; Don't model a reservation for more than 15 cycles as this explodes the
+;; state space of the automaton for little gain.  It is unlikely that the
+;; scheduler will find enough instructions to hide the full latency of the
+;; instructions.
 (define_insn_reservation "cortex_a8_vfp_muld" 17
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmuld"))
-  "cortex_a8_vfp,cortex_a8_vfplite*16")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_macs" 21
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmacs,ffmas"))
-  "cortex_a8_vfp,cortex_a8_vfplite*20")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_macd" 26
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmacd,ffmad"))
-  "cortex_a8_vfp,cortex_a8_vfplite*25")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_divs" 37
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fdivs, fsqrts"))
-  "cortex_a8_vfp,cortex_a8_vfplite*36")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_divd" 65
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fdivd, fsqrtd"))
-  "cortex_a8_vfp,cortex_a8_vfplite*64")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 ;; Comparisons can actually take 7 cycles sometimes instead of four,
 ;; but given all the other instructions lumped into type=ffarith that
