https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598

            Bug ID: 91598
           Summary: [8/9/10 regression] 60% speed drop on neon intrinsic
                    loop
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkuvyrkov at gcc dot gnu.org
  Target Milestone: ---

Performance of the attached NEON loop drops on Cortex-A53 by about 60% between
GCC 7 and GCC 8.  Performance of trunk is the same as GCC 8.
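
The attached testcase is not inlined here; for context, a minimal kernel of
the same general shape might look like the following (illustrative only, not
the actual attachment): a run of consecutive base+offset NEON loads, each
feeding independent arithmetic.

#include <arm_neon.h>

/* Illustrative kernel, NOT the attached testcase: three consecutive
   base+offset loads followed by independent math on each vector.  */
void
scale3 (float *restrict dst, const float *restrict src,
        float32x4_t k, int n)
{
  for (int i = 0; i + 12 <= n; i += 12)
    {
      float32x4_t a = vld1q_f32 (src + i);      /* [rb + 0]  */
      float32x4_t b = vld1q_f32 (src + i + 4);  /* [rb + 16] */
      float32x4_t c = vld1q_f32 (src + i + 8);  /* [rb + 32] */
      vst1q_f32 (dst + i,     vmulq_f32 (a, k));
      vst1q_f32 (dst + i + 4, vmulq_f32 (b, k));
      vst1q_f32 (dst + i + 8, vmulq_f32 (c, k));
    }
}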

There are two separate changes, both related to the instruction scheduler,
that cause the regression.  The first change, r253235, is responsible for 70%
of the regression.
===
    haifa-sched: fix autopref_rank_for_schedule qsort comparator

            * haifa-sched.c (autopref_rank_for_schedule): Order
            'irrelevant' insns first, always call autopref_rank_data
            otherwise.



    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253235 138bc75d-0d04-0410-961f-82ee72b054a4
===

After this change, instead of
r1 = [rb + 0]
r2 = [rb + 8]
r3 = [rb + 16]
r4 = <math with r1>
r5 = <math with r2>
r6 = <math with r3>

we get
r1 = [rb + 0]
<math with r1>
r2 = [rb + 8]
<math with r2>
r3 = [rb + 16]
<math with r3>

which, apparently, the Cortex-A53 autoprefetcher doesn't recognize.  This
schedule happens because the r2 load gets a lower priority than the
"irrelevant" <math with r1> due to the above patch.

If we think about it, the fact that "r1 = [rb + 0]" can be scheduled means
that the true dependencies of all similar base+offset loads are resolved.
Therefore, for an autoprefetcher-friendly schedule we should prioritize memory
reads over "irrelevant" instructions.

On the other hand, following similar logic, we want to delay memory stores as
much as possible and start scheduling them only after all their potential
producers have been scheduled.  I.e., for an autoprefetcher-friendly schedule
we should prioritize "irrelevant" instructions over memory writes.

An obvious patch implementing the above is attached.  It brings back 70% of
the regressed performance on this testcase.

The second part of the regression is due to the compiler getting lucky when
scheduling the inline asms that represent the intrinsics.  After
===
    Set default sched pressure algorithm

    The Arm backend sets the default sched-pressure algorithm to
    SCHED_PRESSURE_MODEL.  Benchmarking on AArch64 shows this speeds up
    floating point performance on SPEC - eg. CactusBSSN improves by ~16%.
    The gains are mostly due to less spilling, so enable this on AArch64
    by default.

        gcc/
            * config/aarch64/aarch64.c (aarch64_override_options_internal):
            Set PARAM_SCHED_PRESSURE_ALGORITHM to SCHED_PRESSURE_MODEL.


    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@254378 138bc75d-0d04-0410-961f-82ee72b054a4
===
the compiler no longer gets lucky on this testcase.

The solution here is to convert the intrinsics in arm_neon.h to
builtins/UNSPECs and attach scheduler descriptions to the UNSPECs, along the
lines of the sketch below.
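
For illustration, assuming the GCC-7-era arm_neon.h style: an asm-based
intrinsic is opaque to the pipeline models, while a builtin (or plain vector
code) expands to a named RTL pattern that scheduler descriptions can match.
The my_* names below are made up.

#include <arm_neon.h>

/* Asm-based style: the scheduler sees an opaque asm block and the
   pipeline descriptions cannot classify it or assign it a latency.  */
static inline float32x4_t
my_vmlaq_f32_asm (float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4_t result;
  __asm__ ("fmla %0.4s, %2.4s, %3.4s"
           : "=w" (result)
           : "0" (a), "w" (b), "w" (c)
           : /* No clobbers */);
  return result;
}

/* Builtin/UNSPEC direction: once the operation is a named pattern,
   per-CPU pipeline models can attach units and latencies to it.
   (Written with vector extensions here; the actual fix would add
   target builtins backed by UNSPEC patterns.)  */
static inline float32x4_t
my_vmlaq_f32_builtin (float32x4_t a, float32x4_t b, float32x4_t c)
{
  return a + b * c;  /* may contract to fmla under -ffp-contract=fast */
}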
