https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116312

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
FWIW, see the comment in aarch64_sve_adjust_stmt_cost for some of the problems
with costing LDP and STP correctly:

  /* Advanced SIMD can load and store pairs of registers using LDP and STP,
     but there are no equivalent instructions for SVE.  This means that
     (all other things being equal) 128-bit SVE needs twice as many load
     and store instructions as Advanced SIMD in order to process vector pairs.

     Also, scalar code can often use LDP and STP to access pairs of values,
     so it is too simplistic to say that one SVE load or store replaces
     VF scalar loads and stores.

     Ideally we would account for this in the scalar and Advanced SIMD
     costs by making suitable load/store pairs as cheap as a single
     load/store.  However, that would be a very invasive change and in
     practice it tends to stress other parts of the cost model too much.
     E.g. stores of scalar constants currently count just a store,
     whereas stores of vector constants count a store and a vec_init.
     This is an artificial distinction for AArch64, where stores of
     nonzero scalar constants need the same kind of register invariant
     as vector stores.

     An alternative would be to double the cost of any SVE loads and stores
     that could be paired in Advanced SIMD (and possibly also paired in
     scalar code).  But this tends to stress other parts of the cost model
     in the same way.  It also means that we can fall back to Advanced SIMD
     even if full-loop predication would have been useful.

     Here we go for a more conservative version: double the costs of SVE
     loads and stores if one iteration of the scalar loop processes enough
     elements for it to use a whole number of Advanced SIMD LDP or STP
     instructions.  This makes it very likely that the VF would be 1 for
     Advanced SIMD, and so no epilogue should be needed.  */
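
As a rough illustration of the last paragraph of that comment, the check
can be sketched as below.  This is not the code in
aarch64_sve_adjust_stmt_cost: the function name, its parameters and the
constant are hypothetical, and it assumes that one Advanced SIMD LDP or
STP moves two 128-bit Q registers (256 bits).  The idea is only that the
SVE load/store cost is doubled when the elements accessed by one scalar
iteration fill a whole number of such pairs.

```cpp
#include <cassert>

// Assumption: one Advanced SIMD LDP/STP transfers two 128-bit Q registers.
constexpr unsigned int kAdvSimdPairBits = 2 * 128;

// Hypothetical helper, not a GCC function.  STMT_COST is the unadjusted
// cost of an SVE load or store, ELEMS_PER_SCALAR_ITER is the number of
// elements that one iteration of the scalar loop accesses, and ELT_BITS
// is the size of each element in bits.
unsigned int
adjust_sve_mem_cost (unsigned int stmt_cost,
                     unsigned int elems_per_scalar_iter,
                     unsigned int elt_bits)
{
  assert (elt_bits > 0);
  unsigned int bits = elems_per_scalar_iter * elt_bits;

  // Double the cost only if the access fills a whole, nonzero number of
  // LDP/STP-sized chunks, i.e. Advanced SIMD could plausibly use VF 1.
  bool whole_pairs = bits != 0 && bits % kAdvSimdPairBits == 0;
  return whole_pairs ? stmt_cost * 2 : stmt_cost;
}
```

Under these assumptions, four 64-bit elements per scalar iteration
(256 bits) would have their cost doubled, whereas two 64-bit elements
(128 bits) or three 64-bit elements (192 bits) would be left unchanged.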
