On Sun, Jan 21, 2024 at 07:57:54PM +0530, Ajit Agarwal wrote:
> 
> Hello All:
> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> Added common infrastructure for load store fusion for
> different targets.
> 
> Common routines are refactored in fusion-common.h.
> 
> AARCH64 load/store fusion pass is not changed with the 
> common infrastructure.
> 
> For AARCH64 architectures just include "fusion-common.h"
> and target dependent code can be added to that.
> 
> 
> Alex/Richard:
> 
> If you would like me to add for AARCH64 I can do that for AARCH64.
> 
> If you would like to do that is fine with me.
> 
> Bootstrapped and regtested with powerpc64-linux-gnu.
> 
> Improvement in performance is seen with Spec 2017 spec FP benchmarks.

This patch is a lot better than the previous patch in that it generates fewer
extra instructions, and just replaces some of the load vector instructions with
load vector pair.

In compiling Spec 2017 with it, I see the following results:

Benchmarks that generate lxvp instead of lxv:

        500.perlbench_r         replace 10 LXVs with  5 LXVPs
        502.gcc_r               replace  2 LXVs with  1 LXVP
        510.parest_r            replace 28 LXVs with 14 LXVPs
        511.povray_r            replace  4 LXVs with  2 LXVPs
        521.wrf_r               replace 12 LXVs with  6 LXVPs
        527.cam4_r              replace 12 LXVs with  6 LXVPs
        557.xz_r                replace 10 LXVs with  5 LXVPs

A few of the benchmarks generated a different number of alignment NOPs,
depending on how prefixed addresses were generated.  I consider this minor
compared to the other changes.

        507.cactuBSSN_r         17 fewer alignment NOPs
        520.omnetpp_r          231 more  alignment NOPs
        523.xalancbmk_r        246 fewer alignment NOPs
        531.deepsjeng_r          2 more  alignment NOPs
        541.leela_r             28 more  alignment NOPs
        549.fotonik3d_r         27 more  alignment NOPs
        554.roms_r               8 more  alignment NOPs

However, there were two benchmarks where the code regressed.  In particular,
there are more vector loads and stores to the stack, which indicates that more
spilling is going on.

        525.x264_r              16 more  stack spills, but  84 LXVPs
        526.blender_r            4 more  stack spills, but 149 LXVPs

One benchmark actually generated fewer stack spills as well as generating
LXVPs.

        538.imagick_r           11 fewer stack spills, and  26 LXVPs

Note that these are changes in the static instructions generated.  They do not
show whether the changes help or hurt performance.
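
As an illustration of how static counts like these can be tallied, here is a
minimal Python sketch.  It is not the script used for the numbers above.  The
names count_mnemonics and count_delta are made up for the example, and it
assumes GCC-style assembly output where instructions are indented, labels
start in column 0, directives begin with '.', and '#' starts a comment.

```python
from collections import Counter


def count_mnemonics(asm_text):
    """Tally instruction mnemonics in assembly text (hypothetical helper).

    Assumes instructions are indented, labels start in column 0,
    directives begin with '.', and '#' starts a comment.
    """
    counts = Counter()
    for line in asm_text.splitlines():
        line = line.split("#", 1)[0]      # drop trailing comments
        if not line[:1].isspace():        # labels and empty lines
            continue
        tokens = line.split()
        if not tokens:                    # whitespace-only line
            continue
        mnemonic = tokens[0]
        if mnemonic.startswith(".") or mnemonic.endswith(":"):
            continue                      # directive or indented label
        counts[mnemonic] += 1
    return counts


def count_delta(before_asm, after_asm, mnemonics=("lxv", "lxvp", "nop")):
    """Return {mnemonic: after - before} for the selected mnemonics."""
    before = count_mnemonics(before_asm)
    after = count_mnemonics(after_asm)
    return {m: after[m] - before[m] for m in mnemonics}
```

Running count_delta on the assembly produced without and with the patch gives
the per-mnemonic change in static counts.  Like the tables above, it says
nothing about how often each instruction is executed.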

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com
