Hi Craig:

Some general review comments:
- Split the new pass into a new file.
- Add a new option to enable/disable this pass.
- Could you extend this patch to support lw/sw/ld/sd/flw/fsw/fld/fsd?
  I think there is a lot of logic that can be shared in supporting the
  other types of compressed load/store instructions, so I'd like to see
  them all supported at once (see the sketch after this list).
- Do you have experimental data about doing this after register
  allocation/reload? I'd prefer doing such an optimization after RA,
  because there we can accurately estimate how many bytes we gain.
  I guess the cases where code size increases are because RA didn't
  assign a suitable src/dest or base register?
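
For the lw/sw/ld/sd/flw/fsw/fld/fsd point above: the 4-byte forms
(c.lw/c.sw/c.flw/c.fsw) and the 8-byte forms (c.ld/c.sd/c.fld/c.fsd)
mainly differ in the reachable offset range, since RVC scales the 5-bit
unsigned immediate by the access size. So that check could be shared,
e.g. (a rough sketch, the helper name is made up):

  /* Hypothetical helper: the largest offset reachable by a compressed
     load/store of MODE.  RVC scales the 5-bit immediate by the access
     size, giving 0..124 for 4-byte and 0..248 for 8-byte accesses.  */
  static HOST_WIDE_INT
  riscv_max_compressed_offset (machine_mode mode)
  {
    return (32 - 1) * GET_MODE_SIZE (mode).to_constant ();
  }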

On Fri, Sep 13, 2019 at 12:20 AM Craig Blackmore
<craig.blackm...@embecosm.com> wrote:
>
> This patch aims to allow more load/store instructions to be compressed by
> replacing a load/store of 'base register + large offset' with a new load/store
> of 'new base + small offset'. If the new base gets stored in a compressed
> register, then the new load/store can be compressed. Since there is an
> overhead in creating the new base, this change is only attempted when
> 'base register' is referenced in at least 4 load/stores in a basic block.
>
> The optimization is implemented in a new RISC-V specific pass called
> shorten_memrefs which is enabled for RVC targets. It has been developed for
> the 32-bit lw/sw instructions but could also be extended to 64-bit ld/sd in
> future.
>
> The patch saves 164 bytes (0.3%) on a proprietary application (59450 bytes
> compared to 59286 bytes) compiled for rv32imc bare metal with -Os. On the
> Embench benchmark suite (https://www.embench.org/) we see code size reductions
> of up to 18 bytes (0.7%) and only two cases where code size is increased
> slightly, by 2 bytes each:
>
> Embench results (.text size in bytes, excluding .rodata)
>
> Benchmark       Without patch  With patch  Diff
> aha-mont64      1052           1052        0
> crc32           232            232         0
> cubic           2446           2448        2
> edn             1454           1450        -4
> huffbench       1642           1642        0
> matmult-int     420            420         0
> minver          1056           1056        0
> nbody           714            714         0
> nettle-aes      2888           2884        -4
> nettle-sha256   5566           5564        -2
> nsichneu        15052          15052       0
> picojpeg        8078           8078        0
> qrduino         6140           6140        0
> sglib-combined  2444           2444        0
> slre            2438           2420        -18
> st              880            880         0
> statemate       3842           3842        0
> ud              702            702         0
> wikisort        4278           4280        2
> -------------------------------------------------
> Total           61324          61300       -24
>
> The patch has been tested on the following bare metal targets using QEMU
> and there were no regressions:
>
>   rv32i
>   rv32iac
>   rv32im
>   rv32imac
>   rv32imafc
>   rv64imac
>   rv64imafdc
>
> We noticed that sched2 undoes some of the addresses generated by this
> optimization and consequently increases code size, therefore this patch adds a
> check in sched-deps.c to avoid changes that are expected to increase code size
> when not optimizing for speed. Since this change touches target-independent
> code, the patch has been bootstrapped and tested on x86 with no regressions.
>
> gcc/ChangeLog
>
>         * config/riscv/riscv.c (tree-pass.h): New include.
>         (cfg.h): Likewise.
>         (context.h): Likewise.
>         (riscv_compressed_reg_p): New function.
>         (riscv_compressed_lw_address_p): Likewise.
>         (riscv_legitimize_address): Attempt to convert base + large_offset
>         to compressible new_base + small_offset.
>         (riscv_address_cost): Make anticipated compressed load/stores
>         cheaper for code size than uncompressed load/stores.
>         (class pass_shorten_memrefs): New pass.
>         (pass_shorten_memrefs::execute): Likewise.
>         (make_pass_shorten_memrefs): Likewise.
>         (riscv_option_override): Register shorten_memrefs pass for
>         TARGET_RVC.
>         (riscv_register_priority): Move compressed register check to
>         riscv_compressed_reg_p.
>         * sched-deps.c (attempt_change): When optimizing for code size
>         don't make change if it increases code size.
>
> ---
>  gcc/config/riscv/riscv.c | 179 +++++++++++++++++++++++++++++++++++++++++++++--
>  gcc/sched-deps.c         |  10 +++
>  2 files changed, 183 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
> index 39bf87a..e510314 100644
> --- a/gcc/config/riscv/riscv.c
> +++ b/gcc/config/riscv/riscv.c
> @@ -55,6 +55,9 @@ along with GCC; see the file COPYING3.  If not see
>  #include "diagnostic.h"
>  #include "builtins.h"
>  #include "predict.h"
> +#include "tree-pass.h"
> +#include "cfg.h"
> +#include "context.h"
>
>  /* True if X is an UNSPEC wrapper around a SYMBOL_REF or LABEL_REF.  */
>  #define UNSPEC_ADDRESS_P(X)                                    \
> @@ -848,6 +851,44 @@ riscv_legitimate_address_p (machine_mode mode, rtx x, bool strict_p)
>    return riscv_classify_address (&addr, x, mode, strict_p);
>  }
>
> +/* Return true if hard reg REGNO can be used in compressed instructions.  */
> +
> +static bool
> +riscv_compressed_reg_p (int regno)
> +{
> +  /* x8-x15/f8-f15 are compressible registers.  */
> +  return (TARGET_RVC && (IN_RANGE (regno, GP_REG_FIRST + 8, GP_REG_FIRST + 15)
> +         || IN_RANGE (regno, FP_REG_FIRST + 8, FP_REG_FIRST + 15)));
> +}
> +
> +/* Return true if load/store from/to address x can be compressed.  */
> +
> +static bool
> +riscv_compressed_lw_address_p (rtx x)
> +{
> +  struct riscv_address_info addr;
> +  bool result = riscv_classify_address (&addr, x, GET_MODE (x),
> +                                       reload_completed);
> +
> +  /* Before reload, assuming all load/stores of valid addresses get
> +     compressed gives better code size than checking if the address is
> +     reg + small_offset early on.  */
> +  if (result && !reload_completed)
> +    return true;
> +
> +  /* Return false if address is not compressed_reg + small_offset.  */
> +  if (!result
> +      || addr.type != ADDRESS_REG
> +      || (!riscv_compressed_reg_p (REGNO (addr.reg))
> +           && addr.reg != stack_pointer_rtx)
> +      || !CONST_INT_P (addr.offset)
> +      || (INTVAL (addr.offset) & 3) != 0
> +      || !IN_RANGE (INTVAL (addr.offset), 0, 124))
> +    return false;
> +
> +  return result;
> +}
> +
>  /* Return the number of instructions needed to load or store a value
>     of mode MODE at address X.  Return 0 if X isn't valid for MODE.
>     Assume that multiword moves may need to be split into word moves
> @@ -1318,7 +1359,9 @@ riscv_legitimize_address (rtx x, rtx oldx ATTRIBUTE_UNUSED,
>    if (riscv_split_symbol (NULL, x, mode, &addr))
>      return riscv_force_address (addr, mode);
>
> -  /* Handle BASE + OFFSET using riscv_add_offset.  */
> +  /* When optimizing for size, try to convert BASE + LARGE_OFFSET into
> +     NEW_BASE + SMALL_OFFSET to allow possible compressed load/store, 
> otherwise,
> +     handle BASE + OFFSET using riscv_add_offset.  */
>    if (GET_CODE (x) == PLUS && CONST_INT_P (XEXP (x, 1))
>        && INTVAL (XEXP (x, 1)) != 0)
>      {
> @@ -1327,7 +1370,24 @@ riscv_legitimize_address (rtx x, rtx oldx ATTRIBUTE_UNUSED,
>
>        if (!riscv_valid_base_register_p (base, mode, false))
>         base = copy_to_mode_reg (Pmode, base);
> -      addr = riscv_add_offset (NULL, base, offset);
> +      if (optimize_function_for_size_p (cfun)
> +         && (strcmp (current_pass->name, "shorten_memrefs") == 0)
> +         && mode == SImode
> +         && (offset & 3) == 0
> +         && !IN_RANGE (offset, 0, 124))

I think the offset check can be relaxed here, since you can put the
excess offset into HIGH, and the remaining offset for the compressed
load/store can still be aligned to fit the instruction format.

e.g.
  lw a1, 125(a0)
  lw a2, 129(a0)
->
  addi a3, a0, 125
  lw a1, 0(a3)
  lw a2, 4(a3)

> +       {
> +         rtx high;
> +
> +         /* Leave OFFSET as a 7-bit offset and put the excess in HIGH.  */
> +         high = GEN_INT (offset & ~124);
> +         offset &= 124;

124 appears 4 times in this patch like a magic number. I know it comes
from (((2^5) << 2) - 1) & ~0x3 = 124, but we should find some way to
make this more readable.
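
For example, a named constant would document where the limit comes from;
just a sketch, the macro name is made up:

  /* Hypothetical constant: the largest offset reachable by c.lw/c.sw,
     i.e. a 5-bit unsigned immediate scaled by 4, giving 0..124.  */
  #define CSW_MAX_OFFSET (((HOST_WIDE_INT) 4 << 5) - 4)

and then e.g. the range check reads !IN_RANGE (offset, 0, CSW_MAX_OFFSET).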

> +         if (!SMALL_OPERAND (INTVAL (high)))
> +           high = force_reg (Pmode, high);
> +         base = force_reg (Pmode, gen_rtx_PLUS (Pmode, high, base));
> +         addr = plus_constant (Pmode, base, offset);
> +       }
> +      else
> +       addr = riscv_add_offset (NULL, base, offset);

Could you split this logic out into a new function?
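
Something along these lines, maybe (the function name is made up and
this is untested):

  /* Rebase BASE + OFFSET so that the residual offset fits a compressed
     lw/sw, and return the new address.  */
  static rtx
  riscv_shorten_lw_offset (rtx base, HOST_WIDE_INT offset)
  {
    rtx high;

    /* Leave OFFSET as a 7-bit offset and put the excess in HIGH.  */
    high = GEN_INT (offset & ~124);
    offset &= 124;
    if (!SMALL_OPERAND (INTVAL (high)))
      high = force_reg (Pmode, high);
    base = force_reg (Pmode, gen_rtx_PLUS (Pmode, high, base));
    return plus_constant (Pmode, base, offset);
  }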

>        return riscv_force_address (addr, mode);
>      }
>
> @@ -1812,7 +1872,10 @@ riscv_address_cost (rtx addr, machine_mode mode,
>                     addr_space_t as ATTRIBUTE_UNUSED,
>                     bool speed ATTRIBUTE_UNUSED)
>  {
> -  return riscv_address_insns (addr, mode, false);
> +  if (!speed && mode == SImode
> +      && riscv_compressed_lw_address_p (addr))
> +    return 1;
> +  return !speed + riscv_address_insns (addr, mode, false);
>  }
>
>  /* Return one word of double-word value OP.  HIGH_P is true to select the
> @@ -4541,6 +4604,106 @@ riscv_init_machine_status (void)
>    return ggc_cleared_alloc<machine_function> ();
>  }
>
> +namespace {
> +
> +const pass_data pass_data_shorten_memrefs =
> +{
> +  RTL_PASS, /* type */
> +  "shorten_memrefs", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_shorten_memrefs : public rtl_opt_pass
> +{
> +public:
> +  pass_shorten_memrefs (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_shorten_memrefs, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *) { return optimize > 0; }
> +  virtual unsigned int execute (function *);
> +
> +}; // class pass_shorten_memrefs
> +
> +/* Try to make more use of compressed load and store instructions by
> +   replacing a load/store at address BASE + LARGE_OFFSET with a new
> +   load/store at address NEW BASE + SMALL OFFSET.  If NEW BASE is stored in
> +   a compressed register, the load/store can be compressed.  Since creating
> +   NEW BASE incurs an overhead, the change is only attempted when BASE is
> +   referenced by at least four load/stores in the same basic block.  */
> +unsigned int
> +pass_shorten_memrefs::execute (function *fn)
> +{
> +  typedef int_hash <HOST_WIDE_INT, 0> regno_hash;
> +  typedef hash_map <regno_hash, int> regno_map;
> +
> +  basic_block bb;
> +  rtx_insn *insn;
> +
> +  regstat_init_n_sets_and_refs ();
> +
> +  FOR_ALL_BB_FN (bb, fn)
> +  {
> +    regno_map *m = hash_map<regno_hash, int>::create_ggc (10);
> +    for (int pass = 0; !optimize_bb_for_speed_p (bb) && pass < 2; pass++)

It seems like the first pass collects info and the second pass does the
transform, but you do the transform in the first pass too. Splitting the
two passes into two separate parts would be more readable.
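
i.e. roughly (analyze_bb/transform_bb are made-up names for the two
halves of the current loop body):

  FOR_ALL_BB_FN (bb, fn)
    {
      if (optimize_bb_for_speed_p (bb))
        continue;
      /* Part 1: count how many SImode load/store addresses in this
         block use each base register.  */
      regno_map *m = analyze_bb (bb);
      /* Part 2: rewrite the addresses whose base register is referenced
         by at least four load/stores.  */
      transform_bb (bb, m);
    }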

> +      FOR_BB_INSNS (bb, insn)
> +       {
> +         if (!NONJUMP_INSN_P (insn))
> +           continue;
> +         rtx pat = PATTERN (insn);
> +         if (GET_CODE (pat) != SET)
> +           continue;
> +         start_sequence ();
> +         for (int i = 0; i < 2; i++)
> +           {
> +             rtx mem = XEXP (pat, i);
> +             if (MEM_P (mem) && GET_MODE (mem) == SImode)
> +               {
> +                 rtx addr = XEXP (mem, 0);
> +                 if (GET_CODE (addr) != PLUS)
> +                   continue;
> +                 if (!REG_P (XEXP (addr, 0)))
> +                   continue;
> +                 HOST_WIDE_INT regno = REGNO (XEXP (addr, 0));
> +                 if (REG_N_REFS (regno) < 4)
> +                   continue;
> +                 if (pass == 0)
> +                   m->get_or_insert (regno)++;
> +                 else if (m->get_or_insert (regno) > 3)
> +                   {
> +                     addr
> +                       = riscv_legitimize_address (addr, addr, GET_MODE (mem));
> +                     XEXP (pat, i) = replace_equiv_address (mem, addr);
> +                     df_insn_rescan (insn);
> +                   }
> +               }
> +           }
> +         rtx_insn *seq = get_insns ();
> +         end_sequence ();
> +         emit_insn_before (seq, insn);
> +       }
> +
> +  }
> +  regstat_free_n_sets_and_refs ();
> +
> +  return 0;
> +}
> +
> +} // anon namespace
> +
> +opt_pass *
> +make_pass_shorten_memrefs (gcc::context *ctxt)
> +{
> +  return new pass_shorten_memrefs (ctxt);
> +}
> +
>  /* Implement TARGET_OPTION_OVERRIDE.  */
>
>  static void
> @@ -4637,6 +4800,10 @@ riscv_option_override (void)
>      error ("%<-mriscv-attribute%> RISC-V ELF attribute requires GNU as 2.32"
>            " [%<-mriscv-attribute%>]");
>  #endif
> +
> +  if (TARGET_RVC)
> +    register_pass (make_pass_shorten_memrefs (g),
> +                  PASS_POS_INSERT_AFTER, "store_motion", 1);
>  }

GCC has a new interface for registering target-specific optimization
passes. You can create a new file riscv-passes.def, like
aarch64-passes.def/i386-passes.def, and add
PASSES_EXTRA += $(srcdir)/config/riscv/riscv-passes.def to t-riscv.
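
The .def file would then contain a single line along these lines
(untested, but this is how aarch64-passes.def registers its passes):

  INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);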

>
>  /* Implement TARGET_CONDITIONAL_REGISTER_USAGE.  */
> @@ -4676,9 +4843,9 @@ riscv_conditional_register_usage (void)
>  static int
>  riscv_register_priority (int regno)
>  {
> -  /* Favor x8-x15/f8-f15 to improve the odds of RVC instruction selection.  */
> -  if (TARGET_RVC && (IN_RANGE (regno, GP_REG_FIRST + 8, GP_REG_FIRST + 15)
> -                    || IN_RANGE (regno, FP_REG_FIRST + 8, FP_REG_FIRST + 15)))
> +  /* Favor compressed registers to improve the odds of RVC instruction
> +     selection.  */
> +  if (riscv_compressed_reg_p (regno))
>      return 1;
>
>    return 0;
> diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> index 52db3cc..92a0893 100644
> --- a/gcc/sched-deps.c
> +++ b/gcc/sched-deps.c
> @@ -38,6 +38,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "sched-int.h"
>  #include "params.h"
>  #include "cselib.h"
> +#include "predict.h"
>
>  #ifdef INSN_SCHEDULING
>
> @@ -4707,6 +4708,15 @@ attempt_change (struct mem_inc_info *mii, rtx new_addr)
>    rtx mem = *mii->mem_loc;
>    rtx new_mem;
>
> +  /* When not optimizing for speed, avoid changes that are expected to
> +     make code size larger.  */
> +  addr_space_t as = MEM_ADDR_SPACE (mem);
> +  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (mii->mem_insn));
> +  int old_cost = address_cost (XEXP (mem, 0), GET_MODE (mem), as, speed);
> +  int new_cost = address_cost (new_addr, GET_MODE (mem), as, speed);
> +  if (new_cost > old_cost && !speed)

I think the !speed check should not be needed here; if worse code is
generated, that means address_cost is incorrect. But since this change
affects all other targets, I think it would be better to split it into a
separate patch and CC the related reviewers.


> +    return NULL_RTX;
> +
>    /* Jump through a lot of hoops to keep the attributes up to date.  We
>       do not want to call one of the change address variants that take
>       an offset even though we know the offset in many cases.  These
>
