On Mon, Apr 26, 2021 at 01:45:30PM +0200, Richard Biener wrote: > On Mon, Apr 26, 2021 at 12:21 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Fri, Apr 23, 2021 at 09:12:10AM +0200, Richard Biener wrote: > > > On Fri, Apr 23, 2021 at 1:35 AM H.J. Lu via Gcc-patches > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > For op_by_pieces operations between two areas of memory on a non-strict > > > > alignment target, add -foverlap-op-by-pieces=[off|on|max-memset] to > > > > generate overlapping operations to minimize the number of operations if it > > > > is not a stack push, which must not overlap. > > > > > > > > When operating on LENGTH bytes of memory, -foverlap-op-by-pieces=on > > > > starts with the widest usable integer size, MAX_SIZE, for LENGTH bytes > > > > and finishes with the smallest usable integer size, MIN_SIZE, for the > > > > remaining bytes, where MAX_SIZE >= MIN_SIZE. If MIN_SIZE > the remaining > > > > bytes, the last operation is performed on MIN_SIZE bytes of overlapping > > > > memory from the previous operation. > > > > > > > > For memset with a non-zero byte, -foverlap-op-by-pieces=max-memset > > > > generates > > > > an overlapping fill with MAX_SIZE if the number of the remaining bytes > > > > is > > > > greater than one. > > > > > > > > Tested on Linux/x86-64 with -foverlap-op-by-pieces both enabled and > > > > disabled by default. > > > > > > Neither the user documentation nor the patch description tells me what > > > "generate overlapping operations" does. I _suspect_ it's doing an > > > offset-adjusted read/write of the last piece of a memory region to > > > avoid doing more than one smaller operation. Thus for a region > > > of size 7 and 4-byte granular ops you'd do operations at > > > offsets 0 and 3 rather than one at 0, a two-byte at offset 4 and > > > a one-byte at offset 6. > > That is correct. > > > > > > > > When the tail is of power-of-two size you still generate non-overlapping > > > ops? > > > > No. 
> > > > > > > > For memmove there's a correctness issue so you have to make sure > > > to first load the last two ops before performing the stores, which > > > increases register pressure. > > > > I believe memmove is handled correctly. > > > > > > > > I'm not sure we want a -f option to control this - not all targets will > > > be able to support this. So I'd use a target hook or rather extend > > > the existing use_by_pieces_infrastructure_p hook with an alternate > > > return (some flags bitmask I guess). We do have one extra > > > target hook, compare_by_pieces_branch_ratio, so by that precedent using > > > an alternate hook might also be OK. > > > > Fixed. > > > > > > > > Adding a -m option in targets that want this user-controllable would > > > be OK of course. > > > > > > > I can add a -m option to control this if needed. > > > > Here is the v2 patch. Changes are: > > > > 1. Added a target hook, TARGET_OVERLAP_OP_BY_PIECES_P. > > 2. Added a pointer argument to pieces_addr::adjust to pass the RTL > > information from the previous iteration to m_constfn. > > 3. Updated builtin_memset_read_str and builtin_memset_gen_str to > > generate the new RTL from the previous RTL info. > > > > OK for master branch? > > > > Thanks. > > > > H.J. > > ---- > > Add an overlap_op_by_pieces_p target hook for op_by_pieces operations > > between two areas of memory to generate one offset-adjusted operation > > in the smallest integer mode for the remaining bytes on the last piece > > operation of a memory region, to avoid doing more than one smaller > > operation. > > > > Pass the RTL information from the previous iteration to m_constfn in > > the op_by_pieces operation so that builtin_memset_[read|gen]_str can > > generate the new RTL from the previous RTL. > > > > Tested on Linux/x86-64. > > > > gcc/ > > > > PR middle-end/90773 > > * builtins.c (builtin_memcpy_read_str): Add a dummy argument. > > (builtin_strncpy_read_str): Likewise. 
> > (builtin_memset_read_str): Add an argument for the previous RTL > > information and generate the new RTL from the previous RTL info. > > (builtin_memset_gen_str): Likewise. > > * builtins.h (builtin_strncpy_read_str): Update the prototype. > > (builtin_memset_read_str): Likewise. > > * expr.c (by_pieces_ninsns): If targetm.overlap_op_by_pieces_p() > > returns true, round up size and alignment to the widest integer > > mode for maximum size. > > (pieces_addr::adjust): Add a pointer to by_pieces_prev argument > > and pass it to m_constfn. > > (op_by_pieces_d): Add get_usable_mode, m_push and > > m_overlap_op_by_pieces. > > (op_by_pieces_d::op_by_pieces_d): Add a bool argument to > > initialize m_push. Initialize m_overlap_op_by_pieces with > > targetm.overlap_op_by_pieces_p (). > > (op_by_pieces_d::get_usable_mode): New. > > (op_by_pieces_d::run): Use get_usable_mode to get the largest > > usable integer mode, pass the previous info to pieces_addr::adjust > > and generate overlapping operations if m_overlap_op_by_pieces is > > true. > > (PUSHG_P): New. > > (move_by_pieces_d::move_by_pieces_d): Updated for op_by_pieces_d > > change. > > (store_by_pieces_d::store_by_pieces_d): Updated for op_by_pieces_d > > change. > > (can_store_by_pieces): Use by_pieces_constfn on constfun. > > (store_by_pieces): Use by_pieces_constfn on constfun. Updated > > for op_by_pieces_d change. > > (clear_by_pieces_1): Add a dummy argument. > > (clear_by_pieces): Updated for op_by_pieces_d change. > > (compare_by_pieces_d::compare_by_pieces_d): Likewise. > > (string_cst_read_str): Add a dummy argument. > > * expr.h (by_pieces_constfn): Add a dummy argument. > > (by_pieces_prev): New. > > * target.def (overlap_op_by_pieces_p): New target hook. > > * config/i386/i386.c (TARGET_OVERLAP_OP_BY_PIECES_P): New. > > * doc/tm.texi.in: Add TARGET_OVERLAP_OP_BY_PIECES_P. > > * doc/tm.texi: Regenerated. > > > > gcc/testsuite/ > > > > PR middle-end/90773 > > * g++.dg/pr90773-1.h: New test. 
> > * g++.dg/pr90773-1a.C: Likewise. > > * g++.dg/pr90773-1b.C: Likewise. > > * g++.dg/pr90773-1c.C: Likewise. > > * g++.dg/pr90773-1d.C: Likewise. > > * gcc.target/i386/pr90773-1.c: Likewise. > > * gcc.target/i386/pr90773-2.c: Likewise. > > * gcc.target/i386/pr90773-3.c: Likewise. > > * gcc.target/i386/pr90773-4.c: Likewise. > > * gcc.target/i386/pr90773-5.c: Likewise. > > * gcc.target/i386/pr90773-6.c: Likewise. > > * gcc.target/i386/pr90773-7.c: Likewise. > > * gcc.target/i386/pr90773-8.c: Likewise. > > * gcc.target/i386/pr90773-9.c: Likewise. > > * gcc.target/i386/pr90773-10.c: Likewise. > > * gcc.target/i386/pr90773-11.c: Likewise. > > * gcc.target/i386/pr90773-12.c: Likewise. > > * gcc.target/i386/pr90773-13.c: Likewise. > > * gcc.target/i386/pr90773-14.c: Likewise. > > --- > > gcc/builtins.c | 47 +++++- > > gcc/builtins.h | 6 +- > > gcc/config/i386/i386.c | 3 + > > gcc/doc/tm.texi | 7 + > > gcc/doc/tm.texi.in | 2 + > > gcc/expr.c | 171 ++++++++++++++++----- > > gcc/expr.h | 10 +- > > gcc/target.def | 9 ++ > > gcc/testsuite/g++.dg/pr90773-1.h | 14 ++ > > gcc/testsuite/g++.dg/pr90773-1a.C | 13 ++ > > gcc/testsuite/g++.dg/pr90773-1b.C | 5 + > > gcc/testsuite/g++.dg/pr90773-1c.C | 5 + > > gcc/testsuite/g++.dg/pr90773-1d.C | 19 +++ > > gcc/testsuite/gcc.target/i386/pr90773-1.c | 17 ++ > > gcc/testsuite/gcc.target/i386/pr90773-10.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-11.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-12.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-13.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-14.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-2.c | 20 +++ > > gcc/testsuite/gcc.target/i386/pr90773-3.c | 23 +++ > > gcc/testsuite/gcc.target/i386/pr90773-4.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-5.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-6.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-7.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-8.c | 13 ++ > > 
gcc/testsuite/gcc.target/i386/pr90773-9.c | 13 ++ > > 27 files changed, 457 insertions(+), 49 deletions(-) > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1.h > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1a.C > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1b.C > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1c.C > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1d.C > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-1.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-10.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-11.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-12.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-13.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-14.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-2.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-3.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-4.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-5.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-6.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-7.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-8.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-9.c > > > > diff --git a/gcc/builtins.c b/gcc/builtins.c > > index 8c5324bf7de..efb449f6c22 100644 > > --- a/gcc/builtins.c > > +++ b/gcc/builtins.c > > @@ -128,7 +128,6 @@ static rtx expand_builtin_va_copy (tree); > > static rtx inline_expand_builtin_bytecmp (tree, rtx); > > static rtx expand_builtin_strcmp (tree, rtx); > > static rtx expand_builtin_strncmp (tree, rtx, machine_mode); > > -static rtx builtin_memcpy_read_str (void *, HOST_WIDE_INT, > > scalar_int_mode); > > static rtx expand_builtin_memchr (tree, rtx); > > static rtx expand_builtin_memcpy (tree, rtx); > > static rtx expand_builtin_memory_copy_args (tree dest, tree src, tree len, > > @@ -145,7 +144,6 @@ static rtx 
expand_builtin_stpcpy (tree, rtx, > > machine_mode); > > static rtx expand_builtin_stpncpy (tree, rtx); > > static rtx expand_builtin_strncat (tree, rtx); > > static rtx expand_builtin_strncpy (tree, rtx); > > -static rtx builtin_memset_gen_str (void *, HOST_WIDE_INT, scalar_int_mode); > > static rtx expand_builtin_memset (tree, rtx, machine_mode); > > static rtx expand_builtin_memset_args (tree, tree, tree, rtx, > > machine_mode, tree); > > static rtx expand_builtin_bzero (tree); > > @@ -3860,7 +3858,7 @@ expand_builtin_strnlen (tree exp, rtx target, > > machine_mode target_mode) > > a target constant. */ > > > > static rtx > > -builtin_memcpy_read_str (void *data, HOST_WIDE_INT offset, > > +builtin_memcpy_read_str (void *data, void *, HOST_WIDE_INT offset, > > scalar_int_mode mode) > > { > > /* The REPresentation pointed to by DATA need not be a nul-terminated > > @@ -6373,7 +6371,7 @@ expand_builtin_stpncpy (tree exp, rtx) > > constant. */ > > > > rtx > > -builtin_strncpy_read_str (void *data, HOST_WIDE_INT offset, > > +builtin_strncpy_read_str (void *data, void *, HOST_WIDE_INT offset, > > scalar_int_mode mode) > > { > > const char *str = (const char *) data; > > @@ -6584,12 +6582,33 @@ expand_builtin_strncpy (tree exp, rtx target) > > > > /* Callback routine for store_by_pieces. Read GET_MODE_BITSIZE (MODE) > > bytes from constant string DATA + OFFSET and return it as target > > - constant. */ > > + constant. If PREV isn't nullptr, it has the RTL info from the > > + previous iteration. */ > > > > rtx > > -builtin_memset_read_str (void *data, HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > +builtin_memset_read_str (void *data, void *prevp, > > + HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > scalar_int_mode mode) > > { > > + by_pieces_prev *prev = (by_pieces_prev *) prevp; > > + if (prev != nullptr && prev->data != nullptr) > > + { > > + /* Use the previous data in the same mode. 
*/ > > + if (prev->mode == mode) > > + return prev->data; > > + > > + /* Truncate the previous constant integer to a const_wide_int in > > + MODE. */ > > + if (!CONST_SCALAR_INT_P (prev->data)) > > + gcc_unreachable (); > > + > > + rtx_mode_t op = rtx_mode_t (prev->data, prev->mode); > > + wide_int result = wide_int::from (op, > > + GET_MODE_PRECISION (prev->mode), > > + UNSIGNED); > > + return immed_wide_int_const (result, mode); > > Isn't this premature optimization? The original code should arrive at the > very same CONST_INT, no? ...
Removed. > > > + } > > + > > const char *c = (const char *) data; > > char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode)); > > > > @@ -6601,16 +6620,28 @@ builtin_memset_read_str (void *data, HOST_WIDE_INT > > offset ATTRIBUTE_UNUSED, > > /* Callback routine for store_by_pieces. Return the RTL of a register > > containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned > > char value given in the RTL register data. For example, if mode is > > - 4 bytes wide, return the RTL for 0x01010101*data. */ > > + 4 bytes wide, return the RTL for 0x01010101*data. If PREV isn't > > + nullptr, it has the RTL info from the previous iteration. */ > > > > static rtx > > -builtin_memset_gen_str (void *data, HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > +builtin_memset_gen_str (void *data, void *prevp, > > + HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > scalar_int_mode mode) > > { > > rtx target, coeff; > > size_t size; > > char *p; > > > > + by_pieces_prev *prev = (by_pieces_prev *) prevp; > > + if (prev != nullptr && prev->data != nullptr) > > + { > > + /* Use the previous data in the same mode. */ > > + if (prev->mode == mode) > > + return prev->data; > > + > > + return simplify_gen_subreg (mode, prev->data, prev->mode, 0); > > + } > > + > > ... compared to here where CSE might have difficulties creating the very > same subreg. 
> > > size = GET_MODE_SIZE (mode); > > if (size == 1) > > return (rtx) data; > > diff --git a/gcc/builtins.h b/gcc/builtins.h > > index 307a20fbadb..e71f40c300a 100644 > > --- a/gcc/builtins.h > > +++ b/gcc/builtins.h > > @@ -110,8 +110,10 @@ extern void expand_builtin_update_setjmp_buf (rtx); > > extern tree mathfn_built_in (tree, enum built_in_function fn); > > extern tree mathfn_built_in (tree, combined_fn); > > extern tree mathfn_built_in_type (combined_fn); > > -extern rtx builtin_strncpy_read_str (void *, HOST_WIDE_INT, > > scalar_int_mode); > > -extern rtx builtin_memset_read_str (void *, HOST_WIDE_INT, > > scalar_int_mode); > > +extern rtx builtin_strncpy_read_str (void *, void *, HOST_WIDE_INT, > > + scalar_int_mode); > > +extern rtx builtin_memset_read_str (void *, void *, HOST_WIDE_INT, > > + scalar_int_mode); > > extern rtx expand_builtin_saveregs (void); > > extern tree std_build_builtin_va_list (void); > > extern tree std_fn_abi_va_list (tree); > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > > index adcef1e98bf..68f33f96f5a 100644 > > --- a/gcc/config/i386/i386.c > > +++ b/gcc/config/i386/i386.c > > @@ -23538,6 +23538,9 @@ ix86_run_selftests (void) > > #undef TARGET_ADDRESS_COST > > #define TARGET_ADDRESS_COST ix86_address_cost > > > > +#undef TARGET_OVERLAP_OP_BY_PIECES_P > > +#define TARGET_OVERLAP_OP_BY_PIECES_P hook_bool_void_true > > + > > #undef TARGET_FLAGS_REGNUM > > #define TARGET_FLAGS_REGNUM FLAGS_REG > > #undef TARGET_FIXED_CONDITION_CODE_REGS > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > > index 823f85ba9ab..8631435f271 100644 > > --- a/gcc/doc/tm.texi > > +++ b/gcc/doc/tm.texi > > @@ -6758,6 +6758,13 @@ in code size, for example where the number of insns > > emitted to perform a > > move would be greater than that of a library call. 
> > @end deftypefn > > > > +@deftypefn {Target Hook} bool TARGET_OVERLAP_OP_BY_PIECES_P (void) > > +This target hook should return true if, when the @code{by_pieces} > > +infrastructure is used, an offset adjusted memory operation in the > > .. offset adjusted unaligned memory operation .. > > might be worth noting that the access will be unaligned. Fixed. > > > +smallest integer mode for the last piece operation of a memory region > > +can be generated to avoid doing more than one smaller operation. > > +@end deftypefn > > + > > @deftypefn {Target Hook} int TARGET_COMPARE_BY_PIECES_BRANCH_RATIO > > (machine_mode @var{mode}) > > When expanding a block comparison in MODE, gcc can try to reduce the > > number of branches at the expense of more memory operations. This hook > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > > index 2321a5fc4e0..4ac3452278e 100644 > > --- a/gcc/doc/tm.texi.in > > +++ b/gcc/doc/tm.texi.in > > @@ -4586,6 +4586,8 @@ If you don't define this, a reasonable default is > > used. > > > > @hook TARGET_USE_BY_PIECES_INFRASTRUCTURE_P > > > > +@hook TARGET_OVERLAP_OP_BY_PIECES_P > > + > > @hook TARGET_COMPARE_BY_PIECES_BRANCH_RATIO > > > > @defmac MOVE_MAX_PIECES > > diff --git a/gcc/expr.c b/gcc/expr.c > > index a0e19465965..b5b96ea1185 100644 > > --- a/gcc/expr.c > > +++ b/gcc/expr.c > > @@ -815,12 +815,27 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned > > int align, > > unsigned int max_size, by_pieces_operation op) > > { > > unsigned HOST_WIDE_INT n_insns = 0; > > + scalar_int_mode mode; > > + > > + if (targetm.overlap_op_by_pieces_p () && op != COMPARE_BY_PIECES) > > + { > > + /* NB: Round up L and ALIGN to the widest integer mode for > > + MAX_SIZE. 
*/ > > + mode = widest_int_mode_for_size (max_size); > > + if (optab_handler (mov_optab, mode) != CODE_FOR_nothing) > > + { > > + unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode)); > > + if (up > l) > > + l = up; > > + align = GET_MODE_ALIGNMENT (mode); > > why's this necessary? It looks like alignment_for_piecewise_move will > properly honor slow_unaligned_access which had better be false for > overlap_op_by_pieces Without it, by_pieces_ninsns will return 19, instead of 16, for length 255 on x86-64. > > > + } > > + } > > > > align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align); > > > > while (max_size > 1 && l > 0) > > { > > - scalar_int_mode mode = widest_int_mode_for_size (max_size); > > + mode = widest_int_mode_for_size (max_size); > > enum insn_code icode; > > > > unsigned int modesize = GET_MODE_SIZE (mode); > > @@ -888,7 +903,8 @@ class pieces_addr > > void *m_cfndata; > > public: > > pieces_addr (rtx, bool, by_pieces_constfn, void *); > > - rtx adjust (scalar_int_mode, HOST_WIDE_INT); > > + rtx adjust (scalar_int_mode, HOST_WIDE_INT, > > + by_pieces_prev * = nullptr); > > void increment_address (HOST_WIDE_INT); > > void maybe_predec (HOST_WIDE_INT); > > void maybe_postinc (HOST_WIDE_INT); > > @@ -990,10 +1006,12 @@ pieces_addr::decide_autoinc (machine_mode ARG_UNUSED > > (mode), bool reverse, > > but we still modify the MEM's properties. */ > > > > rtx > > -pieces_addr::adjust (scalar_int_mode mode, HOST_WIDE_INT offset) > > +pieces_addr::adjust (scalar_int_mode mode, HOST_WIDE_INT offset, > > + by_pieces_prev *prev) > > { > > if (m_constfn) > > - return m_constfn (m_cfndata, offset, mode); > > + /* Pass the previous data to m_constfn. 
*/ > > + return m_constfn (m_cfndata, prev, offset, mode); > > if (m_obj == NULL_RTX) > > return NULL_RTX; > > if (m_auto) > > @@ -1041,6 +1059,9 @@ pieces_addr::maybe_postinc (HOST_WIDE_INT size) > > > > class op_by_pieces_d > > { > > + private: > > + scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int); > > + > > protected: > > pieces_addr m_to, m_from; > > unsigned HOST_WIDE_INT m_len; > > @@ -1048,6 +1069,10 @@ class op_by_pieces_d > > unsigned int m_align; > > unsigned int m_max_size; > > bool m_reverse; > > + /* True if this is a stack push. */ > > + bool m_push; > > + /* True if targetm.overlap_op_by_pieces_p () returns true. */ > > + bool m_overlap_op_by_pieces; > > > > /* Virtual functions, overriden by derived classes for the specific > > operation. */ > > @@ -1059,7 +1084,7 @@ class op_by_pieces_d > > > > public: > > op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *, > > - unsigned HOST_WIDE_INT, unsigned int); > > + unsigned HOST_WIDE_INT, unsigned int, bool); > > void run (); > > }; > > > > @@ -1074,10 +1099,11 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool > > to_load, > > by_pieces_constfn from_cfn, > > void *from_cfn_data, > > unsigned HOST_WIDE_INT len, > > - unsigned int align) > > + unsigned int align, bool push) > > : m_to (to, to_load, NULL, NULL), > > m_from (from, from_load, from_cfn, from_cfn_data), > > - m_len (len), m_max_size (MOVE_MAX_PIECES + 1) > > + m_len (len), m_max_size (MOVE_MAX_PIECES + 1), > > + m_push (push) > > { > > int toi = m_to.get_addr_inc (); > > int fromi = m_from.get_addr_inc (); > > @@ -1106,6 +1132,27 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load, > > > > align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align); > > m_align = align; > > + > > + m_overlap_op_by_pieces = targetm.overlap_op_by_pieces_p (); > > +} > > + > > +/* This function returns the largest usable integer mode for LEN bytes > > + whose size is no bigger than size of MODE. 
*/ > > + > > +scalar_int_mode > > +op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len) > > +{ > > + unsigned int size; > > + do > > + { > > + size = GET_MODE_SIZE (mode); > > + if (len >= size && prepare_mode (mode, m_align)) > > + break; > > + /* NB: widest_int_mode_for_size checks SIZE > 1. */ > > + mode = widest_int_mode_for_size (size); > > + } > > + while (1); > > + return mode; > > } > > > > /* This function contains the main loop used for expanding a block > > @@ -1116,42 +1163,80 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool > > to_load, > > void > > op_by_pieces_d::run () > > { > > - while (m_max_size > 1 && m_len > 0) > > + if (m_len == 0) > > + return; > > + > > + /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1. */ > > + scalar_int_mode mode = widest_int_mode_for_size (m_max_size); > > + mode = get_usable_mode (mode, m_len); > > + > > + by_pieces_prev to_prev = { nullptr, mode }; > > + by_pieces_prev from_prev = { nullptr, mode }; > > + > > + do > > splitting out the refactoring from a while to a do-while loop might ease > review. Will do. I will submit the v3 patch set. Thanks. > > Otherwise it looks OK - looks like my memmove concern is moot since > by-pieces isn't used for memmove anyway. > > Richard. > H.J.