On Mon, Apr 26, 2021 at 01:45:30PM +0200, Richard Biener wrote: > On Mon, Apr 26, 2021 at 12:21 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Fri, Apr 23, 2021 at 09:12:10AM +0200, Richard Biener wrote: > > > On Fri, Apr 23, 2021 at 1:35 AM H.J. Lu via Gcc-patches > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > For op_by_pieces operations between two areas of memory on a non-strict > > > > alignment target, add -foverlap-op-by-pieces=[off|on|max-memset] to > > > > generate overlapping operations to minimize the number of operations if it > > > > is not a stack push, which must not overlap. > > > > > > > > When operating on LENGTH bytes of memory, -foverlap-op-by-pieces=on > > > > starts with the widest usable integer size, MAX_SIZE, for LENGTH bytes > > > > and finishes with the smallest usable integer size, MIN_SIZE, for the > > > > remaining bytes, where MAX_SIZE >= MIN_SIZE. If MIN_SIZE > the remaining > > > > bytes, the last operation is performed on MIN_SIZE bytes of overlapping > > > > memory from the previous operation. > > > > > > > > For memset with a non-zero byte, -foverlap-op-by-pieces=max-memset > > > > generates > > > > an overlapping fill with MAX_SIZE if the number of the remaining bytes > > > > is > > > > greater than one. > > > > > > > > Tested on Linux/x86-64 with -foverlap-op-by-pieces both enabled and > > > > disabled by default. > > > > > > Neither the user documentation nor the patch description tells me what > > > "generate overlapping operations" does. I _suspect_ it's doing an > > > offset-adjusted read/write of the last piece of a memory region to > > > avoid doing more than one smaller operation. Thus for a region > > > of size 7 and 4-byte granular ops you'd do operations at > > > offsets 0 and 3 rather than one at 0, a two-byte at offset 4 and > > > a one-byte at offset 6. > > That is correct. > > > > > > > > When the tail is of power-of-two size you still generate non-overlapping > > > ops? > > > > No. 
> > > > > > > > For memmove there's a correctness issue so you have to make sure > > > to first load the last two ops before performing the stores, which > > > increases register pressure. > > > > I believe memmove is handled correctly. > > > > > > > > I'm not sure we want a -f option to control this - not all targets will > > > be able to support this. So I'd use a target hook or rather extend > > > the existing use_by_pieces_infrastructure_p hook with an alternate > > > return (some flags bitmask I guess). We do have one extra > > > target hook, compare_by_pieces_branch_ratio, so by that precedent using > > > an alternate hook might also be OK. > > > > Fixed. > > > > > > > > Adding a -m option in targets that want this user-controllable would > > > be OK of course. > > > > > > > I can add a -m option to control this if needed. > > > > Here is the v2 patch. Changes are: > > > > 1. Added a target hook, TARGET_OVERLAP_OP_BY_PIECES_P. > > 2. Added a pointer argument to pieces_addr::adjust to pass the RTL > > information from the previous iteration to m_constfn. > > 3. Updated builtin_memset_read_str and builtin_memset_gen_str to > > generate the new RTL from the previous RTL info. > > > > OK for master branch? > > > > Thanks. > > > > H.J. > > ---- > > Add an overlap_op_by_pieces_p target hook for op_by_pieces operations > > between two areas of memory to generate one offset-adjusted operation > > in the smallest integer mode for the remaining bytes on the last piece > > operation of a memory region, to avoid doing more than one smaller > > operation. > > > > Pass the RTL information from the previous iteration to m_constfn in > > the op_by_pieces operation so that builtin_memset_[read|gen]_str can > > generate the new RTL from the previous RTL. > > > > Tested on Linux/x86-64. > > > > gcc/ > > > > PR middle-end/90773 > > * builtins.c (builtin_memcpy_read_str): Add a dummy argument. > > (builtin_strncpy_read_str): Likewise. 
> > (builtin_memset_read_str): Add an argument for the previous RTL > > information and generate the new RTL from the previous RTL info. > > (builtin_memset_gen_str): Likewise. > > * builtins.h (builtin_strncpy_read_str): Update the prototype. > > (builtin_memset_read_str): Likewise. > > * expr.c (by_pieces_ninsns): If targetm.overlap_op_by_pieces_p() > > returns true, round up size and alignment to the widest integer > > mode for maximum size. > > (pieces_addr::adjust): Add a pointer to by_pieces_prev argument > > and pass it to m_constfn. > > (op_by_pieces_d): Add get_usable_mode, m_push and > > m_overlap_op_by_pieces. > > (op_by_pieces_d::op_by_pieces_d): Add a bool argument to > > initialize m_push. Initialize m_overlap_op_by_pieces with > > targetm.overlap_op_by_pieces_p (). > > (op_by_pieces_d::get_usable_mode): New. > > (op_by_pieces_d::run): Use get_usable_mode to get the largest > > usable integer mode, pass the previous info to pieces_addr::adjust > > and generate overlapping operations if m_overlap_op_by_pieces is > > true. > > (PUSHG_P): New. > > (move_by_pieces_d::move_by_pieces_d): Updated for op_by_pieces_d > > change. > > (store_by_pieces_d::store_by_pieces_d): Updated for op_by_pieces_d > > change. > > (can_store_by_pieces): Use by_pieces_constfn on constfun. > > (store_by_pieces): Use by_pieces_constfn on constfun. Updated > > for op_by_pieces_d change. > > (clear_by_pieces_1): Add a dummy argument. > > (clear_by_pieces): Updated for op_by_pieces_d change. > > (compare_by_pieces_d::compare_by_pieces_d): Likewise. > > (string_cst_read_str): Add a dummy argument. > > * expr.h (by_pieces_constfn): Add a dummy argument. > > (by_pieces_prev): New. > > * target.def (overlap_op_by_pieces_p): New target hook. > > * config/i386/i386.c (TARGET_OVERLAP_OP_BY_PIECES_P): New. > > * doc/tm.texi.in: Add TARGET_OVERLAP_OP_BY_PIECES_P. > > * doc/tm.texi: Regenerated. > > > > gcc/testsuite/ > > > > PR middle-end/90773 > > * g++.dg/pr90773-1.h: New test. 
> > * g++.dg/pr90773-1a.C: Likewise. > > * g++.dg/pr90773-1b.C: Likewise. > > * g++.dg/pr90773-1c.C: Likewise. > > * g++.dg/pr90773-1d.C: Likewise. > > * gcc.target/i386/pr90773-1.c: Likewise. > > * gcc.target/i386/pr90773-2.c: Likewise. > > * gcc.target/i386/pr90773-3.c: Likewise. > > * gcc.target/i386/pr90773-4.c: Likewise. > > * gcc.target/i386/pr90773-5.c: Likewise. > > * gcc.target/i386/pr90773-6.c: Likewise. > > * gcc.target/i386/pr90773-7.c: Likewise. > > * gcc.target/i386/pr90773-8.c: Likewise. > > * gcc.target/i386/pr90773-9.c: Likewise. > > * gcc.target/i386/pr90773-10.c: Likewise. > > * gcc.target/i386/pr90773-11.c: Likewise. > > * gcc.target/i386/pr90773-12.c: Likewise. > > * gcc.target/i386/pr90773-13.c: Likewise. > > * gcc.target/i386/pr90773-14.c: Likewise. > > --- > > gcc/builtins.c | 47 +++++- > > gcc/builtins.h | 6 +- > > gcc/config/i386/i386.c | 3 + > > gcc/doc/tm.texi | 7 + > > gcc/doc/tm.texi.in | 2 + > > gcc/expr.c | 171 ++++++++++++++++----- > > gcc/expr.h | 10 +- > > gcc/target.def | 9 ++ > > gcc/testsuite/g++.dg/pr90773-1.h | 14 ++ > > gcc/testsuite/g++.dg/pr90773-1a.C | 13 ++ > > gcc/testsuite/g++.dg/pr90773-1b.C | 5 + > > gcc/testsuite/g++.dg/pr90773-1c.C | 5 + > > gcc/testsuite/g++.dg/pr90773-1d.C | 19 +++ > > gcc/testsuite/gcc.target/i386/pr90773-1.c | 17 ++ > > gcc/testsuite/gcc.target/i386/pr90773-10.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-11.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-12.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-13.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-14.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-2.c | 20 +++ > > gcc/testsuite/gcc.target/i386/pr90773-3.c | 23 +++ > > gcc/testsuite/gcc.target/i386/pr90773-4.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-5.c | 13 ++ > > gcc/testsuite/gcc.target/i386/pr90773-6.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-7.c | 11 ++ > > gcc/testsuite/gcc.target/i386/pr90773-8.c | 13 ++ > > 
gcc/testsuite/gcc.target/i386/pr90773-9.c | 13 ++ > > 27 files changed, 457 insertions(+), 49 deletions(-) > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1.h > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1a.C > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1b.C > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1c.C > > create mode 100644 gcc/testsuite/g++.dg/pr90773-1d.C > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-1.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-10.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-11.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-12.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-13.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-14.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-2.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-3.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-4.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-5.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-6.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-7.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-8.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-9.c > > > > diff --git a/gcc/builtins.c b/gcc/builtins.c > > index 8c5324bf7de..efb449f6c22 100644 > > --- a/gcc/builtins.c > > +++ b/gcc/builtins.c > > @@ -128,7 +128,6 @@ static rtx expand_builtin_va_copy (tree); > > static rtx inline_expand_builtin_bytecmp (tree, rtx); > > static rtx expand_builtin_strcmp (tree, rtx); > > static rtx expand_builtin_strncmp (tree, rtx, machine_mode); > > -static rtx builtin_memcpy_read_str (void *, HOST_WIDE_INT, > > scalar_int_mode); > > static rtx expand_builtin_memchr (tree, rtx); > > static rtx expand_builtin_memcpy (tree, rtx); > > static rtx expand_builtin_memory_copy_args (tree dest, tree src, tree len, > > @@ -145,7 +144,6 @@ static rtx 
expand_builtin_stpcpy (tree, rtx, > > machine_mode); > > static rtx expand_builtin_stpncpy (tree, rtx); > > static rtx expand_builtin_strncat (tree, rtx); > > static rtx expand_builtin_strncpy (tree, rtx); > > -static rtx builtin_memset_gen_str (void *, HOST_WIDE_INT, scalar_int_mode); > > static rtx expand_builtin_memset (tree, rtx, machine_mode); > > static rtx expand_builtin_memset_args (tree, tree, tree, rtx, > > machine_mode, tree); > > static rtx expand_builtin_bzero (tree); > > @@ -3860,7 +3858,7 @@ expand_builtin_strnlen (tree exp, rtx target, > > machine_mode target_mode) > > a target constant. */ > > > > static rtx > > -builtin_memcpy_read_str (void *data, HOST_WIDE_INT offset, > > +builtin_memcpy_read_str (void *data, void *, HOST_WIDE_INT offset, > > scalar_int_mode mode) > > { > > /* The REPresentation pointed to by DATA need not be a nul-terminated > > @@ -6373,7 +6371,7 @@ expand_builtin_stpncpy (tree exp, rtx) > > constant. */ > > > > rtx > > -builtin_strncpy_read_str (void *data, HOST_WIDE_INT offset, > > +builtin_strncpy_read_str (void *data, void *, HOST_WIDE_INT offset, > > scalar_int_mode mode) > > { > > const char *str = (const char *) data; > > @@ -6584,12 +6582,33 @@ expand_builtin_strncpy (tree exp, rtx target) > > > > /* Callback routine for store_by_pieces. Read GET_MODE_BITSIZE (MODE) > > bytes from constant string DATA + OFFSET and return it as target > > - constant. */ > > + constant. If PREV isn't nullptr, it has the RTL info from the > > + previous iteration. */ > > > > rtx > > -builtin_memset_read_str (void *data, HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > +builtin_memset_read_str (void *data, void *prevp, > > + HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > scalar_int_mode mode) > > { > > + by_pieces_prev *prev = (by_pieces_prev *) prevp; > > + if (prev != nullptr && prev->data != nullptr) > > + { > > + /* Use the previous data in the same mode. 
*/ > > + if (prev->mode == mode) > > + return prev->data; > > + > > + /* Truncate the previous constant integer to a const_wide_int in > > + MODE. */ > > + if (!CONST_SCALAR_INT_P (prev->data)) > > + gcc_unreachable (); > > + > > + rtx_mode_t op = rtx_mode_t (prev->data, prev->mode); > > + wide_int result = wide_int::from (op, > > + GET_MODE_PRECISION (prev->mode), > > + UNSIGNED); > > + return immed_wide_int_const (result, mode); > > Isn't this premature optimization? The original code should arrive at the > very same CONST_INT, no? ...
Removed. > > > + } > > + > > const char *c = (const char *) data; > > char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode)); > > > > @@ -6601,16 +6620,28 @@ builtin_memset_read_str (void *data, HOST_WIDE_INT > > offset ATTRIBUTE_UNUSED, > > /* Callback routine for store_by_pieces. Return the RTL of a register > > containing GET_MODE_SIZE (MODE) consecutive copies of the unsigned > > char value given in the RTL register data. For example, if mode is > > - 4 bytes wide, return the RTL for 0x01010101*data. */ > > + 4 bytes wide, return the RTL for 0x01010101*data. If PREV isn't > > + nullptr, it has the RTL info from the previous iteration. */ > > > > static rtx > > -builtin_memset_gen_str (void *data, HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > +builtin_memset_gen_str (void *data, void *prevp, > > + HOST_WIDE_INT offset ATTRIBUTE_UNUSED, > > scalar_int_mode mode) > > { > > rtx target, coeff; > > size_t size; > > char *p; > > > > + by_pieces_prev *prev = (by_pieces_prev *) prevp; > > + if (prev != nullptr && prev->data != nullptr) > > + { > > + /* Use the previous data in the same mode. */ > > + if (prev->mode == mode) > > + return prev->data; > > + > > + return simplify_gen_subreg (mode, prev->data, prev->mode, 0); > > + } > > + > > ... compared to here where CSE might have difficulties creating the very > same subreg. 
> > > size = GET_MODE_SIZE (mode); > > if (size == 1) > > return (rtx) data; > > diff --git a/gcc/builtins.h b/gcc/builtins.h > > index 307a20fbadb..e71f40c300a 100644 > > --- a/gcc/builtins.h > > +++ b/gcc/builtins.h > > @@ -110,8 +110,10 @@ extern void expand_builtin_update_setjmp_buf (rtx); > > extern tree mathfn_built_in (tree, enum built_in_function fn); > > extern tree mathfn_built_in (tree, combined_fn); > > extern tree mathfn_built_in_type (combined_fn); > > -extern rtx builtin_strncpy_read_str (void *, HOST_WIDE_INT, > > scalar_int_mode); > > -extern rtx builtin_memset_read_str (void *, HOST_WIDE_INT, > > scalar_int_mode); > > +extern rtx builtin_strncpy_read_str (void *, void *, HOST_WIDE_INT, > > + scalar_int_mode); > > +extern rtx builtin_memset_read_str (void *, void *, HOST_WIDE_INT, > > + scalar_int_mode); > > extern rtx expand_builtin_saveregs (void); > > extern tree std_build_builtin_va_list (void); > > extern tree std_fn_abi_va_list (tree); > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > > index adcef1e98bf..68f33f96f5a 100644 > > --- a/gcc/config/i386/i386.c > > +++ b/gcc/config/i386/i386.c > > @@ -23538,6 +23538,9 @@ ix86_run_selftests (void) > > #undef TARGET_ADDRESS_COST > > #define TARGET_ADDRESS_COST ix86_address_cost > > > > +#undef TARGET_OVERLAP_OP_BY_PIECES_P > > +#define TARGET_OVERLAP_OP_BY_PIECES_P hook_bool_void_true > > + > > #undef TARGET_FLAGS_REGNUM > > #define TARGET_FLAGS_REGNUM FLAGS_REG > > #undef TARGET_FIXED_CONDITION_CODE_REGS > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > > index 823f85ba9ab..8631435f271 100644 > > --- a/gcc/doc/tm.texi > > +++ b/gcc/doc/tm.texi > > @@ -6758,6 +6758,13 @@ in code size, for example where the number of insns > > emitted to perform a > > move would be greater than that of a library call. 
> > @end deftypefn > > > > +@deftypefn {Target Hook} bool TARGET_OVERLAP_OP_BY_PIECES_P (void) > > +This target hook should return true if, when the @code{by_pieces} > > +infrastructure is used, an offset adjusted memory operation in the > > .. offset adjusted unaligned memory operation .. > > might be worth noting that the access will be unaligned. Fixed. > > > +smallest integer mode for the last piece operation of a memory region > > +can be generated to avoid doing more than one smaller operation. > > +@end deftypefn > > + > > @deftypefn {Target Hook} int TARGET_COMPARE_BY_PIECES_BRANCH_RATIO > > (machine_mode @var{mode}) > > When expanding a block comparison in MODE, gcc can try to reduce the > > number of branches at the expense of more memory operations. This hook > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > > index 2321a5fc4e0..4ac3452278e 100644 > > --- a/gcc/doc/tm.texi.in > > +++ b/gcc/doc/tm.texi.in > > @@ -4586,6 +4586,8 @@ If you don't define this, a reasonable default is > > used. > > > > @hook TARGET_USE_BY_PIECES_INFRASTRUCTURE_P > > > > +@hook TARGET_OVERLAP_OP_BY_PIECES_P > > + > > @hook TARGET_COMPARE_BY_PIECES_BRANCH_RATIO > > > > @defmac MOVE_MAX_PIECES > > diff --git a/gcc/expr.c b/gcc/expr.c > > index a0e19465965..b5b96ea1185 100644 > > --- a/gcc/expr.c > > +++ b/gcc/expr.c > > @@ -815,12 +815,27 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned > > int align, > > unsigned int max_size, by_pieces_operation op) > > { > > unsigned HOST_WIDE_INT n_insns = 0; > > + scalar_int_mode mode; > > + > > + if (targetm.overlap_op_by_pieces_p () && op != COMPARE_BY_PIECES) > > + { > > + /* NB: Round up L and ALIGN to the widest integer mode for > > + MAX_SIZE. 
*/ > > + mode = widest_int_mode_for_size (max_size); > > + if (optab_handler (mov_optab, mode) != CODE_FOR_nothing) > > + { > > + unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode)); > > + if (up > l) > > + l = up; > > + align = GET_MODE_ALIGNMENT (mode); > > why's this necessary? It looks like alignment_for_piecewise_move will > properly honor slow_unaligned_access which had better be false for > overlap_op_by_pieces Without it, by_pieces_ninsns will return 19, instead of 16, for length 255 on x86-64. > > > + } > > + } > > > > align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align); > > > > while (max_size > 1 && l > 0) > > { > > - scalar_int_mode mode = widest_int_mode_for_size (max_size); > > + mode = widest_int_mode_for_size (max_size); > > enum insn_code icode; > > > > unsigned int modesize = GET_MODE_SIZE (mode); > > @@ -888,7 +903,8 @@ class pieces_addr > > void *m_cfndata; > > public: > > pieces_addr (rtx, bool, by_pieces_constfn, void *); > > - rtx adjust (scalar_int_mode, HOST_WIDE_INT); > > + rtx adjust (scalar_int_mode, HOST_WIDE_INT, > > + by_pieces_prev * = nullptr); > > void increment_address (HOST_WIDE_INT); > > void maybe_predec (HOST_WIDE_INT); > > void maybe_postinc (HOST_WIDE_INT); > > @@ -990,10 +1006,12 @@ pieces_addr::decide_autoinc (machine_mode ARG_UNUSED > > (mode), bool reverse, > > but we still modify the MEM's properties. */ > > > > rtx > > -pieces_addr::adjust (scalar_int_mode mode, HOST_WIDE_INT offset) > > +pieces_addr::adjust (scalar_int_mode mode, HOST_WIDE_INT offset, > > + by_pieces_prev *prev) > > { > > if (m_constfn) > > - return m_constfn (m_cfndata, offset, mode); > > + /* Pass the previous data to m_constfn. 
*/ > > + return m_constfn (m_cfndata, prev, offset, mode); > > if (m_obj == NULL_RTX) > > return NULL_RTX; > > if (m_auto) > > @@ -1041,6 +1059,9 @@ pieces_addr::maybe_postinc (HOST_WIDE_INT size) > > > > class op_by_pieces_d > > { > > + private: > > + scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int); > > + > > protected: > > pieces_addr m_to, m_from; > > unsigned HOST_WIDE_INT m_len; > > @@ -1048,6 +1069,10 @@ class op_by_pieces_d > > unsigned int m_align; > > unsigned int m_max_size; > > bool m_reverse; > > + /* True if this is a stack push. */ > > + bool m_push; > > + /* True if targetm.overlap_op_by_pieces_p () returns true. */ > > + bool m_overlap_op_by_pieces; > > > > /* Virtual functions, overriden by derived classes for the specific > > operation. */ > > @@ -1059,7 +1084,7 @@ class op_by_pieces_d > > > > public: > > op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *, > > - unsigned HOST_WIDE_INT, unsigned int); > > + unsigned HOST_WIDE_INT, unsigned int, bool); > > void run (); > > }; > > > > @@ -1074,10 +1099,11 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool > > to_load, > > by_pieces_constfn from_cfn, > > void *from_cfn_data, > > unsigned HOST_WIDE_INT len, > > - unsigned int align) > > + unsigned int align, bool push) > > : m_to (to, to_load, NULL, NULL), > > m_from (from, from_load, from_cfn, from_cfn_data), > > - m_len (len), m_max_size (MOVE_MAX_PIECES + 1) > > + m_len (len), m_max_size (MOVE_MAX_PIECES + 1), > > + m_push (push) > > { > > int toi = m_to.get_addr_inc (); > > int fromi = m_from.get_addr_inc (); > > @@ -1106,6 +1132,27 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load, > > > > align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align); > > m_align = align; > > + > > + m_overlap_op_by_pieces = targetm.overlap_op_by_pieces_p (); > > +} > > + > > +/* This function returns the largest usable integer mode for LEN bytes > > + whose size is no bigger than size of MODE. 
*/ > > + > > +scalar_int_mode > > +op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len) > > +{ > > + unsigned int size; > > + do > > + { > > + size = GET_MODE_SIZE (mode); > > + if (len >= size && prepare_mode (mode, m_align)) > > + break; > > + /* NB: widest_int_mode_for_size checks SIZE > 1. */ > > + mode = widest_int_mode_for_size (size); > > + } > > + while (1); > > + return mode; > > } > > > > /* This function contains the main loop used for expanding a block > > @@ -1116,42 +1163,80 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool > > to_load, > > void > > op_by_pieces_d::run () > > { > > - while (m_max_size > 1 && m_len > 0) > > + if (m_len == 0) > > + return; > > + > > + /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1. */ > > + scalar_int_mode mode = widest_int_mode_for_size (m_max_size); > > + mode = get_usable_mode (mode, m_len); > > + > > + by_pieces_prev to_prev = { nullptr, mode }; > > + by_pieces_prev from_prev = { nullptr, mode }; > > + > > + do > > splitting out the refactoring from a while to a do-while loop might ease > review. Will do. I will submit the v3 patch set. Thanks. > > Otherwise it looks OK - looks like my memmove concern is moot since > by-pieces isn't used for memmove anyway. > > Richard. > H.J.