For op_by_pieces operations between two areas of memory on targets that do not require strict alignment, add -foverlap-op-by-pieces=[off|on|max-memset] to generate overlapping operations, which minimizes the number of operations. Stack pushes are excluded because they must not overlap.
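As an illustration only, the overlapping idea can be sketched in plain C. The helper name overlap_copy_8_to_16 is hypothetical and is not part of this patch; the sketch assumes a target where unaligned 8-byte accesses are allowed, and it models the generated code rather than the expander itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: copy LEN bytes
   (8 <= LEN <= 16) with two 8-byte moves.  The second move starts at
   LEN - 8, so it overlaps the first one by 16 - LEN bytes.  */
static void
overlap_copy_8_to_16 (char *dst, const char *src, size_t len)
{
  uint64_t head, tail;

  assert (len >= 8 && len <= 16);

  /* Load both pieces first, then store them.  */
  memcpy (&head, src, 8);
  memcpy (&tail, src + len - 8, 8);
  memcpy (dst, &head, 8);
  memcpy (dst + len - 8, &tail, 8);
}
```

For LEN == 15 this uses two 8-byte moves at offsets 0 and 7 instead of four moves of 8, 4, 2 and 1 bytes, which is the pattern gcc.target/i386/pr90773-1.c scans for on x86-64.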
When operating on LENGTH bytes of memory, -foverlap-op-by-pieces=on starts with the widest usable integer size, MAX_SIZE, for LENGTH bytes and finishes with the smallest usable integer size, MIN_SIZE, for the remaining bytes, where MAX_SIZE >= MIN_SIZE. If MIN_SIZE is greater than the number of remaining bytes, the last operation is performed on MIN_SIZE bytes and overlaps memory covered by the previous operation. For memset with a non-zero byte, -foverlap-op-by-pieces=max-memset generates an overlapping fill with MAX_SIZE if the number of remaining bytes is greater than one.

Tested on Linux/x86-64 with -foverlap-op-by-pieces both enabled and disabled by default.

gcc/

	PR middle-end/90773
	* common.opt (-foverlap-op-by-pieces): New.
	* expr.c (by_pieces_ninsns): If -foverlap-op-by-pieces is enabled, round up size and alignment to the widest integer mode for maximum size.
	(op_by_pieces_d): Add get_usable_mode, m_push and m_non_zero_memset.
	(op_by_pieces_d::op_by_pieces_d): Add 2 bool arguments to initialize m_push and m_non_zero_memset.
	(op_by_pieces_d::get_usable_mode): New.
	(op_by_pieces_d::run): Use get_usable_mode to get the largest usable integer mode and generate overlapping operations for -foverlap-op-by-pieces.
	(PUSHG_P): New.
	(move_by_pieces_d::move_by_pieces_d): Update for op_by_pieces_d change.
	(store_by_pieces_d::store_by_pieces_d): Likewise.
	(clear_by_pieces): Likewise.
	* toplev.c (process_options): Issue an error when -foverlap-op-by-pieces is used for strict alignment target.
	* doc/invoke.texi: Document -foverlap-op-by-pieces.

gcc/testsuite/

	PR middle-end/90773
	* g++.dg/pr90773-1.h: New test.
	* g++.dg/pr90773-1a.C: Likewise.
	* g++.dg/pr90773-1b.C: Likewise.
	* g++.dg/pr90773-1c.C: Likewise.
	* g++.dg/pr90773-1d.C: Likewise.
	* gcc.target/i386/pr90773-1.c: Likewise.
	* gcc.target/i386/pr90773-2.c: Likewise.
	* gcc.target/i386/pr90773-3.c: Likewise.
	* gcc.target/i386/pr90773-4.c: Likewise.
	* gcc.target/i386/pr90773-5.c: Likewise.
	* gcc.target/i386/pr90773-6.c: Likewise.
* gcc.target/i386/pr90773-7.c: Likewise. * gcc.target/i386/pr90773-8.c: Likewise. * gcc.target/i386/pr90773-9.c: Likewise. * gcc.target/i386/pr90773-10.c: Likewise. * gcc.target/i386/pr90773-11.c: Likewise. --- gcc/common.opt | 19 +++ gcc/doc/invoke.texi | 14 ++ gcc/expr.c | 159 ++++++++++++++++----- gcc/testsuite/g++.dg/pr90773-1.h | 14 ++ gcc/testsuite/g++.dg/pr90773-1a.C | 13 ++ gcc/testsuite/g++.dg/pr90773-1b.C | 5 + gcc/testsuite/g++.dg/pr90773-1c.C | 5 + gcc/testsuite/g++.dg/pr90773-1d.C | 19 +++ gcc/testsuite/gcc.target/i386/pr90773-1.c | 17 +++ gcc/testsuite/gcc.target/i386/pr90773-10.c | 13 ++ gcc/testsuite/gcc.target/i386/pr90773-11.c | 13 ++ gcc/testsuite/gcc.target/i386/pr90773-2.c | 20 +++ gcc/testsuite/gcc.target/i386/pr90773-3.c | 23 +++ gcc/testsuite/gcc.target/i386/pr90773-4.c | 13 ++ gcc/testsuite/gcc.target/i386/pr90773-5.c | 13 ++ gcc/testsuite/gcc.target/i386/pr90773-6.c | 11 ++ gcc/testsuite/gcc.target/i386/pr90773-7.c | 11 ++ gcc/testsuite/gcc.target/i386/pr90773-8.c | 13 ++ gcc/testsuite/gcc.target/i386/pr90773-9.c | 13 ++ gcc/toplev.c | 8 ++ 20 files changed, 383 insertions(+), 33 deletions(-) create mode 100644 gcc/testsuite/g++.dg/pr90773-1.h create mode 100644 gcc/testsuite/g++.dg/pr90773-1a.C create mode 100644 gcc/testsuite/g++.dg/pr90773-1b.C create mode 100644 gcc/testsuite/g++.dg/pr90773-1c.C create mode 100644 gcc/testsuite/g++.dg/pr90773-1d.C create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-10.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-11.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-5.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-6.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-7.c create mode 100644 
gcc/testsuite/gcc.target/i386/pr90773-8.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-9.c diff --git a/gcc/common.opt b/gcc/common.opt index a75b44ee47e..7f5b38c7810 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -2123,6 +2123,25 @@ foptimize-sibling-calls Common Var(flag_optimize_sibling_calls) Optimization Optimize sibling and tail recursive calls. +foverlap-op-by-pieces +Common RejectNegative Alias(foverlap-op-by-pieces=,on) + +foverlap-op-by-pieces= +Common Joined RejectNegative Enum(overlap_op_by_pieces) Var(flag_overlap_op_by_pieces) Init(0) +-foverlap-op-by-pieces=[off|on|max-memset] Generate overlapping operations between two areas of memory. + +Enum +Name(overlap_op_by_pieces) Type(int) + +EnumValue +Enum(overlap_op_by_pieces) String(off) Value(0) + +EnumValue +Enum(overlap_op_by_pieces) String(on) Value(1) + +EnumValue +Enum(overlap_op_by_pieces) String(max-memset) Value(2) + fpartial-inlining Common Var(flag_partial_inlining) Optimization Perform partial inlining. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index e98b0962b9f..dbdd1095216 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -530,6 +530,7 @@ Objective-C and Objective-C++ Dialects}. -fno-sched-spec -fno-signed-zeros @gol -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol -fomit-frame-pointer -foptimize-sibling-calls @gol +-foverlap-op-by-pieces=@r{[}off@r{|}on@r{|}max-memset@r{]} @gol -fpartial-inlining -fpeel-loops -fpredictive-commoning @gol -fprefetch-loop-arrays @gol -fprofile-correction @gol @@ -10360,6 +10361,19 @@ their @code{_FORTIFY_SOURCE} counterparts into faster alternatives. Enabled at levels @option{-O2}, @option{-O3}. +@item -foverlap-op-by-pieces=@r{[}off@r{|}on@r{|}max-memset@r{]} +@opindex foverlap-op-by-pieces +The value @code{on} tells the compiler to generate overlapping +operations between two areas of memory by using the largest integer +operation to minimize the number of operations, unless it is a stack push. 
+The value @code{max-memset} tells the compiler to generate an +overlapping fill with a non-zero byte in the maximum single fill size +if the last fill size is greater than one. The value @code{off} +turns off this optimization. + +This option is only valid for targets which do not require strict +alignment. + @item -fno-inline @opindex fno-inline @opindex finline diff --git a/gcc/expr.c b/gcc/expr.c index a0e19465965..375a5497309 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -815,12 +815,27 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align, unsigned int max_size, by_pieces_operation op) { unsigned HOST_WIDE_INT n_insns = 0; + scalar_int_mode mode; + + if (flag_overlap_op_by_pieces && op != COMPARE_BY_PIECES) + { + /* NB: Round up L and ALIGN to the widest integer mode for + MAX_SIZE. */ + mode = widest_int_mode_for_size (max_size); + if (optab_handler (mov_optab, mode) != CODE_FOR_nothing) + { + unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode)); + if (up > l) + l = up; + align = GET_MODE_ALIGNMENT (mode); + } + } align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align); while (max_size > 1 && l > 0) { - scalar_int_mode mode = widest_int_mode_for_size (max_size); + mode = widest_int_mode_for_size (max_size); enum insn_code icode; unsigned int modesize = GET_MODE_SIZE (mode); @@ -1041,6 +1056,9 @@ pieces_addr::maybe_postinc (HOST_WIDE_INT size) class op_by_pieces_d { + private: + scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int); + protected: pieces_addr m_to, m_from; unsigned HOST_WIDE_INT m_len; @@ -1048,6 +1066,10 @@ class op_by_pieces_d unsigned int m_align; unsigned int m_max_size; bool m_reverse; + /* True if this is a stack push. */ + bool m_push; + /* True if this is a memset with a non-zero byte. */ + bool m_non_zero_memset; /* Virtual functions, overriden by derived classes for the specific operation. 
*/ @@ -1059,7 +1081,7 @@ class op_by_pieces_d public: op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *, - unsigned HOST_WIDE_INT, unsigned int); + unsigned HOST_WIDE_INT, unsigned int, bool, bool); void run (); }; @@ -1074,10 +1096,12 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load, by_pieces_constfn from_cfn, void *from_cfn_data, unsigned HOST_WIDE_INT len, - unsigned int align) + unsigned int align, bool push, + bool non_zero_memset) : m_to (to, to_load, NULL, NULL), m_from (from, from_load, from_cfn, from_cfn_data), - m_len (len), m_max_size (MOVE_MAX_PIECES + 1) + m_len (len), m_max_size (MOVE_MAX_PIECES + 1), + m_push (push), m_non_zero_memset (non_zero_memset) { int toi = m_to.get_addr_inc (); int fromi = m_from.get_addr_inc (); @@ -1108,6 +1132,25 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load, m_align = align; } +/* This function returns the largest usable integer mode for LEN bytes + whose size is no bigger than size of MODE. */ + +scalar_int_mode +op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len) +{ + unsigned int size; + do + { + size = GET_MODE_SIZE (mode); + if (len >= size && prepare_mode (mode, m_align)) + break; + /* NB: widest_int_mode_for_size checks SIZE > 1. */ + mode = widest_int_mode_for_size (size); + } + while (1); + return mode; +} + /* This function contains the main loop used for expanding a block operation. First move what we can in the largest integer mode, then go to successively smaller modes. For every access, call @@ -1116,42 +1159,80 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load, void op_by_pieces_d::run () { - while (m_max_size > 1 && m_len > 0) + if (m_len == 0) + return; + + /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1. 
*/ + scalar_int_mode mode = widest_int_mode_for_size (m_max_size); + mode = get_usable_mode (mode, m_len); + + do { - scalar_int_mode mode = widest_int_mode_for_size (m_max_size); + unsigned int size = GET_MODE_SIZE (mode); + rtx to1 = NULL_RTX, from1; - if (prepare_mode (mode, m_align)) + while (m_len >= size) { - unsigned int size = GET_MODE_SIZE (mode); - rtx to1 = NULL_RTX, from1; + if (m_reverse) + m_offset -= size; - while (m_len >= size) - { - if (m_reverse) - m_offset -= size; + to1 = m_to.adjust (mode, m_offset); + from1 = m_from.adjust (mode, m_offset); - to1 = m_to.adjust (mode, m_offset); - from1 = m_from.adjust (mode, m_offset); + m_to.maybe_predec (-(HOST_WIDE_INT)size); + m_from.maybe_predec (-(HOST_WIDE_INT)size); - m_to.maybe_predec (-(HOST_WIDE_INT)size); - m_from.maybe_predec (-(HOST_WIDE_INT)size); + generate (to1, from1, mode); - generate (to1, from1, mode); + m_to.maybe_postinc (size); + m_from.maybe_postinc (size); - m_to.maybe_postinc (size); - m_from.maybe_postinc (size); + if (!m_reverse) + m_offset += size; - if (!m_reverse) - m_offset += size; + m_len -= size; + } - m_len -= size; - } + finish_mode (mode); - finish_mode (mode); - } + if (m_len == 0) + return; - m_max_size = GET_MODE_SIZE (mode); + if (!m_push && flag_overlap_op_by_pieces) + { + /* NB: Generate overlapping operations if it is not a stack + push since stack push must not overlap. */ + if (m_len == 1 + || !m_non_zero_memset + || flag_overlap_op_by_pieces < 2) + { + /* If the remaining length is 1, this is not a memset with + a non-zero byte, or max-memset isn't enabled, get the + smallest integer mode for M_LEN bytes. */ + mode = smallest_int_mode_for_size (m_len * BITS_PER_UNIT); + mode = get_usable_mode (mode, GET_MODE_SIZE (mode)); + } + int gap = GET_MODE_SIZE (mode) - m_len; + if (gap > 0) + { + /* If size of MODE > M_LEN, generate the last operation + in MODE for the remaining bytes with overlapping memory + from the previous operation. 
*/ + if (m_reverse) + m_offset += gap; + else + m_offset -= gap; + m_len += gap; + } + } + else + { + /* NB: widest_int_mode_for_size checks SIZE > 1. */ + mode = widest_int_mode_for_size (size); + mode = get_usable_mode (mode, m_len); + } } + while (1); /* The code above should have handled everything. */ gcc_assert (!m_len); @@ -1160,6 +1241,12 @@ op_by_pieces_d::run () /* Derived class from op_by_pieces_d, providing support for block move operations. */ +#ifdef PUSH_ROUNDING +#define PUSHG_P(to) ((to) == nullptr) +#else +#define PUSHG_P(to) false +#endif + class move_by_pieces_d : public op_by_pieces_d { insn_gen_fn m_gen_fun; @@ -1169,7 +1256,8 @@ class move_by_pieces_d : public op_by_pieces_d public: move_by_pieces_d (rtx to, rtx from, unsigned HOST_WIDE_INT len, unsigned int align) - : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align) + : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align, + PUSHG_P (to), false) { } rtx finish_retmode (memop_ret); @@ -1263,8 +1351,10 @@ class store_by_pieces_d : public op_by_pieces_d public: store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data, - unsigned HOST_WIDE_INT len, unsigned int align) - : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len, align) + unsigned HOST_WIDE_INT len, unsigned int align, + bool non_zero_memset) + : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len, + align, false, non_zero_memset) { } rtx finish_retmode (memop_ret); @@ -1411,7 +1501,8 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len, memsetp ? 
SET_BY_PIECES : STORE_BY_PIECES, optimize_insn_for_speed_p ())); - store_by_pieces_d data (to, constfun, constfundata, len, align); + store_by_pieces_d data (to, constfun, constfundata, len, align, + memsetp); data.run (); if (retmode != RETURN_BEGIN) @@ -1438,7 +1529,8 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align) if (len == 0) return; - store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align); + store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align, + false); data.run (); } @@ -1460,7 +1552,8 @@ class compare_by_pieces_d : public op_by_pieces_d compare_by_pieces_d (rtx op0, rtx op1, by_pieces_constfn op1_cfn, void *op1_cfn_data, HOST_WIDE_INT len, int align, rtx_code_label *fail_label) - : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len, align) + : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len, + align, false, false) { m_fail_label = fail_label; } diff --git a/gcc/testsuite/g++.dg/pr90773-1.h b/gcc/testsuite/g++.dg/pr90773-1.h new file mode 100644 index 00000000000..abdb78b078b --- /dev/null +++ b/gcc/testsuite/g++.dg/pr90773-1.h @@ -0,0 +1,14 @@ +class fixed_wide_int_storage { +public: + long val[10]; + int len; + fixed_wide_int_storage () + { + len = sizeof (val) / sizeof (val[0]); + for (int i = 0; i < len; i++) + val[i] = i; + } +}; + +extern void foo (fixed_wide_int_storage); +extern int record_increment(void); diff --git a/gcc/testsuite/g++.dg/pr90773-1a.C b/gcc/testsuite/g++.dg/pr90773-1a.C new file mode 100644 index 00000000000..3ab8d929f74 --- /dev/null +++ b/gcc/testsuite/g++.dg/pr90773-1a.C @@ -0,0 +1,13 @@ +// { dg-do compile } +// { dg-options "-O2" } +// { dg-additional-options "-mno-avx -msse2 -mtune=skylake" { target { i?86-*-* x86_64-*-* } } } + +#include "pr90773-1.h" + +int +record_increment(void) +{ + fixed_wide_int_storage x; + foo (x); + return 0; +} diff --git a/gcc/testsuite/g++.dg/pr90773-1b.C b/gcc/testsuite/g++.dg/pr90773-1b.C new file mode 100644 index 
00000000000..9713b2dd612 --- /dev/null +++ b/gcc/testsuite/g++.dg/pr90773-1b.C @@ -0,0 +1,5 @@ +// { dg-do compile } +// { dg-options "-O2" } +// { dg-additional-options "-mno-avx512f -march=skylake" { target { i?86-*-* x86_64-*-* } } } + +#include "pr90773-1a.C" diff --git a/gcc/testsuite/g++.dg/pr90773-1c.C b/gcc/testsuite/g++.dg/pr90773-1c.C new file mode 100644 index 00000000000..699357a88dc --- /dev/null +++ b/gcc/testsuite/g++.dg/pr90773-1c.C @@ -0,0 +1,5 @@ +// { dg-do compile } +// { dg-options "-O2" } +// { dg-additional-options "-march=skylake-avx512" { target { i?86-*-* x86_64-*-* } } } + +#include "pr90773-1a.C" diff --git a/gcc/testsuite/g++.dg/pr90773-1d.C b/gcc/testsuite/g++.dg/pr90773-1d.C new file mode 100644 index 00000000000..bf9d8543c1b --- /dev/null +++ b/gcc/testsuite/g++.dg/pr90773-1d.C @@ -0,0 +1,19 @@ +// { dg-do run } +// { dg-options "-O2" } +// { dg-additional-options "-march=native" { target { i?86-*-* x86_64-*-* } } } +// { dg-additional-sources "pr90773-1a.C" } + +#include "pr90773-1.h" + +void +foo (fixed_wide_int_storage x) +{ + for (int i = 0; i < x.len; i++) + if (x.val[i] != i) + __builtin_abort (); +} + +int main () +{ + return record_increment (); +} diff --git a/gcc/testsuite/gcc.target/i386/pr90773-1.c b/gcc/testsuite/gcc.target/i386/pr90773-1.c new file mode 100644 index 00000000000..86fec27dad0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-1.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 15); +} + +/* { dg-final { scan-assembler-times "movq\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-times "movq\[\\t \]+7\\(%\[\^,\]+\\)," 1 { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+11\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-10.c b/gcc/testsuite/gcc.target/i386/pr90773-10.c new file mode 100644 index 00000000000..5985877cc10 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-10.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */ + +extern char *dst; + +void +foo (int c) +{ + __builtin_memset (dst, c, 5); +} + +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-11.c b/gcc/testsuite/gcc.target/i386/pr90773-11.c new file mode 100644 index 00000000000..9bf57aa3a44 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-11.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */ + +extern char *dst; + +void +foo (int c) +{ + __builtin_memset (dst, c, 6); +} + +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, 2\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-2.c b/gcc/testsuite/gcc.target/i386/pr90773-2.c new file mode 100644 index 00000000000..ebdf9dac6e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-2.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */ +/* { dg-additional-options "-mno-avx -msse2" { target { ! 
ia32 } } } */ +/* { dg-additional-options "-mno-sse" { target ia32 } } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 19); +} + +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-3.c b/gcc/testsuite/gcc.target/i386/pr90773-3.c new file mode 100644 index 00000000000..d876f878f60 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-3.c @@ -0,0 +1,23 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */ +/* { dg-additional-options "-mno-avx -msse2" { target { ! ia32 } } } */ +/* { dg-additional-options "-mno-sse" { target ia32 } } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 31); +} + +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+16\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+20\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+24\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ +/* { dg-final { scan-assembler-times "movl\[\\t \]+27\\(%\[\^,\]+\\)," 1 { target ia32 } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-4.c b/gcc/testsuite/gcc.target/i386/pr90773-4.c new file mode 100644 index 00000000000..0df1b2fc247 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-4.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" } */ + +extern char *dst; + +void +foo (void) +{ + __builtin_memset (dst, 0, 31); +} + +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 15\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-5.c b/gcc/testsuite/gcc.target/i386/pr90773-5.c new file mode 100644 index 00000000000..65c9fe88696 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-5.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" } */ + +extern char *dst; + +void +foo (void) +{ + __builtin_memset (dst, 0, 21); +} + +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movq\[\\t \]+\\\$0+, 13\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-6.c b/gcc/testsuite/gcc.target/i386/pr90773-6.c new file mode 100644 index 00000000000..0c84d492974 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-6.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" } */ + +void +foo (char *dst, char *src) +{ + __builtin_memcpy (dst, src, 255); +} + +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\[0-9\]*\\(%\[\^,\]+\\)," 16 } } */ +/* { dg-final { scan-assembler-not "mov\[bwlq\]" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-7.c b/gcc/testsuite/gcc.target/i386/pr90773-7.c new file mode 100644 index 00000000000..732b4d3d992 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-7.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=skylake -foverlap-op-by-pieces" } */ + +void +foo (char *dst) +{ + __builtin_memset (dst, 0, 255); +} + +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \[0-9\]*\\(%\[\^,\]+\\)" 16 } } */ +/* { dg-final { scan-assembler-not "mov\[bwlq\]" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-8.c b/gcc/testsuite/gcc.target/i386/pr90773-8.c new file mode 100644 index 00000000000..7ff5ba12daf --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-8.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */ + +extern char *dst; + +void +foo (void) +{ + __builtin_memset (dst, 0, 5); +} + +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-9.c b/gcc/testsuite/gcc.target/i386/pr90773-9.c new file mode 100644 index 00000000000..c2fc3ba59a7 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-9.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */ + +extern char *dst; + +void +foo (void) +{ + __builtin_memset (dst, 0, 6); +} + +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movw\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/toplev.c b/gcc/toplev.c index d8cc254adef..23c88c788a2 100644 --- a/gcc/toplev.c +++ b/gcc/toplev.c @@ -1323,6 +1323,14 @@ process_options (void) } } + if (flag_overlap_op_by_pieces && STRICT_ALIGNMENT) + { + error_at (UNKNOWN_LOCATION, + "%<-foverlap-op-by-pieces%> is not supported for " + "strict alignment target"); + flag_overlap_op_by_pieces = 0; + } + /* One region RA really helps to decrease the code size. */ if (flag_ira_region == IRA_REGION_AUTODETECT) flag_ira_region -- 2.30.2