On February 21, 2018 11:28:36 PM GMT+01:00, Jakub Jelinek <ja...@redhat.com> wrote:
> Hi!
>
> The long comment above the new check_no_overlap function below should
> hopefully explain the problem.  coalesce_immediate_stores is splitting the
> stores from the same base into groups that we are going to optimize
> individually; for each successfully merged group we emit new stores at the
> location of the last store from that group.  And coalesce_immediate_stores
> processes stores ordered primarily by increasing bit position and only
> secondarily by the original ordering in the basic block.
> On the following testcases, we have the unfortunate ordering of:
>   MEM[(long long int *)p_28] = 0;
>   MEM[(long long int *)p_28 + 8B] = 0;
>   MEM[(long long int *)p_28 + 16B] = 0;
>   MEM[(long long int *)p_28 + 24B] = 0;
>   _129 = (int) _130;
>   MEM[(int *)p_28 + 8B] = _129;
>   MEM[(int *)p_28].a = -1;
> We already have
>   MEM[(long long int *)p_28] = 0;
>   MEM[(int *)p_28].a = -1;
> in the first group (which is fine) and the following 2 stores to consider
> are
>   MEM[(long long int *)p_28 + 8B] = 0;
> and
>   MEM[(int *)p_28 + 8B] = _129;
> We add the first store to the group, which has the effect of emitting the
> merged stores at the location of the former MEM[(int *)p_28].a = -1;
> store.  Then we process MEM[(int *)p_28 + 8B] = _129; (which was handled
> just as a potential bswap possibility, and as it overlaps with the previous
> store and can't be merged with the next one either, it will be its own
> group and thus reuse the original stmt).  The net effect is that we've
> reordered the MEM[(long long int *)p_28 + 8B] = 0; store from before the
> MEM[(int *)p_28 + 8B] = _129; store to after it, but those two do alias.
>
> Instead of detecting this situation late, where we could just throw away
> the whole group (which would mean we couldn't merge even the
> MEM[(long long int *)p_28] = 0; and MEM[(int *)p_28].a = -1; stores),
> this patch attempts to check, before calling merge_into, whether the store
> would overlap with later stores we won't be able to emit in the same
> group, and if it does, it terminates the group right away.
>
> I also had to fix the merged_store_group::merge_into width computation;
> it was mistakenly just counting the sum of the bits we've added to the
> group, but we really need the distance between the first bit in the group
> and the last bit, including any gaps in between, but excluding gaps before
> the start and after the end (still within the bitregion).
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
OK.

Richard.

> For 7.x I think we'll need something along the lines of PR82916 instead;
> while the same testcase fails there, we only handle stores whose lhs is
> a constant, and so the problem is that we aren't terminating the chain
> when we should on the variable store.
>
> 2018-02-21  Jakub Jelinek  <ja...@redhat.com>
>
> 	PR tree-optimization/84503
> 	* gimple-ssa-store-merging.c (merged_store_group::merge_into):
> 	Compute width as info->bitpos + info->bitsize - start.
> 	(merged_store_group::merge_overlapping): Simplify width computation.
> 	(check_no_overlap): New function.
> 	(imm_store_chain_info::try_coalesce_bswap): Compute expected
> 	start + width and last_order of the group, fail if check_no_overlap
> 	fails.
> 	(imm_store_chain_info::coalesce_immediate_stores): Don't merge info
> 	to group if check_no_overlap fails.
>
> 	* gcc.dg/pr84503-1.c: New test.
> 	* gcc.dg/pr84503-2.c: New test.
>
> --- gcc/gimple-ssa-store-merging.c.jj	2018-01-16 09:52:26.230235131 +0100
> +++ gcc/gimple-ssa-store-merging.c	2018-02-21 20:13:45.239129599 +0100
> @@ -1891,12 +1891,11 @@ merged_store_group::do_merge (store_imme
>  void
>  merged_store_group::merge_into (store_immediate_info *info)
>  {
> -  unsigned HOST_WIDE_INT wid = info->bitsize;
>    /* Make sure we're inserting in the position we think we're inserting.  */
>    gcc_assert (info->bitpos >= start + width
>  	      && info->bitregion_start <= bitregion_end);
>  
> -  width += wid;
> +  width = info->bitpos + info->bitsize - start;
>    do_merge (info);
>  }
>  
> @@ -1909,7 +1908,7 @@ merged_store_group::merge_overlapping (s
>  {
>    /* If the store extends the size of the group, extend the width.  */
>    if (info->bitpos + info->bitsize > start + width)
> -    width += info->bitpos + info->bitsize - (start + width);
> +    width = info->bitpos + info->bitsize - start;
>  
>    do_merge (info);
>  }
>  
> @@ -2304,6 +2303,55 @@ gather_bswap_load_refs (vec<tree> *refs,
>      }
>  }
>  
> +/* Check if there are any stores in M_STORE_INFO after index I
> +   (where M_STORE_INFO must be sorted by sort_by_bitpos) that overlap
> +   a potential group ending with END that have their order
> +   smaller than LAST_ORDER.  RHS_CODE is the kind of store in the
> +   group.  Return true if there are no such stores.
> +   Consider:
> +     MEM[(long long int *)p_28] = 0;
> +     MEM[(long long int *)p_28 + 8B] = 0;
> +     MEM[(long long int *)p_28 + 16B] = 0;
> +     MEM[(long long int *)p_28 + 24B] = 0;
> +     _129 = (int) _130;
> +     MEM[(int *)p_28 + 8B] = _129;
> +     MEM[(int *)p_28].a = -1;
> +   We already have
> +     MEM[(long long int *)p_28] = 0;
> +     MEM[(int *)p_28].a = -1;
> +   stmts in the current group and need to consider if it is safe to
> +   add MEM[(long long int *)p_28 + 8B] = 0; store into the same group.
> +   There is an overlap between that store and the
> +   MEM[(int *)p_28 + 8B] = _129; store though, so if we add the
> +   MEM[(long long int *)p_28 + 8B] = 0; into the group and merging of
> +   those 3 stores is successful, merged stmts will be emitted at the
> +   latest store from that group, i.e. LAST_ORDER, which is the
> +   MEM[(int *)p_28].a = -1; store.
> +   The MEM[(int *)p_28 + 8B] = _129; store that originally follows
> +   the MEM[(long long int *)p_28 + 8B] = 0; would now be before it,
> +   so we need to refuse merging MEM[(long long int *)p_28 + 8B] = 0;
> +   into the group.  That way it will be its own store group and will
> +   not be touched.  If RHS_CODE is INTEGER_CST and there are overlapping
> +   INTEGER_CST stores, those are mergeable using merge_overlapping,
> +   so don't return false for those.  */
> +
> +static bool
> +check_no_overlap (vec<store_immediate_info *> m_store_info, unsigned int i,
> +		  enum tree_code rhs_code, unsigned int last_order,
> +		  unsigned HOST_WIDE_INT end)
> +{
> +  unsigned int len = m_store_info.length ();
> +  for (++i; i < len; ++i)
> +    {
> +      store_immediate_info *info = m_store_info[i];
> +      if (info->bitpos >= end)
> +	break;
> +      if (info->order < last_order
> +	  && (rhs_code != INTEGER_CST || info->rhs_code != INTEGER_CST))
> +	return false;
> +    }
> +  return true;
> +}
> +
>  /* Return true if m_store_info[first] and at least one following store
>     form a group which store try_size bitsize value which is byte swapped
>     from a memory load or some value, or identity from some value.
> @@ -2375,6 +2423,7 @@ imm_store_chain_info::try_coalesce_bswap
>    unsigned int last_order = merged_store->last_order;
>    gimple *first_stmt = merged_store->first_stmt;
>    gimple *last_stmt = merged_store->last_stmt;
> +  unsigned HOST_WIDE_INT end = merged_store->start + merged_store->width;
>    store_immediate_info *infof = m_store_info[first];
>  
>    for (unsigned int i = first; i <= last; ++i)
> @@ -2413,25 +2462,23 @@ imm_store_chain_info::try_coalesce_bswap
>  	}
>        else
>  	{
> -	  if (n.base_addr)
> +	  if (n.base_addr && n.vuse != this_n.vuse)
>  	    {
> -	      if (n.vuse != this_n.vuse)
> -		{
> -		  if (vuse_store == 0)
> -		    return false;
> -		  vuse_store = 1;
> -		}
> -	      if (info->order > last_order)
> -		{
> -		  last_order = info->order;
> -		  last_stmt = info->stmt;
> -		}
> -	      else if (info->order < first_order)
> -		{
> -		  first_order = info->order;
> -		  first_stmt = info->stmt;
> -		}
> +	      if (vuse_store == 0)
> +		return false;
> +	      vuse_store = 1;
>  	    }
> +	  if (info->order > last_order)
> +	    {
> +	      last_order = info->order;
> +	      last_stmt = info->stmt;
> +	    }
> +	  else if (info->order < first_order)
> +	    {
> +	      first_order = info->order;
> +	      first_stmt = info->stmt;
> +	    }
> +	  end = MAX (end, info->bitpos + info->bitsize);
>  
>  	  ins_stmt = perform_symbolic_merge (ins_stmt, &n, info->ins_stmt,
>  					     &this_n, &n);
> @@ -2452,6 +2499,9 @@ imm_store_chain_info::try_coalesce_bswap
>    if (n.base_addr == NULL_TREE && !is_gimple_val (n.src))
>      return false;
>  
> +  if (!check_no_overlap (m_store_info, last, LROTATE_EXPR, last_order, end))
> +    return false;
> +
>    /* Don't handle memory copy this way if normal non-bswap processing
>       would handle it too.  */
>    if (n.n == cmpnop && (unsigned) n.n_ops == last - first + 1)
> @@ -2633,7 +2683,13 @@ imm_store_chain_info::coalesce_immediate
>  		   : !info->ops[0].base_addr)
>  	       && (infof->ops[1].base_addr
>  		   ? compatible_load_p (merged_store, info, base_addr, 1)
> -		   : !info->ops[1].base_addr))
> +		   : !info->ops[1].base_addr)
> +	       && check_no_overlap (m_store_info, i, info->rhs_code,
> +				    MAX (merged_store->last_order,
> +					 info->order),
> +				    MAX (merged_store->start
> +					 + merged_store->width,
> +					 info->bitpos + info->bitsize)))
>  	    {
>  	      merged_store->merge_into (info);
>  	      continue;
> --- gcc/testsuite/gcc.dg/pr84503-1.c.jj	2018-02-21 20:36:33.474226039 +0100
> +++ gcc/testsuite/gcc.dg/pr84503-1.c	2018-02-21 20:20:53.144165116 +0100
> @@ -0,0 +1,68 @@
> +/* PR tree-optimization/84503 */
> +/* { dg-do run } */
> +/* { dg-options "-O3" } */
> +
> +typedef __SIZE_TYPE__ size_t;
> +typedef __UINTPTR_TYPE__ uintptr_t;
> +
> +struct S { int a; unsigned short b; int c, d, e; long f, g, h; int i, j; };
> +static struct S *k;
> +static size_t l = 0;
> +int m;
> +
> +static int
> +bar (void)
> +{
> +  unsigned i;
> +  int j;
> +  if (k[0].c == 0)
> +    {
> +      ++m;
> +      size_t n = l * 2;
> +      struct S *o;
> +      o = (struct S *) __builtin_realloc (k, sizeof (struct S) * n);
> +      if (!o)
> +	__builtin_exit (0);
> +      k = o;
> +      for (i = l; i < n; i++)
> +	{
> +	  void *p = (void *) &k[i];
> +	  int q = 0;
> +	  size_t r = sizeof (struct S);
> +	  if ((((uintptr_t) p) % __alignof__ (long)) == 0
> +	      && r % sizeof (long) == 0)
> +	    {
> +	      long __attribute__ ((may_alias)) *s = (long *) p;
> +	      long *t = (long *) ((char *) s + r);
> +	      while (s < t)
> +		*s++ = 0;
> +	    }
> +	  else
> +	    __builtin_memset (p, q, r);
> +	  k[i].c = i + 1;
> +	  k[i].a = -1;
> +	}
> +      k[n - 1].c = 0;
> +      k[0].c = l;
> +      l = n;
> +    }
> +  j = k[0].c;
> +  k[0].c = k[j].c;
> +  return j;
> +}
> +
> +int
> +main ()
> +{
> +  k = (struct S *) __builtin_malloc (sizeof (struct S));
> +  if (!k)
> +    __builtin_exit (0);
> +  __builtin_memset (k, '\0', sizeof (struct S));
> +  k->a = -1;
> +  l = 1;
> +  for (int i = 0; i < 15; ++i)
> +    bar ();
> +  if (m != 4)
> +    __builtin_abort ();
> +  return 0;
> +}
> --- gcc/testsuite/gcc.dg/pr84503-2.c.jj	2018-02-21 20:36:41.874226585 +0100
> +++ gcc/testsuite/gcc.dg/pr84503-2.c	2018-02-21 20:36:59.875227757 +0100
> @@ -0,0 +1,5 @@
> +/* PR tree-optimization/84503 */
> +/* { dg-do run } */
> +/* { dg-options "-O3 -fno-tree-vectorize -fno-ivopts" } */
> +
> +#include "pr84503-1.c"
> 
> 	Jakub