On Wed, Mar 9, 2022 at 6:43 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > Hi Richard, > Many thanks. Yes, your proposed ao_ref_alignment is exactly what I was > looking for. > Here's the second revision of my patch for PR tree-optimization/98335 that > both uses/ > introduces ao_ref_alignment and more intelligently aligns/trims both head and > tail, > for example handling the case discussed by Richard and Jeff Law, of a 16 > byte-aligned > object where we wish to avoid trimming (just) the last three bytes. It uses > the useful > property that writing N consecutive bytes, typically requires popcount(N) > store > instructions, so we wish to align (if we can) that we begin/end with a store > of N' bytes > where popcount(N') is one, if that isn't already the case. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check with no new failures. Is this revised version ok for mainline?
OK. I wonder if we can also handle bitpos != 0, like if we have a an access 3 bytes into an 8 byte aligned object and want to trim 2 bytes at start then it would be good to trim only 1 byte so the access is then 4 byte aligned (4 bytes into the 8 byte aligned object). Similar, trimming 6 bytes should be reduced to trimming 5 bytes. Richard. > 2022-03-09 Roger Sayle <ro...@nextmovesoftware.com> > Richard Biener <rguent...@suse.de> > > gcc/ChangeLog > PR tree-optimization/98335 > * builtins.cc (get_object_alignment_2): Export. > * builtins.h (get_object_alignment_2): Likewise. > * tree-ssa-alias.cc (ao_ref_alignment): New. > * tree-ssa-alias.h (ao_ref_alignment): Declare. > > * tree-ssa-dse.cc (compute_trims): Improve logic deciding whether > to align head/tail, writing more bytes but using fewer store insns. > (maybe_trim_memstar_call): Silence compiler warnings by using > memset to initialize lendata. > > gcc/testsuite/ChangeLog > PR tree-optimization/98335 > * g++.dg/pr98335.C: New test case. > * gcc.dg/pr86010.c: New test case. > * gcc.dg/pr86010-2.c: New test case. > > Thanks again for your help. > Roger > -- > > > -----Original Message----- > > From: Richard Biener <richard.guent...@gmail.com> > > Sent: 08 March 2022 10:44 > > To: Roger Sayle <ro...@nextmovesoftware.com> > > Cc: GCC Patches <gcc-patches@gcc.gnu.org> > > Subject: Re: [PATCH] PR tree-optimization/98335: Improvements to DSE's > > compute_trims. > > > > On Tue, Mar 8, 2022 at 11:10 AM Richard Biener > > <richard.guent...@gmail.com> wrote: > > > > > > On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle > > <ro...@nextmovesoftware.com> wrote: > > > > > > > > > > > > This patch is the main middle-end piece of a fix for PR > > > > tree-opt/98335, which is a code-quality regression affecting > > > > mainline. The issue occurs in DSE's (dead store elimination's) > > > > compute_trims function that determines where a store to memory can > > > > be trimmed. In the testcase given in the PR, this function notices > > > > that the first byte of a DImode store is dead, and replaces the > > > > 8-byte store at (aligned) offset zero, with a 7-byte store at > > > > (unaligned) offset one. Most architectures can store a power-of-two > > > > bytes (up to a maximum) in single instruction, so writing 7 bytes > > > > requires more instructions than writing 8 bytes. This patch follows > > > > Jakub > > Jelinek's suggestion in comment 5, that compute_trims needs improved > > heuristics. > > > > > > > > In this patch, decision of whether and how to align trim_head is > > > > based on the number of bytes being written, the alignment of the > > > > start of the object and where within the object the first byte is > > > > written. The first tests check whether we're already writing to the > > > > start of the object, and that we're writing three or more bytes. If > > > > we're only writing one or two bytes, there's no benefit from providing > > additional alignment. > > > > Then we determine the alignment of the object, which is either 1, 2, > > > > 4, 8 or 16 byte aligned (capping at 16 guarantees that we never > > > > write more than 7 bytes beyond the minimum required). If the buffer > > > > is only > > > > 1 or 2 byte aligned there's no benefit from additional alignment. > > > > For the remaining cases, alignment of trim_head is based upon where > > > > within each aligned block (word) the first byte is written. For > > > > example, storing the last byte (or last half-word) of a word can be > > > > performed with a single insn. > > > > > > > > On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from: > > > > > > > > movl $0, -24(%rsp) > > > > movabsq $72057594037927935, %rdx > > > > movl $0, -21(%rsp) > > > > andq -24(%rsp), %rdx > > > > movq %rdx, %rax > > > > salq $8, %rax > > > > movb c(%rip), %al > > > > ret > > > > > > > > to > > > > > > > > xorl %eax, %eax > > > > movb c(%rip), %al > > > > ret > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > > bootstrap and make -k check with no new failures. I've also added > > > > new testcases for the original motivating PR > > > > tree-optimization/86010, to ensure that those remain optimized (in > > > > future). > > Ok for mainline? > > > > > > diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc index > > > 2b22a61..080e406 100644 > > > --- a/gcc/tree-ssa-dse.cc > > > +++ b/gcc/tree-ssa-dse.cc > > > @@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int > > > *trim_head, int *trim_tail, > > > int first_live = bitmap_first_set_bit (live); > > > *trim_head = first_live - first_orig; > > > > > > - /* If more than a word remains, then make sure to keep the > > > - starting point at least word aligned. */ > > > - if (last_live - first_live > UNITS_PER_WORD) > > > - *trim_head &= ~(UNITS_PER_WORD - 1); > > > + /* If REF is aligned, try to maintain this alignment if it reduces > > > + the number of (power-of-two sized aligned) writes to memory. > > > + First check that we're writing >= 3 bytes at a non-zero offset. > > > + */ if (first_live > > > + && last_live - first_live >= 2) > > > + { > > > + unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base)); > > > > > > you can't simply use TYPE_ALIGN_* on ref->base. You can use > > > get_object_alignment on ref->ref, but ref->ref can be NULL in case the > > > ref was initialized from a builtin call like memcpy. > > > > > > Also ref->base is offsetted by ref->offset which you don't seem to > > > account for. In theory one could export get_object_alignment_2 and if > > > ref->ref is NULL, use that on ref->base, passing addr_p = true, and > > > then adjust the resulting bitpos by ref->offset and fix align > > > accordingly (trimming might also align an access if the original > > > access was offsetted from known alignment). > > > > > > That said, a helper like ao_ref_alignment () might be useful here. > > > > Like the attached - free feel to use that. > > > > Richard. > > > > > > > > I wonder if we can apply good heuristics to compute_trims without > > > taking into account context, like maybe_trimp_complex_store is already > > > limiting itself to useful subsets and the constructor and memstar > > > cases will only benefit if they end up being expanded inline via > > > *_by_pieces, not if expanded as a call. > > > > > > You don't seem to adjust *trim_tail at all, if an aligned 16 byte > > > region is trimmed there by 3 that will result in two extra stores as > > > well, no? > > > > >